Autri — Functional MVP Spec

Created 2026-05-15. Scopes the Functional MVP — the smallest version of Autri that validates "is this product worth shipping?" through dogfooding by Dan and a hand-onboarded test cohort (STEM Racing kids + Dan's dad). Local-only, no commerce. The Deployed MVP (auth, onboarding for strangers, AWS deploy, Stripe, public landing) is a separate doc, deferred until Functional MVP validates.

Refined iteratively; once each H2 is tight, epic docs spin out per item.

The 5 MVP items

Each H2 below gets refined (design questions, dependencies, implementation direction) before spinning out to its own epic doc.

Goal

Validate the product locally — to Dan first, then to a hand-onboarded test cohort — before any commerce wrap. The Brehob deal outcome (post-Andy-meeting 2026-05-15) funds the commerce wrap if it lands; either way, Functional MVP comes first. A chargeable shell wrapped around a mediocre product is worse than a great product without billing.

Two MVPs

Autri ships in two distinct MVPs stacked on each other:

	Functional MVP (this doc)	Deployed MVP (separate doc, deferred)
Question it answers	Is this product worth shipping?	Can we charge for it?
Audience	Dan + hand-onboarded test cohort	First paying customers
Validation surface	Local dev environment	Production (`autri.ai`)
In scope	The 5 items below + visual QA via inspector	Auth, RLS enforcement, sample KB / onboarding for strangers, AWS-native deploy, Bedrock prod path, Stripe + tier enforcement, public landing
Test corpora	STEM Racing PDFs + Dad's QuoteAI corpus	Add: real customer corpora as they sign on

Current state — what's already built

Ingestion (D10/D11):

PDF text-layer first, vision fallback for figure-heavy/sparse pages
Operations-based extractor (LLM semantics, code mechanics) — verbatim content + bbox by construction
Haiku default, Sonnet retry button per-doc
Continue-on-error per-file

Retrieval (D4/D14):

Three KB-scoped primitives: lookup_section, fts_search, vector_search (in @autri/retrieval)
Agentic router via MCP (stdio for dev): @autri/mcp-doc-search
retrieval_log table powers source-of-result attribution

UI:

/docs — flat doc list with confidence tier chips
/docs/[id] — inspector with bbox overlays + parsed text
/docs/[id]/query — query playground with color-coded source-of-result traces
/kb — KB list
/kb/[id] — doc list within a KB
/kb/[id]/chat — chat surface with markdown answers, source chips, bbox preview

Schema (D13/D14/D15):

Multi-tenancy FK chain
Logical-documents + supersession + default-latest filter
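
As a rough illustration of the default-latest filter, the sketch below keeps only versions that nothing has superseded. The `DocVersion` shape and the `supersededBy` field are assumptions made for this example; the actual D13/D14/D15 schema is not reproduced in this doc and may differ.

```typescript
// Hypothetical row shape, for illustration only.
interface DocVersion {
  id: string;
  logicalDocumentId: string;
  version: number;
  supersededBy: string | null; // id of the newer version, if any
}

// Default-latest filter: drop every version that a newer one has superseded.
function latestOnly(rows: DocVersion[]): DocVersion[] {
  return rows.filter((row) => row.supersededBy === null);
}

const versionRows: DocVersion[] = [
  { id: "d1", logicalDocumentId: "tech-regs", version: 1, supersededBy: "d2" },
  { id: "d2", logicalDocumentId: "tech-regs", version: 2, supersededBy: null },
  { id: "d3", logicalDocumentId: "comp-regs", version: 1, supersededBy: null },
];
const latestIds = latestOnly(versionRows).map((row) => row.id);
```

Retrieval would apply this filter by default and skip it only when a caller explicitly asks for an older version.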

Pilot state: 1272 chunks across 2 STEM Racing PDF docs. Default org Hannah Labs, default KB STEM Racing Charlotte.

Test cohort and corpora

Two cohorts, two corpus shapes — chosen for maximum surface coverage with zero new-acquisition cost. Compounds across QuoteAI (Dad's corpus is already ingested there) and Autri.

STEM Racing kids (Charlotte team):

Corpus: World Finals Technical + Competition Regulations (already ingested — 1272 chunks)
Content shape: figure-heavy PDFs, technical regulations
Query pattern: rule lookup + creative-interpretation ("can we use X mod given C7.6.2?")
Tests: PDF path + figure handling + rule-lookup retrieval

Dad's QuoteAI corpus (Brehob quote history):

Subset (a): structured quote spreadsheets (XLSX) — past-quote line-items in QuoteAI's format
Subset (b): raw past-quote PDFs — the originals before being spreadsheet-ified
Content shape: business documents, table-heavy, figure-light
Query pattern: "what did we quote for similar scenarios?" semantic search across past work
Tests: XLSX path + PDF path on totally different content type than STEM (cross-product compound)

Together: two real users with two distinct content shapes, four source-type exercises (PDF technical, PDF business, XLSX structured, eventually DOCX as item 3 expands).

1. UI/UX flow

Problem. A lot of functionality exists (inspector, chat, KB nav, query playground), but the UX flow doesn't yet tell a real user "the system is working and I trust it." Today, ingestion is CLI-only, navigation is a flat doc list, and retrieval traces surface only on the playground page (not in chat).

MVP needs:

Chat as the homepage.

Familiar webchat pattern (Claude-web-chat feel). Center column for chat; right rail for sources (initial lean — iterate until it feels right).
Color-coded retrieval traces surfaced inline with each source — the differentiator can't live only on the playground page.
Multi-turn chat history threaded into the router.
Inline [N] citations rendered in the assistant answer body.
Empty state: if user has zero KBs, redirect to the KB upload page (or block chat input with a "create a KB first" prompt).
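
One way the inline [N] citations could be resolved before rendering is sketched below. It assumes the router returns a 1-based ordered source list alongside the answer text; `ChatSource` and `resolveCitations` are hypothetical names, not existing Autri code.

```typescript
interface ChatSource {
  chunkId: string;
  title: string;
}

interface CitationResult {
  cited: ChatSource[]; // sources referenced by at least one [N] marker, first-seen order
  dangling: number[];  // marker numbers that point at no source
}

// Scan the answer for [N] markers and map each distinct N onto the source list.
function resolveCitations(answer: string, sources: ChatSource[]): CitationResult {
  const result: CitationResult = { cited: [], dangling: [] };
  const seen: number[] = [];
  const marker = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = marker.exec(answer)) !== null) {
    const n = Number(match[1]);
    if (seen.indexOf(n) !== -1) continue; // each marker number is resolved once
    seen.push(n);
    if (n >= 1 && n <= sources.length) result.cited.push(sources[n - 1]);
    else result.dangling.push(n);
  }
  return result;
}

const demoSources: ChatSource[] = [
  { chunkId: "c-101", title: "Technical Regulations" },
  { chunkId: "c-202", title: "Competition Regulations" },
];
const demoCitations = resolveCitations(
  "Allowed under the rule [1], see also [2] and [1]. Unclear: [5].",
  demoSources
);
```

Tracking dangling markers separately gives the UI a way to flag an answer that cites a source the router never returned, rather than rendering a dead chip.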

Knowledge base management view.

List existing KBs, create new KB, delete KB (no per-KB ACL UI in Functional MVP — that's Deployed).
Click into a KB → high-level source-doc view (today's /kb/[id], polished).

Ingestion pipeline view (the cool one).

Pipeline-style flow visualization with stages: File upload → Ingestion → Agent validation → Human review → Ready
Pattern reference: QuoteAI's streaming-checklist for quote generation — gives the user confidence the system is doing real work.
Status doesn't need to be real-time-streaming (polling is fine); a status bar that updates per stage is enough.
Per-file failure visibility — when one doc fails (vision timeout, parse error, schema validation failure), show why without blowing up the batch.
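
The stage flow and per-file failure handling above could be modeled roughly as follows. This is a sketch under assumptions: the stage identifiers, `FileProgress`, and the helper names are made up for illustration, and a polling endpoint would simply return the current `FileProgress` list.

```typescript
const PIPELINE_STAGES = ["file_upload", "ingestion", "agent_validation", "human_review", "ready"] as const;
type PipelineStage = (typeof PIPELINE_STAGES)[number];

interface FileProgress {
  fileName: string;
  stage: PipelineStage;
  failure: string | null; // reason the file failed at `stage`, if it did
}

// Move one file to the next stage; failed or finished files stay where they are.
function advanceStage(file: FileProgress): FileProgress {
  if (file.failure !== null) return file;
  const next = PIPELINE_STAGES[PIPELINE_STAGES.indexOf(file.stage) + 1];
  if (next === undefined) return file; // already at "ready"
  return { fileName: file.fileName, stage: next, failure: null };
}

function markFailed(file: FileProgress, reason: string): FileProgress {
  return { fileName: file.fileName, stage: file.stage, failure: reason };
}

interface BatchSummary {
  ready: number;
  failed: number;
  inFlight: number;
}

// One failed file is counted, not thrown, so the rest of the batch keeps going.
function summarizeBatch(batch: FileProgress[]): BatchSummary {
  const summary: BatchSummary = { ready: 0, failed: 0, inFlight: 0 };
  for (const file of batch) {
    if (file.failure !== null) summary.failed++;
    else if (file.stage === "ready") summary.ready++;
    else summary.inFlight++;
  }
  return summary;
}

let regsFile: FileProgress = { fileName: "regs.pdf", stage: "file_upload", failure: null };
let scanFile: FileProgress = { fileName: "scan.pdf", stage: "file_upload", failure: null };
scanFile = markFailed(advanceStage(scanFile), "vision timeout"); // fails during ingestion
for (let step = 0; step < 4; step++) {
  regsFile = advanceStage(regsFile);
  scanFile = advanceStage(scanFile);
}
const batchSummary = summarizeBatch([regsFile, scanFile]);
```

Keeping the failure reason on the file record, rather than raising it, is what lets the view show "why" per document while the status bar for the rest of the batch continues to update.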

Source-doc drill-in.

Today's /docs/[id] inspector is the core surface; the MVP version is the same inspector, polished.
Need: navigate within a doc (page-by-page, section-by-section).

Retrieval trace surface inside chat.

The color-coded "which index returned this chunk" visualization is a first-class part of the chat UX, not a separate playground page. Trust comes from legibility.
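
A minimal sketch of how the chat surface might derive per-chunk trace colors from retrieval_log rows. The row shape and the palette are assumptions for illustration (real colors belong to the visual design track); only the three primitive names come from this doc.

```typescript
type RetrievalPrimitive = "lookup_section" | "fts_search" | "vector_search";

// Hypothetical projection of a retrieval_log row.
interface RetrievalLogRow {
  chunkId: string;
  primitive: RetrievalPrimitive;
}

// Placeholder palette, not a design decision.
const TRACE_COLORS: Record<RetrievalPrimitive, string> = {
  lookup_section: "blue",
  fts_search: "amber",
  vector_search: "violet",
};

// For each chunk, list the distinct colors of the primitives that returned it.
function traceColorsByChunk(rows: RetrievalLogRow[]): Record<string, string[]> {
  const byChunk: Record<string, string[]> = {};
  for (const row of rows) {
    const color = TRACE_COLORS[row.primitive];
    const colors = byChunk[row.chunkId] || (byChunk[row.chunkId] = []);
    if (colors.indexOf(color) === -1) colors.push(color);
  }
  return byChunk;
}

const demoTraces = traceColorsByChunk([
  { chunkId: "c-7", primitive: "fts_search" },
  { chunkId: "c-7", primitive: "vector_search" },
  { chunkId: "c-9", primitive: "lookup_section" },
  { chunkId: "c-7", primitive: "fts_search" },
]);
```

A chunk returned by more than one primitive carries more than one color, which is itself a useful trust signal to show next to the source chip.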

Empty states + failure UX.

No KBs yet → routed to upload page
KB has no docs → empty state with "Upload your first document" CTA
Ingestion failures → per-doc error display, retry, partial-success
LLM returns invalid JSON → automatic Sonnet retry, fall through to "needs human review" tier

Branding/visual design: Claude Design handles the visual layer (typography, color, layout polish) in parallel.

Design questions to refine:

(a) Right-rail sources or different layout? Initial lean is right-rail (matches QuoteAI's rail pattern, keeps chat as focal column). Iterate until it feels right.
(b) Async ingestion: poll or SSE? Probably polling for v1 (one less moving part); upgrade to SSE if it feels slow.
(c) Bulk upload of related docs. If user drops 50 files, treat as 50 separate docs or one logical doc with 50 sections? Probably 50 separate docs.

Dependencies: none — parallelizable with the extraction spike. Out of scope (Deployed MVP or later): onboarding for strangers, sample KB / try-without-uploading, per-KB ACLs, mobile responsiveness.

2. MCP over SSE + OAuth

Problem. Current MCP is stdio (@autri/mcp-doc-search). Stdio is local-only — won't survive deploy. The strategic positioning ("be the MCP server they consume") demands hosted SSE. Per D5 pruned-note: stdio MCP is planned for retirement in favor of SSE+OAuth.

MVP needs:

SSE transport for the doc-search server (retire stdio)
OAuth 2.0 flow for token issuance — user clicks "Connect Claude / Cursor / Whatever" → consent → token issued
Token scope model: per-token list of allowed KB IDs + allowed tools
Token management UI in-app: list active, revoke, re-scope
Lift E12 wholesale from Foundry — already designed there
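
The token scope model could be enforced with a check along these lines before any tool call reaches retrieval. This is a sketch, not the E12 design: `McpTokenScope`, `authorizeToolCall`, and the KB ids are hypothetical.

```typescript
interface McpTokenScope {
  allowedKbIds: string[];
  allowedTools: string[];
}

type ScopeDecision = { allowed: true } | { allowed: false; reason: string };

// Deny unless both the tool and the target KB are explicitly listed on the token.
function authorizeToolCall(scope: McpTokenScope, tool: string, kbId: string): ScopeDecision {
  if (scope.allowedTools.indexOf(tool) === -1) {
    return { allowed: false, reason: "tool not in token scope: " + tool };
  }
  if (scope.allowedKbIds.indexOf(kbId) === -1) {
    return { allowed: false, reason: "KB not in token scope: " + kbId };
  }
  return { allowed: true };
}

const stemOnlyScope: McpTokenScope = {
  allowedKbIds: ["kb-stem-racing"],
  allowedTools: ["fts_search", "vector_search", "lookup_section"],
};
const okCall = authorizeToolCall(stemOnlyScope, "fts_search", "kb-stem-racing");
const deniedKb = authorizeToolCall(stemOnlyScope, "fts_search", "kb-quotes");
const deniedTool = authorizeToolCall(stemOnlyScope, "delete_document", "kb-stem-racing");
```

Returning a reason with each denial gives the token management UI something concrete to show when a connected client hits a scope boundary.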

Design questions to refine:

(a) Local dev OAuth: real Cognito dev pool, or stub? Lean: real Cognito dev pool — normalizes the auth pathway from day one (~1-2 day setup), no two-pathway drift. Reuses primitive between Functional and Deployed MVP.
(b) Tool surface on the MCP server. Add list_knowledge_bases and list_documents to support D17 (hybrid agent + KB selection). Figure access is v1.5.
(c) Per-tool authorization. Read-only vs. write tokens? Probably yes — defaults to read-only.

Dependencies: local OAuth setup (real Cognito or stub) for issuance. Out of scope: rate-limiting per-token (Deployed MVP), audit log surface, OAuth client registration UI.

3. Multi-doc-type extraction (spike-and-iterate)

Problem. Today's extractor handles PDFs only (text-layer first, vision fallback). DOCX/XLSX/MD need different parsing approaches. Figures/diagrams in PDFs are bbox-overlaid but lack semantic content.

Approach: spike first, design after. Per process.md: "Make design decisions during implementation as they surface — that's when the real constraints are visible." Trying to design the perfect agent.md schema before we've shipped DOCX is exactly the kind of design-before-real-constraints the methodology warns against.

Spike plan:

Refactor current PDF extractor into extractors/pdf/ — defines what a doc-type extractor IS structurally.
Build DOCX as the second type — validates that the abstraction holds.
Iterate against the test corpora (STEM Racing + Dad's quotes) — what works, what doesn't, what cross-type abstraction emerges.
Once two types ship, write up the abstraction that actually emerged (not the one we guessed). That writeup becomes the proper design doc.
Then expand to Markdown/plain text and XLSX.
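
Purely as a starting point for the spike, and explicitly not a pre-decided design, the "what a doc-type extractor IS" question might begin from something as small as the sketch below. Every name here (`DocTypeExtractor`, `ExtractedChunk`, `pickExtractor`) is a guess made for illustration; the spike is expected to replace this shape with whatever actually emerges.

```typescript
interface ExtractedChunk {
  text: string;
  chunkType: string;
}

// Guess at the minimal contract: which extensions it claims, and content in, chunks out.
interface DocTypeExtractor {
  extensions: string[];
  extract(content: string): ExtractedChunk[];
}

// Trivial plain-text/Markdown stand-in: one chunk per blank-line-separated paragraph.
const plainTextExtractor: DocTypeExtractor = {
  extensions: ["txt", "md"],
  extract(content: string): ExtractedChunk[] {
    return content
      .split(/\n\s*\n/)
      .map((paragraph) => paragraph.trim())
      .filter((paragraph) => paragraph.length > 0)
      .map((paragraph) => ({ text: paragraph, chunkType: "paragraph" }));
  },
};

// Dispatch on file extension; null means "no extractor for this type yet".
function pickExtractor(fileName: string, registry: DocTypeExtractor[]): DocTypeExtractor | null {
  const dot = fileName.lastIndexOf(".");
  const ext = dot === -1 ? "" : fileName.slice(dot + 1).toLowerCase();
  for (const candidate of registry) {
    if (candidate.extensions.indexOf(ext) !== -1) return candidate;
  }
  return null;
}

const extractorRegistry = [plainTextExtractor];
const pickedForMarkdown = pickExtractor("notes.MD", extractorRegistry);
const pickedForPdf = pickExtractor("regs.pdf", extractorRegistry);
const demoChunks = plainTextExtractor.extract("First paragraph.\n\nSecond paragraph.\n");
```

The real contract will almost certainly need more than a string in and chunks out (bboxes, page references, binary input), which is exactly what building DOCX as the second type should reveal.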

Source types in priority order:

PDF (refactor existing → first source-type-specific extractor)
DOCX (validates abstraction; key for author segment + Dad's templates)
Plain text + Markdown (trivial off DOCX; covers dev/prosumer)
XLSX (Dad's quote spreadsheets — exercises a totally different shape than text-flow docs)

Agent-validation pipeline stage (new).

Today's pipeline: extract → finalize (compute confidence tier) → human review.

Proposed new stage between extract and human review: agent validation.

LLM reads extracted chunks against the source pages
Flags suspicious chunks for human review (hallucinated content, structurally wrong, semantically off)
Auto-approves high-confidence chunks
Tightens the human-review loop — humans only see what needs human judgment

This complements D11 (operations-based extractor → verbatim text by construction) by catching errors in semantic chunking even when the text is verbatim.
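
The routing half of agent validation (auto-approve vs. send to a human) might reduce to a triage step like this sketch. The verdict shape, the flag names, and the 0.9 cutoff are assumptions for illustration; what the validator actually looks for is left to the spike.

```typescript
type ValidatorFlag = "hallucinated" | "structurally_wrong" | "semantically_off";

// Hypothetical output of the validator LLM for one chunk.
interface ValidatorVerdict {
  chunkId: string;
  confidence: number; // 0..1, as reported by the validator
  flags: ValidatorFlag[];
}

interface TriageResult {
  autoApproved: string[];
  needsReview: string[];
}

// Auto-approve only chunks with no flags and high confidence; everything else goes to a human.
function triageVerdicts(verdicts: ValidatorVerdict[], autoApproveAt: number = 0.9): TriageResult {
  const out: TriageResult = { autoApproved: [], needsReview: [] };
  for (const verdict of verdicts) {
    const clean = verdict.flags.length === 0 && verdict.confidence >= autoApproveAt;
    if (clean) out.autoApproved.push(verdict.chunkId);
    else out.needsReview.push(verdict.chunkId);
  }
  return out;
}

const demoTriage = triageVerdicts([
  { chunkId: "c-1", confidence: 0.97, flags: [] },
  { chunkId: "c-2", confidence: 0.99, flags: ["semantically_off"] },
  { chunkId: "c-3", confidence: 0.62, flags: [] },
]);
```

Note that a flagged chunk goes to review even at high confidence, so the validator can only shrink the human queue, never hide a chunk it was suspicious about.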

Figures/pictures in PDFs (sub-problem):

Today: figures are bbox-overlaid; chunks referencing them have no semantic content of their own
Proposed: describe_figure operation per figure region → Haiku vision generates text description → stored as chunk_type = 'figure_description', embedded with surrounding text context
Cost: ~$0.0003 per figure (tractable)
Spike-test this alongside the doc-type work
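
A sketch of what a figure_description chunk and its embedding input could look like, plus the cost arithmetic implied by the per-figure estimate above. The `FigureChunk` shape, the "[Figure]" prefix, and the helper names are assumptions; the vision call itself is out of scope here and is represented by an already-generated description string.

```typescript
interface FigureChunk {
  chunkType: "figure_description";
  page: number;
  bbox: [number, number, number, number]; // x0, y0, x1, y1 of the figure region
  text: string;           // vision-generated description
  embeddingInput: string; // description embedded together with surrounding text
}

// Wrap the description in its neighbors so the embedding carries local context.
function buildFigureChunk(
  description: string,
  page: number,
  bbox: [number, number, number, number],
  textBefore: string,
  textAfter: string
): FigureChunk {
  const parts = [textBefore, "[Figure] " + description, textAfter].filter(
    (part) => part.trim().length > 0
  );
  return {
    chunkType: "figure_description",
    page: page,
    bbox: bbox,
    text: description,
    embeddingInput: parts.join("\n"),
  };
}

const COST_PER_FIGURE_USD = 0.0003; // the estimate quoted in this spec

function estimateFigureCostUsd(figureCount: number): number {
  return figureCount * COST_PER_FIGURE_USD;
}

const demoFigure = buildFigureChunk(
  "Side view of a car body with labeled dimensions.", // made-up description
  12,
  [72, 140, 520, 410],
  "The body must fit within the envelope shown below.",
  ""
);
const costForThousandFigures = estimateFigureCostUsd(1000); // roughly $0.30
```

At that rate even a thousand figures costs on the order of thirty cents, which is why running the description step on every figure region is affordable enough to spike.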

Open during the spike (not pre-decided):

agent.md schema (structured vs. freeform)
DOCX chunking strategy (paragraph-bound, heading-bound, hybrid)
XLSX semantics (named ranges, detected tables, cell-level)
Figure description embedding (text-only with context vs. multi-modal CLIP-style — text-only likely right)

Dependencies: test corpora (already have). Out of scope: OCR'd PDFs (image-only, no text layer), HTML/URL ingestion, audio/video.

4. KB update flow

Problem. Today: manual one-doc-at-a-time CLI ingest. Authors and legal teams accumulate docs continuously; they need a recurring update path. (Manual upload UI lands as part of item 1; the more nuanced version-detection lives here.)

MVP needs:

Manual upload covers Functional MVP (drag-drop in item 1's surface)
D15 version-detection heuristic on upload: filename + title + structural overlap identify likely prior versions as supersession candidates
Update vs. supersede UX — when uploading a new doc that matches an existing logical-name, surface "looks like a new version of X — supersede?" prompt; don't auto-supersede silently
Re-extract trigger — button on doc inspector to re-run extraction with the current extractor version (useful as the extractor improves during the spike)
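
To make the filename + title + structural overlap idea concrete, here is one possible scoring sketch. It is not the D15 heuristic as designed in Foundry; the weights (0.3 / 0.2 / 0.5), the 0.6 prompt threshold, and all names are assumptions, and "structural overlap" is approximated as Jaccard overlap of section IDs.

```typescript
interface DocFingerprint {
  fileName: string;
  title: string;
  sectionIds: string[];
}

// Strip extension, trailing version/number suffix, and punctuation: "Regs_v2.pdf" -> "regs".
function normalizeName(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/\.[a-z0-9]+$/, "")
    .replace(/[\s_-]*(v|rev)?\d+(\.\d+)*$/, "")
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

function toSet(items: string[]): Record<string, true> {
  const out: Record<string, true> = {};
  for (const item of items) out[item] = true;
  return out;
}

// Jaccard overlap: shared section IDs divided by all distinct section IDs.
function jaccard(a: string[], b: string[]): number {
  const setA = toSet(a);
  const setB = toSet(b);
  const keysA = Object.keys(setA);
  const shared = keysA.filter((key) => setB[key] === true).length;
  const unionSize = keysA.length + Object.keys(setB).length - shared;
  return unionSize === 0 ? 0 : shared / unionSize;
}

function versionMatchScore(existing: DocFingerprint, uploaded: DocFingerprint): number {
  const nameA = normalizeName(existing.fileName);
  const nameMatch = nameA.length > 0 && nameA === normalizeName(uploaded.fileName) ? 1 : 0;
  const titleMatch = existing.title.trim().toLowerCase() === uploaded.title.trim().toLowerCase() ? 1 : 0;
  return 0.3 * nameMatch + 0.2 * titleMatch + 0.5 * jaccard(existing.sectionIds, uploaded.sectionIds);
}

// Above the threshold, the UI asks "looks like a new version of X, supersede?"; it never acts silently.
function shouldPromptSupersede(existing: DocFingerprint, uploaded: DocFingerprint, threshold: number = 0.6): boolean {
  return versionMatchScore(existing, uploaded) >= threshold;
}

const regsV1: DocFingerprint = { fileName: "Technical_Regulations_v1.pdf", title: "Technical Regulations", sectionIds: ["T1", "T2", "T3"] };
const regsV2: DocFingerprint = { fileName: "Technical_Regulations_v2.pdf", title: "Technical Regulations", sectionIds: ["T1", "T2", "T3", "T4"] };
const quoteSheet: DocFingerprint = { fileName: "quote_history.xlsx", title: "Quote History", sectionIds: ["Q1"] };
const promptForV2 = shouldPromptSupersede(regsV1, regsV2);
const promptForSheet = shouldPromptSupersede(regsV1, quoteSheet);
```

The function only decides whether to show the prompt; the supersede itself stays behind the user's confirmation.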

Design questions:

(a) Manual confirm vs. auto-supersede. Lean: confirm. An unnecessary prompt is cheaper than a silent wrong overwrite.
(b) Bulk re-extract. If extractor improves significantly mid-spike, offer "re-extract all docs in this KB"? Probably yes, as a KB-level action alongside the per-doc re-extract button in the inspector.

Dependencies: D15 version-detection heuristic needs implementation (designed in Foundry, not built). Out of scope (post-MVP / Deployed MVP): Google Drive folder sync, Dropbox sync, SharePoint, S3 bucket, Git repo sync, scheduled refreshes, webhook-driven ingest.

5. File diff mechanism

Problem. When v2 of a doc is uploaded, show "what changed since v1." Schema is in (D15); algorithm + UI are not.

Approach: chunk-level structural diff, not git diff.

Git-diff is the wrong primitive here. Git diffs are line-based on text; our content is chunk-based with embeddings and bboxes; PDF binaries don't diff. Instead:

For each new version, run version-detection heuristic (D15) → match to existing logical_document
For each matched logical doc, run chunk-level diff:
- For each new chunk, find nearest neighbor in prior version (cosine similarity + section ID match)
- High similarity (>0.95, same section ID) → unchanged
- Medium → changed (render side-by-side text diff)
- No match → added
- Old chunks unmatched → removed
Surface in /kb/[id]/doc/[id]/diff?compare=v1,v2: green/yellow/red sections
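
The chunk-level diff above can be sketched as follows. Assumptions are labeled: `DiffChunk` is a hypothetical shape, matching is greedy in new-chunk order, and the 0.5 floor separating "changed" from "added" is a placeholder since this doc only fixes the 0.95 upper threshold.

```typescript
interface DiffChunk {
  id: string;
  sectionId: string;
  embedding: number[];
}

type ChunkChange = "unchanged" | "changed" | "added" | "removed";

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Classify every chunk id (new and old) by comparing the new version against the prior one.
function diffChunks(prev: DiffChunk[], next: DiffChunk[], high: number = 0.95, low: number = 0.5): Record<string, ChunkChange> {
  const result: Record<string, ChunkChange> = {};
  const matchedPrev: Record<string, boolean> = {};
  for (const chunk of next) {
    let best: DiffChunk | null = null;
    let bestSim = -1;
    for (const candidate of prev) {
      if (matchedPrev[candidate.id]) continue; // each prior chunk matches at most once
      const sim = cosineSimilarity(chunk.embedding, candidate.embedding);
      if (sim > bestSim) {
        best = candidate;
        bestSim = sim;
      }
    }
    if (best !== null && bestSim > high && best.sectionId === chunk.sectionId) {
      result[chunk.id] = "unchanged";
      matchedPrev[best.id] = true;
    } else if (best !== null && bestSim >= low) {
      result[chunk.id] = "changed";
      matchedPrev[best.id] = true;
    } else {
      result[chunk.id] = "added";
    }
  }
  for (const old of prev) {
    if (!matchedPrev[old.id]) result[old.id] = "removed";
  }
  return result;
}

const prevChunks: DiffChunk[] = [
  { id: "p1", sectionId: "C7.6.2", embedding: [1, 0, 0] },
  { id: "p2", sectionId: "C7.7", embedding: [0, 1, 0] },
  { id: "p3", sectionId: "C8.1", embedding: [0, 0, 1] },
];
const nextChunks: DiffChunk[] = [
  { id: "n1", sectionId: "C7.6.2", embedding: [1, 0, 0] },
  { id: "n2", sectionId: "C7.7", embedding: [0.6, 0.8, 0] },
  { id: "n3", sectionId: "C9.1", embedding: [0.7, -0.7, 0.1] },
];
const chunkDiff = diffChunks(prevChunks, nextChunks);
```

Greedy matching is the simplest option and can mis-pair chunks when a section is reordered; if that shows up in practice, a global assignment over the similarity matrix is the natural next step.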

Git-diff IS useful for the content of a single matched chunk — line-by-line git-style diff for inner-chunk rendering. Git-diff for inner-chunk view, structural diff for inter-chunk view.
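
For the inner-chunk view, a word-level diff can be built on a textbook longest-common-subsequence table, sketched below. This is a swapped-in illustration rather than git's own algorithm, and the `[-x-]` / `{+x+}` rendering, function names, and sample sentence are made up for the example.

```typescript
interface DiffOp {
  kind: "same" | "del" | "add";
  word: string;
}

function splitWords(text: string): string[] {
  return text.split(/\s+/).filter((word) => word.length > 0);
}

// Word-level diff via a longest-common-subsequence table.
function wordDiff(before: string, after: string): DiffOp[] {
  const a = splitWords(before);
  const b = splitWords(after);
  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    const row: number[] = [];
    for (let j = 0; j <= b.length; j++) row.push(0);
    lcs.push(row);
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const ops: DiffOp[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ kind: "same", word: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ kind: "del", word: a[i] });
      i++;
    } else {
      ops.push({ kind: "add", word: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ kind: "del", word: a[i++] });
  while (j < b.length) ops.push({ kind: "add", word: b[j++] });
  return ops;
}

function renderDiff(ops: DiffOp[]): string {
  return ops
    .map((op) => (op.kind === "same" ? op.word : op.kind === "del" ? "[-" + op.word + "-]" : "{+" + op.word + "+}"))
    .join(" ");
}

const renderedDiff = renderDiff(wordDiff("quote valid for 30 days", "quote valid for 45 days"));
```

The same routine gives a line-level diff if the input is split on newlines instead of whitespace, which is one cheap way to support a word/line toggle in the UI.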

Build sequence:

Version-detection heuristic on upload (part of item 4)
Chunk-level diff algorithm
/diff UI

Design questions:

(a) Diff scope. Just v_prev vs v_current, or arbitrary version-pair (v1 vs v5)? Lean: arbitrary-pair.
(b) Inner-chunk rendering. Word-level vs. line-level diff? Probably both — toggle in UI.

Dependencies: item 4 (need multiple versions to diff); item 1 (UI home for the diff view). Out of scope: three-way merge, branch-based collaboration, semantic-change classification.

Sequencing — order of attack

Most items parallelize cleanly.

Parallel track A — Visual design (Claude Design): UI/UX visual design for item 1 — chat pattern, KB management, ingestion pipeline view, doc drill-in.

Parallel track B — Extraction spike (Dan): Item 3 spike-and-iterate against test corpora. PDF refactor first, then DOCX. Iterate. Write up emergent abstraction.

Parallel track C — MCP foundation (whenever): Item 2 — independent of the others. Probably wait until Cognito dev pool setup is convenient anyway.

Serial after tracks A + B converge:

Item 1 implementation (UX flow) — uses both the visual design and the now-shipped multi-doc-type extractor
Item 4 (KB update flow) — needs item 1's upload UI shipped
Item 5 (file diff) — needs item 4's multiple-version state

What's NOT in Functional MVP

Explicitly cut from Functional MVP:

Moved to Deployed MVP:

Auth + multi-tenancy enforcement (RLS policies, scope-passing contract)
Onboarding for strangers (sample KB, "try without uploading," first-run tour)
AWS-native deploy (ECS, RDS, S3, CloudFront)
Bedrock prod path swap (D16)
Stripe + tier enforcement
Public landing page
Per-KB ACLs / sharing

Post-MVP (after Deployed MVP):

Quality baseline / regression-test loop (visual QA via inspector + chat is enough for Functional MVP)
Auto-sync integrations (Google Drive, Dropbox, SharePoint, S3, Git)
Data export / KB dump
HTML/URL ingestion
OCR for image-only PDFs
Multi-modal embeddings beyond text
Pro tier, Studio tier, Enterprise features
Hover popovers on source chips, reciprocal rank fusion, pgvector tuning beyond lists=20

Success criteria — how do we know Functional MVP is done?

Concrete behaviors that count as success:

Inspector trust. Dan drops a fresh 50-page PDF, watches the ingestion pipeline view, and trusts the extraction without manual chunk review. Color-coded retrieval traces explain every chunk surfaced in chat.
Multi-doc-type breadth. PDFs, DOCX, MD/plain text work cleanly. XLSX is nominally a stretch goal, but the Dad cohort criterion depends on his spreadsheets ingesting, so it will probably be done.
Hosted MCP works. Dan connects Claude Desktop / Cursor / Claude Code to a local Autri MCP server via OAuth, scopes a token to one KB, and runs successful agent queries.
STEM Racing kids cohort. At least one kid imports the regs (or uses the existing KB) and runs 5 creative-interpretation queries with correctly cited sources, without Dan's intervention.
Dad cohort. Dad's spreadsheets and raw PDFs both ingest cleanly; he runs at least 3 "what did we quote for similar scenarios?" queries that return useful results (validates the cross-product compound thesis).
Versioning + diff works. Dan re-uploads a STEM Racing reg with edits and the diff view shows the changes correctly across both inter-chunk and inner-chunk views.

If 5 of 6 land, Functional MVP is done. Move to Deployed MVP.

Open questions / refinement areas

Highest-leverage refinement areas across items:

The right-rail-vs-other-layout decision (item 1) — drives the rest of the navigation model.
Agent-validation prompt design (item 3) — what does the validator look for? Spike will tell us.
Local OAuth approach (item 2) — drives how much Deployed-MVP foundation we accidentally land in Functional MVP.

These three set the constraints for everything else.

Next steps

Refine this doc together (this session + future sessions). Surface design questions as we hit them; iterate.
Once H2s are tight, spin out epic docs per item: projects/autri/epics/e1-ui-ux.md, e2-mcp-sse-oauth.md, e3-multi-doc-type.md, e4-update-flow.md, e5-file-diff.md.
Each epic gets implementation-grade detail before any feature branch opens.
Claude Design takes UI/UX visual design as a parallel track.
