Epic: Beta Launch (deploy current main, verify it holds in prod, invite the first users)

BLUE-TEAMED 2026-06-07 (drafted, fresh-agent ultracode red-team, then scope-cut with Dan). This is P0 on the roadmap (Horizon 1). It is not a build — the web app is already deployed and feature-complete at app.autri.ai. P0 is deploy current main → verify the trust-critical things hold in prod → invite the first individual users → measure. Mostly ops/verification, sequential. Ladders to North Star B4 (tiered security), B7 (dependable), B2 (know-cost-cold), B1 (trustable retrieval).

Verification stance (the spine of this epic). Every acceptance criterion names a deterministic check and an owner: 🤖 AI-mechanized (I close the loop myself before review) or 🧑 human-judgment (Dan's smoke test / qualitative call). The rule: the mechanized loop closes first; the human smoke-test is the backstop, not the primary. A criterion with no checking mechanism is itself a red-team finding.

Beta shape (locked 2026-06-07). The beta is individual users, not organizations — a couple of people from STEM Racing plus family/friends. Each signup auto-provisions its own org (the Cognito post-confirm Lambda inserts org + user + Personal library; users.organization_id = users.id = Cognito sub, a strict 1:1 self-org). So "an individual user" is a tenant — we build zero org infrastructure for the beta. The Team-tier multi-user-shared-org path (D21: library_access grants, invites, admin roles) exists in schema but the beta does not exercise it (see Non-Goals).
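A minimal sketch of the 1:1 self-org shape described above. The real post-confirm Lambda is not shown in this doc, so the function and field names here are hypothetical; only the invariant users.organization_id = users.id = Cognito sub comes from the epic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvisionedTenant:
    """Rows one signup would create. Field names are illustrative assumptions."""
    org_id: str
    user_id: str
    user_organization_id: str
    library_name: str
    library_org_id: str


def provision_self_org(cognito_sub: str) -> ProvisionedTenant:
    """Build the org + user + Personal library rows for a single signup.

    Every id is the Cognito sub, which is what makes the org a strict self-org.
    """
    if not cognito_sub:
        raise ValueError("cognito_sub is required")
    return ProvisionedTenant(
        org_id=cognito_sub,
        user_id=cognito_sub,
        user_organization_id=cognito_sub,
        library_name="Personal",
        library_org_id=cognito_sub,
    )


def is_strict_self_org(t: ProvisionedTenant) -> bool:
    """True when org, user, and library all hang off the same single id."""
    return t.org_id == t.user_id == t.user_organization_id == t.library_org_id
```

Under this shape, "two individual users" and "two tenants" are the same thing, which is why F2 only needs cross-individual probes.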

Features & Stories

Each feature carries a Verification mechanism line (🤖 AI / 🧑 human).

Design

Data Model Changes

None new. The deploy applies migrations 013/014 (additive, already written). The /api/cache fix is authorization logic plus presigned-URL issuance, with no schema change.

Context

Overview

Goals & Non-Goals

Goals:

Deploy current main to prod (web + migrations), shipping cost-observability (D66) + UI fixes. (Not a tenancy-fix ship: that fix is already live; see the correction under Current State.)
Prove the trust-critical invariants hold in prod with ≥2 real individual users: cross-user data isolation across every surface, and no silent data loss.
Confirm cost observability works on the real deployed worker (the "measure first" premise).
Seed a behavioral regression harness (F6) with the launch-critical checks — non-blocking, future-proofs the work.
Invite the first individual users and start collecting cost + quality + behavior data.

Non-Goals:

Not a feature build. No new product capability beyond what's on main (the cohort-gated source view F5 is the one conditional exception).
Not the Team-tier multi-user-org path (D21) — invites, library_access grants between users, admin roles. Beta = individual self-orgs; F2 tests cross-individual isolation only. Don't over-build org infra.
Not per-org / per-user rate or cost caps. Account-level AWS Budgets ($50/$100/$200, email-notified) are the spend brake; per-tenant metering is H2 (D18). Accepted risk for a ~10-user friendly beta: a looping user is alerted-on, not throttled.
Not commerce (Stripe/tiers/metering — H2), not MCP (cut, D56), not the Horizon-0 quality cluster, not enterprise (H3).

Problem Statement

The platform is built and deployed, but the beta is not launched because the trust-critical things have never been exercised under real conditions:

Prod is behind main — missing cost-observability. (It is not missing a tenancy fix — see the correction below; that framing was wrong.) Deploying is small but the auth stack is in a failed-deploy state that must be understood first (F1).
Cross-user isolation is enforced in code but never tested with ≥2 real users in prod. A leak between users in a trust-first beta is catastrophic — the single highest real risk. Enforced-in-code ≠ verified.
Failures may not surface. Monitoring exists, but 10 jobs sit in dead-letter queues (3 ingest + 7 extract, in ALARM since 2026-05-31, ~1 week un-actioned) — and we don't know whether those failures surfaced to the uploaders or vanished. The alarm channel is a single un-escalated email, which is why they sat unnoticed.
Cost columns are worker-written — a local CLI ingest doesn't populate them, so D66 is verified only by seed, never by a real deployed run.

What Is This Epic?

The deploy-verify-invite effort that turns a live-but-unlaunched platform into a running beta. Ship current main, run a fixed verification battery against prod (cross-user probe matrix, induced-failure surfacing, real-ingest cost), then invite the first individual users. Output = evidence the beta is safe for real user data, plus a seeded regression harness — not new product code.

Dependents

Everything in Horizon 2+ unblocks on real beta usage data this epic produces (pricing B5, API+dogfood B3/B6, enterprise B4).
Every future epic extends F6's harness with its own cases.

Dependencies

The deployed platform (EPIC-5, 2026-06-01): Cognito + Google auth, email allowlist, per-user self-orgs, multi-tenancy in code, upload→inspect→chat→cost UI. Verified, not built.
AWS self-verification access (autri-prod CLI profile, read across all 5 stacks/RDS/CloudWatch/SQS/Budgets — documented in CLAUDE.md). Makes F1/F3/F4 self-checkable. Gap: RDS is private and no SSM/bastion exists — DB-level checks need a new in-VPC read Lambda or the app's query path (F4).
deploy.sh / deploy-web.sh — deploy.sh all runs migrate→web→ingestion→monitoring and explicitly excludes the auth stack (it's drift-blocked). Migrations auto-apply via the autri-db-migrate Lambda.
CDK stacks — autri-auth-and-compute is in UPDATE_ROLLBACK_COMPLETE (failed 2026-06-01 deploy); the beta payload doesn't touch it (F1.b).
Real Cognito signups — the F2 two-user provisioning path (replaces the prod-forbidden AUTRI_DEV_AUTH).
Crucible + the existing vitest runner (incl. app/__tests__/scope.test.ts) — the F6 harness substrate.
The shipped in-app feedback→GitHub-issues button (D55) — the named beta support channel.

Current State

Prod: app.autri.ai, live. Deployed web image 0d22bbb (2026-06-02). Delta to main = 59 commits, but only 5 touch app/ (the deploy-web.sh surface); the other 54 are eval/ingestion/docs that don't deploy here. Migrations 013/014 ship via deploy.sh migrate.
Tenancy fix already live (CORRECTION). The query-playground org-scope fix (5371370, D13's "last read-leak surface") is an ancestor of the deployed image — it shipped 2026-05-29. The deploy does not ship it. (decisions.md self-contradicted on this — L184 fixed 2026-06-07. Same stale-doc trap as last session, inverted; caught by a fresh agent checking git against the live Lambda.)
Multi-tenancy (D13): read / mutation / query-playground enforcement shipped (org-scoped lookups, not-found on cross-org). 8 of 9 mutations use requireKbAccess; addDocumentsToKb uses an inline org-WHERE + worker-mode early-return (two mechanisms). Never validated by real multi-user traffic — the graduation gate.
Auth stack wedged: autri-auth-and-compute = UPDATE_ROLLBACK_COMPLETE, rollback blocked 3× by the cert export. The drift is MCP/AgentCore infra — which D56 cut from beta. CloudFormation StackDriftStatus = NOT_CHECKED (drift is asserted from memory, never measured).
Monitoring (autri-monitoring): alarms for DLQ depth, ingestion-degraded, postgres-connections, web/chat error-rate; all route to one SNS topic with a single email subscriber. RDS: single-AZ db.t4g.small (MultiAZ=false), 7-day backups + PITR, deletion-protection on. Restore (RTO) never tested.
Cost brakes exist: 3 monthly AWS Budgets ($50/$100/$200, ACTUAL+FORECASTED email alerts). No per-org/rate cap.
Cost observability (D66): rate table + per-doc stage-breakdown columns (013) + per-query Sonnet cost (014). Not deployed — rides this deploy; worker-written, so only a real deployed ingest exercises it.

Affected Systems

System / Layer	How It's Affected
`deploy.sh` / `deploy-web.sh`	The beta deploy path (web+migrations); never touches the wedged auth stack; Docker prune first (ENOSPC)
Migrations 013/014	Auto-apply via `autri-db-migrate`; verified additive (forward-only is safe)
`autri-auth-and-compute`	Health-checked (logins work post-rollback?), then a deliberate in/out-of-beta-scope call — NOT auto-touched
Multi-tenancy enforcement	Verified, not changed — probe matrix exercises every guarded surface (incl. `/api/cache`, `addDocumentsToKb`)
`/api/cache/[...path]` (page images/figures)	Gets org-authorization (currently UUID-obscurity only) — prod fix via org-scoped presigned URLs
Ingestion worker + DLQs	Cost-column writes + failure-surfacing verified; the 10 stranded messages investigated
`autri-monitoring` / SNS	Alarm→subscription wiring asserted; higher-signal route for DLQ/error-rate
Crucible / `vitest` runner	The F6 harness substrate (non-visual cases in vitest, visual in Crucible)
Inspector source pane (F5)	Only if the cohort brings non-PDF docs (more likely now — friends/family, not just STEM Racing PDFs)

Approach

Sequential, gated on the deploy. The isolation gate (F2) is decoupled from the harness build (F6) so net-new harness work never blocks launch.

Deploy current main (F1.a): Docker prune, then deploy.sh all (runs migrate→web→ingestion→monitoring; the beta payload is web + migrations), then smoke. Separately, health-check the wedged auth stack (F1.b) and decide if beta needs it touched (likely not, since MCP is cut).
Verify isolation (F2) — sign up 2 real individual users (real Cognito/Google, real cookies), run the cross-user probe matrix as a minimal scripted check on the real auth path. Hard launch gate — must pass before any real user data goes in. Folding it into F6 is a fast-follow, not a prerequisite.
Verify reliability (F3) — investigate the DLQ backlog, prove failures surface, confirm monitoring is enough.
Verify cost (F4) — one real ingest through the deployed worker; confirm columns populate.
(Conditional) source view (F5) — only if the cohort brings non-PDF docs.
Seed the regression harness (F6) — non-blocking; turns F2/F3/F4 checks into permanent cases.

→ Beta is "launched" when: F1–F4 green, the first users are invited (Dan), AND ≥1 real user has uploaded + chatted. Green infra alone is not "launched."
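The launch definition above reduces to a small predicate. This is a sketch only; the `LaunchState` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Features that must be green before the beta counts as launched.
GATING_FEATURES = ("F1", "F2", "F3", "F4")


@dataclass
class LaunchState:
    feature_green: Dict[str, bool]          # e.g. {"F1": True, "F2": False, ...}
    users_invited: bool                     # Dan has sent the first invitations
    real_user_uploaded_and_chatted: bool    # at least one real user did both


def is_launched(state: LaunchState) -> bool:
    """Green infra alone is not enough: invites and real usage are also required."""
    infra_green = all(state.feature_green.get(f, False) for f in GATING_FEATURES)
    return infra_green and state.users_invited and state.real_user_uploaded_and_chatted
```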

Onboarding/support (folded in, not a feature): Dan owns invitations. Support = the shipped in-app feedback→GitHub-issues button (D55) — named here, not built; someone watches the issues. First-run sanity is covered by F2.S2.1 (signing up the 2 real users is the first-run check).

The deploy and verification waves are a poor /hl:ship parallel-wave fit (sequential, ops, prod-facing): execute them human-led. Only the Wave 0 code prep is fan-out-ready.

Key methodology — the cross-user isolation probe matrix (F2)

The invariant (D13): for every guarded surface, a request from user A for user B's resource returns "not found" — same shape as a genuinely missing resource — no existence leak, no data. Run as user A against user B's real resources, on the real authenticated path (real Cognito cookies, never mocked — AUTRI_DEV_AUTH is statically compiled out in prod and is a synthetic bypass anyway):

Surface	Probe (as user A, targeting user B's resource)	Expected
KB read (`/kb/[kbId]`)	open B's kbId	not-found, no leak
Doc read (`/docs/[id]`)	open B's docId	not-found
Chat (`/api/chat`)	query against B's kbId	not-found / no cross-user chunks
Query playground (`/docs/[id]/query`)	run query on B's docId	not-found
Mutations — `requireKbAccess` (8 of 9)	rename/delete/approve/etc. on B's kbId/docId	not-found, no mutation
Mutation — `addDocumentsToKb` (separate mechanism)	add docs to B's KB / add B's docId to your KB	not-found (probe its actual prod early-return behavior)
kbId+docId mix-and-match	your kbId, B's docId	not-found (`knowledge_base_id = kbId` defense)
`/api/cache/[...path]` (page images/figures)	GET `<B-docId>/page-1.png` while authed as A	not-found, not image bytes (the new authz — see F2.S2.4)
Retrieval (`vector`/`fts`/`lookup`)	—	org-blind by design: verified transitively via the Chat/query guards + a white-box assertion that no caller passes an unvalidated kbId (not an independent runtime probe)

Also confirm the positive case — user A fully accesses its own resources — so we're testing isolation, not a blanket 404.
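A minimal sketch of how one pass over the matrix could be scripted. It assumes each surface can be hit with three paths (own, another user's, and a genuinely missing id) and that not-found bodies do not echo the requested id. The `Probe` type, `run_matrix`, and the injected `fetch` callable are hypothetical names, not the real harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Response = Tuple[int, str]              # (HTTP status, response body)
Fetch = Callable[[str, str], Response]  # (acting user, path) -> response


@dataclass(frozen=True)
class Probe:
    surface: str
    own_path: str       # a resource the acting user owns (positive case)
    foreign_path: str   # same surface, another user's resource
    missing_path: str   # same surface, an id that exists for nobody


def run_matrix(fetch: Fetch, user: str, probes: List[Probe]) -> List[str]:
    """Return readable failures; an empty list means every cell is green."""
    failures: List[str] = []
    for p in probes:
        own = fetch(user, p.own_path)
        foreign = fetch(user, p.foreign_path)
        missing = fetch(user, p.missing_path)
        if own[0] != 200:
            failures.append(f"{p.surface}: self-access failed ({own[0]})")
        if missing[0] == 200:
            failures.append(f"{p.surface}: missing-resource baseline returned 200")
        # Cross-user must be indistinguishable from a genuinely missing resource.
        if foreign != missing:
            failures.append(f"{p.surface}: cross-user response differs from not-found")
    return failures
```

In a real run, `fetch` would issue HTTPS requests against prod carrying each user's real Cognito session cookie; comparing against the missing-resource baseline is what catches an existence leak such as a 403.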

F1 — Deploy current main (+ assess the wedged auth stack) → B2 · S–M · 🧱

F1.a — Routine deploy (ships the beta payload; never touches auth):

Story	Summary	Acceptance
S1.1	`cdk diff` / `detect-stack-drift` of the 5 app/ commits' deploy surface; confirm scope (not "59 commits")	Real diff artifact reviewed; drift measured not assumed
S1.2	Prune Docker; `deploy.sh all` (web + migrations 013/014)	Prod at main's SHA; migrations applied; no ENOSPC
S1.3	Post-deploy render smoke vs prod (Crucible)	Auth + upload + inspect + chat + cost-display render green
S1.4	Document + dry-run rollback (re-tag web Lambda to prior ECR image; record that 013/014 are additive → no DB rollback needed)	A tested rollback path exists, not an open question

Verification mechanism: 🤖 cdk diff + prod-SHA check + Crucible render smoke + a rollback dry-run. 🧑 Dan approves the actual deploy.sh run (outward-facing).
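For the S1.1 scope check ("5 app/ commits, not 59"), a sketch of counting which commits touch the deploy surface. It assumes input shaped like the output of `git log --format=%H --name-only <deployed>..main`, treating any 40-hex line as a commit header; the function names are hypothetical.

```python
import re
from typing import Dict, List

SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def parse_name_only_log(text: str) -> Dict[str, List[str]]:
    """Map each commit SHA to the paths it changed."""
    commits: Dict[str, List[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separators between header and file list
        if SHA_RE.match(line):
            current = line
            commits[current] = []
        elif current is not None:
            commits[current].append(line)
    return commits


def commits_touching(commits: Dict[str, List[str]], prefix: str = "app/") -> List[str]:
    """SHAs with at least one changed path under the given prefix."""
    return [
        sha for sha, paths in commits.items()
        if any(p.startswith(prefix) for p in paths)
    ]
```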

F1.b — Auth-stack health (investigate, then decide):

Story	Summary	Acceptance
S1.5	Confirm the wedged auth stack left logins functionally intact; decide if beta needs it reconciled	Login/signup verified working; an explicit "auth reconciliation is / isn't in beta scope" decision recorded (MCP cut → likely out)

Verification mechanism: 🤖 describe-stacks state + an end-to-end login check. 🧑 Dan's call on whether to do the manual cert-re-pin surgery now or defer (non-launch-gating).
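The describe-stacks half of this check can be made deterministic by reading two fields from the JSON that `aws cloudformation describe-stacks` prints. A sketch, with hypothetical function names:

```python
import json
from typing import Dict


def stack_health(describe_stacks_json: str, stack_name: str) -> Dict[str, str]:
    """Extract StackStatus and drift status for one stack from describe-stacks output."""
    doc = json.loads(describe_stacks_json)
    for stack in doc.get("Stacks", []):
        if stack.get("StackName") == stack_name:
            drift = stack.get("DriftInformation", {}).get("StackDriftStatus", "NOT_CHECKED")
            return {"status": stack["StackStatus"], "drift": drift}
    raise LookupError(f"stack not found: {stack_name}")


def drift_is_measured(health: Dict[str, str]) -> bool:
    """NOT_CHECKED or UNKNOWN means drift is asserted from memory, not measured."""
    return health["drift"] in ("IN_SYNC", "DRIFTED")
```

The login half still needs a real end-to-end sign-in; stack state alone does not prove logins work.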

F2 — Verify cross-user isolation in prod with ≥2 real users → B4 · S–M · 🧱 · HIGHEST RISK / LAUNCH GATE

Story	Summary	Acceptance
S2.1	Sign up 2 real individual users (real Google/Cognito) — each gets a self-org + KB + doc	Two real auth contexts; doubles as the first-run onboarding check (fresh signup lands on a usable empty state)
S2.2	Scripted cross-user probe matrix on the real cookie path	Every cross-user probe → not-found, no leak; positive (self-access) works; documented
S2.3	White-box assertion: no production caller passes an unvalidated kbId into retrieval	grep/code-review confirms retrieval's org-blindness is always guarded one layer up
S2.4	Add org-authorization to `/api/cache` (page images/figures) — resolve docId→org against the session; org-scoped presigned URLs in prod	Cross-user image GET returns not-found, not bytes; matrix row green

Verification mechanism: 🤖 the probe matrix (deterministic per cell) + the white-box grep + the /api/cache probe — re-runnable forever. 🧑 the 2-account setup is an honest human task; a one-time review that the suite hits the real path.

F3 — Verify failures surface + monitoring is sufficient → B7 · S · 🧱

Story	Summary	Acceptance
S3.1	Investigate the live DLQ backlog (3 ingest + 7 extract, in ALARM since 2026-05-31)	Root cause known (note: some are planted e2e test data); determined whether real failures surfaced or vanished
S3.2	Verify failed-ingest surfacing end-to-end	An induced bad ingest shows a terminal error state to the user; never silently vanishes or hangs in "processing"
S3.3	Confirm monitoring routing + accept availability posture	Each alarm has an SNS action with a confirmed subscriber (deterministic); a human confirms they read it; single-AZ accepted as a written decision; one backup-restore drill records RTO

Verification mechanism: 🤖 induced-failure harness case, sqs get-queue-attributes, describe-alarms AlarmActions + sns list-subscriptions-by-topic, describe-db-instances backup config. 🧑 someone confirms the alarm channel is watched (consider a higher-signal route for DLQ/error-rate); the restore drill.
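A sketch of the deterministic part of S3.1 and S3.3, written as pure functions over the JSON that the AWS CLI calls above return. It relies on SNS reporting a pending subscription with the literal SubscriptionArn "PendingConfirmation"; the function names are hypothetical.

```python
from typing import Dict, List


def confirmed_subscribers(subscriptions: dict) -> List[str]:
    """Endpoints whose subscription has a real ARN (i.e. is confirmed)."""
    return [
        s["Endpoint"]
        for s in subscriptions.get("Subscriptions", [])
        if s.get("SubscriptionArn", "").startswith("arn:")
    ]


def unrouted_alarms(alarms: dict, subs_by_topic: Dict[str, dict]) -> List[str]:
    """Names of alarms with no action that reaches a confirmed subscriber.

    alarms: output of `cloudwatch describe-alarms`.
    subs_by_topic: topic ARN -> output of `sns list-subscriptions-by-topic`.
    """
    bad: List[str] = []
    for alarm in alarms.get("MetricAlarms", []):
        reachable = any(
            confirmed_subscribers(subs_by_topic.get(arn, {}))
            for arn in alarm.get("AlarmActions", [])
        )
        if not reachable:
            bad.append(alarm["AlarmName"])
    return bad


def dlq_depth(queue_attrs: List[dict]) -> int:
    """Total visible messages across DLQs, from `sqs get-queue-attributes` outputs."""
    return sum(int(q["Attributes"]["ApproximateNumberOfMessages"]) for q in queue_attrs)
```

This proves the wiring only. Whether a human actually reads the email is the 🧑 half of the criterion.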

F4 — Real worker cost verification → B2 · S · 🔀

Story	Summary	Acceptance
S4.1	One real ingest of a structured/figure-heavy doc (e.g. an FIA technical PDF) through the deployed worker	Per-doc cost columns + stage breakdown populate; cost is order-of-magnitude plausible for that doc type (not a blanket "S6 range"); figure-vision cost is NON-zero (the D66 regression this must prove fixed)
S4.2	One real chat query	Per-query Sonnet cost populates; bound checked against `pricing.ts` directly (S6 has no query baseline); inspector cost line renders

Verification mechanism: 🤖 read the cost columns via the app's in-VPC query path (primary — no SSM/bastion exists) or a tiny read-only in-VPC Lambda (if a DB-level assert is wanted, its own approved story); Crucible asserts the inspector cost line renders. 🧑 sanity-check the dollar figures.
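A sketch of the S4.1/S4.2 assertions once a cost row has been read back. The column names are hypothetical (the 013/014 schema is not reproduced in this doc), and the per-million-token rates are caller-supplied placeholders, not values taken from pricing.ts.

```python
from typing import Dict, List, Optional

# Hypothetical column names standing in for the per-doc stage breakdown.
COST_COLUMNS = ("total_cost_usd", "parse_cost_usd", "figure_vision_cost_usd", "embed_cost_usd")


def check_doc_cost(row: Dict[str, Optional[float]], plausible_total_usd: float) -> List[str]:
    """Return problems with one document's cost row; empty list means it passes."""
    problems: List[str] = []
    for col in COST_COLUMNS:
        if row.get(col) is None:
            problems.append(f"{col} is null")
    vision = row.get("figure_vision_cost_usd")
    if vision is not None and vision <= 0:
        problems.append("figure_vision_cost_usd is zero")  # the D66 regression
    total = row.get("total_cost_usd")
    if total is not None and not (plausible_total_usd / 10 <= total <= plausible_total_usd * 10):
        problems.append("total_cost_usd is not within an order of magnitude")
    return problems


def query_cost_usd(input_tokens: int, output_tokens: int,
                   usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Expected per-query cost from token counts and per-million-token rates."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000
```

The order-of-magnitude band is deliberately loose: it catches unit errors and zeroed stages, while the dollar sanity check stays with a human.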

F5 — (Cohort-gated) Source view for all doc types → B1 · S · 🔀

Story	Summary	Acceptance
S5.0	Determine first-cohort doc types	More likely to trigger now — friends/family may bring docx/prose, not just STEM Racing PDFs. Non-PDF present → F5 is P0; PDF-only → fast-follow
S5.1	(If triggered) render source pane for non-PDF doc types	docx/prose/markdown show a usable source view, not a dead pane

Verification mechanism: 🤖 headless Preview snapshot of a non-PDF source pane, assert non-empty. 🧑 visual judgment it's usable.
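The S5.0 gate is simple enough to state as code. A sketch, assuming the cohort's doc types can be read off file extensions; the function name is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable


def f5_priority(filenames: Iterable[str]) -> str:
    """Return "P0" if the cohort brought any non-PDF doc, else "fast-follow"."""
    for name in filenames:
        if PurePosixPath(name).suffix.lower() != ".pdf":
            return "P0"
    return "fast-follow"
```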

F6 — Verification & Regression Harness (cross-cutting; non-blocking; seeded) → B7, B1 · M · 🔀

Story	Summary	Acceptance
S6.1	Non-visual cases (F2 status codes, F3 terminal-state row, F4 cost-column non-null) in the existing `vitest` runner against a deployed env with real cookies; visual cases (F4 cost line, F5 source pane) in Crucible	The right tool per case type (answers the OQ#7 "is Crucible right?" — split it)
S6.2	Seed with this epic's launch-critical cases	F2 matrix, F3 surfacing, F4 cost all encoded
S6.3	Wire into the loop + convention	Runnable per-branch / pre-deploy; "add a case when you ship a feature" documented

Verification mechanism: 🤖 harness self-reports green; deterministic cases. 🧑 one-time faithfulness check (real path, not mocks) + Dan's standing smoke-test as the backstop layer.

Stories

Story	Summary	Status	PR
F1.a S1.1–S1.3	Deploy main + diff + smoke	Not started (Wave 1, human-led)
F1.a S1.4	Document + dry-run rollback	Doc MERGED (`autri-infra/docs/rollback.md`); dry-run is a Wave-1 prod action	autri-infra#4
F1.b S1.5	Auth-stack health → in/out-of-scope decision	Not started. Root cause now evidenced: `cdk diff` shows `autri-auth-and-compute` has a pending `[-]` removal of the `CertsAppCert` export that `autri-web` imports → the export-in-use deadlock behind `UPDATE_ROLLBACK_COMPLETE`. Wave-1 needs the export-severing sequence.
F2 S2.1–S2.3	Sign up 2 real users + cross-user probe matrix + white-box retrieval assertion	Not started (Wave 2, post-deploy)
F2 S2.4	`/api/cache` org-authz	MERGED to main (Wave 0b). Cross-user probe is still Wave 2.	autri#63 + autri-infra#2
F3 S3.1–S3.3	DLQ backlog + surfacing + monitoring/availability	Not started
F4 S4.1–S4.2	Real worker + query cost (figure-vision non-zero)	Not started
F5 S5.0–S5.1	(Cohort-gated) non-PDF source view	Not started
F6 S6.1–S6.3	Behavioral regression harness (non-blocking, seeded)	Not started (S6.1 scaffold = remaining Wave-0b)
(Known-issue polish)	Welcome-notification copy + dead `/help/claude-desktop` link	MERGED to main (rides Wave-1 deploy; lives in post-confirm Lambda)	autri-infra#3

S2.4 realization (session 3): the existing /api/cache/[...path] route became the single authz chokepoint. It checks the session and the doc→org mapping, then in prod issues a 307 redirect to a short-lived presigned S3 GET. The alternative, presigning at the inspector + chat server seams, was not chosen because it would inject ~500-char signed URLs into the model's chat context on every query and cause TTL staleness. CloudFront behavior 5 and the cache-bucket OAC grant were removed, so there is no unauthenticated edge path. Org-authz is prod-only (CACHE_BUCKET-gated); dev still serves local files (the CLI keys by filename slug, not UUID). An adversarial security review found no constructable cross-org read. cdk diff confirmed the Web + NetworkAndData changeset with autri-auth-and-compute untouched (decoupled from F1.b). The api-cache app + infra halves must deploy together (Wave 1); see autri-infra/docs/rollback.md.
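The chokepoint logic described above, as a minimal sketch. This is not the real route.ts: the function, its injected `doc_org_lookup` and `presign` callables, and the choice to answer 404 for an unauthenticated request are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CacheDecision:
    status: int
    location: Optional[str] = None  # redirect target when status is 307


NOT_FOUND = CacheDecision(status=404)


def authorize_cache_request(
    session_org_id: Optional[str],
    path: str,
    doc_org_lookup: Callable[[str], Optional[str]],
    presign: Callable[[str], str],
) -> CacheDecision:
    """Decide the response for a path shaped like '<docId>/page-1.png'."""
    doc_id, _, rest = path.partition("/")
    if not session_org_id or not doc_id or not rest:
        return NOT_FOUND
    owner_org = doc_org_lookup(doc_id)
    # Missing doc and foreign doc get the identical answer: no existence leak.
    if owner_org is None or owner_org != session_org_id:
        return NOT_FOUND
    # Authorized: hand back a redirect to a short-lived presigned GET.
    return CacheDecision(status=307, location=presign(path))
```

Because the signed URL only ever appears in a redirect, it never has to be embedded in chat context or page markup.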

Execution Plan (waves + sequencing)

Locked with Dan 2026-06-07. The organizing principle: separate code-work (parallelizable, pre-deploy) from verification-work (post-deploy, the gate). That split is what makes Wave 0 a /hl:ship candidate while the deploy + verification waves stay human-led and sequential.

Critical-path spine: decide /api/cache fix-shape → code /api/cache + welcome fixes → merge → clean deploy → F2 (S2.1→S2.2) → invite first users. Everything else hangs off this in parallel.

Wave 0 — Pre-deploy prep (parallel; zero prod risk; the `/hl:ship`-able subset)

Two micro-stages because two tasks need a decision before coding:

0a — Decide, before coding (small, fast):

/api/cache fix-shape (resolves OQ#3). In prod, page images/figures are served S3→CloudFront with no Lambda in the read path, so editing route.ts only fixes the dev path. Real org-auth needs a design call: an authenticated endpoint issuing org-scoped presigned URLs vs. a CloudFront function / Lambda@Edge doing the check. This may touch CDK, not just the web image — so the "clean deploy" could carry an infra change. Decide the shape first.
F1.b S1.5 — auth-stack health + welcome-fix deploy path (elevated to Wave 0 because it gates the welcome fix). The welcome notification lives in the post-confirm Lambda, which is part of the wedged auth-and-compute stack. Determine whether it can be updated via a targeted aws lambda update-function-code (sidestepping the wedged CloudFormation update) or must wait on auth reconciliation. Also confirm logins work post-rollback.

0b — Code (parallel, fan-out-ready):

F2 S2.4 (code) — /api/cache org-authz fix per the 0a shape. The trust-critical one; must land in main to ride the deploy.
Welcome-message fix — repoint the notification body + link off the dead /help/claude-desktop (cut MCP feature) to a real first-run action (e.g. upload page). Deploy path per 0a.
(Optional polish, non-gating) Mascot 404 page — not-found.tsx using the WIP golden-retriever SVGs in dogs/, with the in-app feedback→GitHub button (D55) that auto-captures the attempted URL + referrer so a 404 report reads "user hit /help/claude-desktop", not a generic "broken". Closes the feedback loop on whatever lands users there. Slate if Wave 0 has room; otherwise fast-follow.
F6 S6.1 (scaffold) — stand up the harness (non-visual cases in the existing vitest runner; visual in Crucible). Non-blocking.
F1.a S1.4 — document the rollback procedure.

→ Merge Wave-0 code to main. cdk diff (F1.a S1.1) runs after this merge — the deploy now ships more than current main, so the diff must reflect what actually goes out.

Wave 1 — Deploy (sequential, human-led; the gate to prod state)

F1.a S1.2 (deploy.sh all — main + Wave-0 code + migrations 013/014; Docker prune first) → S1.3 (post-deploy Crucible render smoke). Dan approves the actual deploy. The welcome-fix and any /api/cache infra change ride here (or via the targeted Lambda update from 0a).

Wave 2 — Post-deploy verification (3 parallel tracks)

Track	Stories	Launch gate?
F2 isolation	S2.1 (sign up 2 real users) → S2.2 (cross-user probe matrix, incl. now-fixed `/api/cache`) → S2.3 (white-box retrieval assertion)	YES — hard gate
F3 reliability	S3.1 (DLQ root-cause) ∥ S3.2 (failure surfacing) ∥ S3.3 (monitoring routing + single-AZ decision + restore drill)	no
F4 cost	S4.1 (real ingest, non-zero vision cost) ∥ S4.2 (query cost)	no

F5 S5.0 (determine cohort doc types) runs anytime here; gates the conditional S5.1.

Wave 3 — Harness consolidation (non-blocking fast-follow)

F6 S6.2 (seed the verified F2/F3/F4 checks as permanent cases) → S6.3 (wire into the loop). Depends on Wave-2 checks existing; does not gate launch.

Wave 4 — Launch

Dan invites the first individual users (a couple from STEM Racing + friends/family) → confirm ≥1 real user has uploaded + chatted = "launched." F5 S5.1 only if the cohort brought non-PDF docs.

Parallelism summary

Fan-out-ready (Wave 0b code): /api/cache fix · welcome fix · 404 page · harness scaffold · rollback doc — independent; a /hl:ship wave candidate.
Sequential / human-led: the deploy (single prod action, Dan-approved); F2 S2.1→S2.2 (need users before probing); launch after F2 is green.
Parallel post-deploy: F2 / F3 / F4 tracks run concurrently; only F2 gates the invite.
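The sequencing above can be read as a dependency graph. A sketch using the standard-library `graphlib` (Python 3.9+); the node labels are made up for illustration, and the edges encode only what this plan states (Wave 3 needs all Wave-2 checks, the invite needs only F2).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# node -> the nodes it depends on
DEPENDS_ON: Dict[str, Set[str]] = {
    "wave0a_decide": set(),
    "wave0b_code": {"wave0a_decide"},
    "wave1_deploy": {"wave0b_code"},
    "wave2_f2_isolation": {"wave1_deploy"},
    "wave2_f3_reliability": {"wave1_deploy"},
    "wave2_f4_cost": {"wave1_deploy"},
    "wave3_harness": {"wave2_f2_isolation", "wave2_f3_reliability", "wave2_f4_cost"},
    "wave4_invite": {"wave2_f2_isolation"},
}


def parallel_batches(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into batches; everything inside a batch can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Batching this way treats each batch as a barrier, which is stricter than the plan: the invite can go out as soon as F2 is green, without waiting on F3 or F4.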

Decisions Log

Date	Decision	Rationale	Alternatives Considered
2026-06-07	Beta = individual users (self-orgs), no org infra built	Post-confirm Lambda already 1:1-provisions; cohort is individuals (STEM Racing few + friends/family)	Build/test Team-tier multi-user-org (D21 — deferred)
2026-06-07	F2 verified via 2 real Cognito users on the real cookie path	`AUTRI_DEV_AUTH` is prod-forbidden + a synthetic bypass; real path is the only honest check	dev-auth seam (code-disabled in prod); staging-parity (contradicts "in prod")
2026-06-07	`/api/cache` gets real org-authz (in P0)	Trust-first beta; today it's UUID-obscurity only on user document images	Accept obscurity / defer (rejected for a trust pitch)
2026-06-07	Deploy ships cost-obs, NOT a tenancy fix	`5371370` already live in prod (verified) — the urgency framing was false	— (corrected `decisions.md` L184)
2026-06-07	F6 decoupled from the F2 launch gate; non-blocking	Net-new harness build must not gate the highest-risk check; a `vitest` runner already exists	F6-before-F2 (puts build risk on critical path)
2026-06-07	Account-level Budgets are the beta spend brake; no per-org/rate cap	$50/$100/$200 budgets exist + alert; per-tenant metering is H2	Build a per-org quota (overkill for ~10 friendly users)
2026-06-07	Single-AZ RDS accepted for beta (written)	~10-user beta; deletion-protection + 7-day PITR; do one restore drill	Multi-AZ (unnecessary cost for beta)
2026-06-07	Onboarding folded in, not a feature	Dan owns invites; support = shipped feedback→GitHub button (D55); "launched" = invited AND ≥1 real upload+chat	A standalone onboarding feature (over-scoped)
2026-06-07	Execute human-led + sequential, not `/hl:ship` parallel	Ops/verification, prod-facing, interdependent	Parallel agent fan-out (mismatched)
2026-06-07 (S3)	`/api/cache` fix shape = authenticated org-scoped presigned-URL endpoint (resolves OQ#3)	Build-it-right over a temporary workaround (Dan); the app already serves private content via presigned GET (`/api/feedback/screenshot-url`) + PUT (`/api/kb/[kbId]/upload-url`), so this unifies the asset-serving model rather than adding a pattern. SDK already a dep. ~1.5–2d, low-risk. Touches CDK in `web` + `network-and-data` only — NOT the wedged auth stack (decoupled from the auth reconciliation).	Lambda@Edge viewer-auth (heaviest ops: us-east-1-only, slow global deploy+rollback; edge→RDS org lookup is an anti-pattern); main-Lambda byte passthrough (simplest, but regresses EPIC-4.5 Phase 2's deliberate no-Lambda byte path); CloudFront signed cookies (CDN-edge-cached private images, but needs keypair mgmt + org-prefixed cache keys for marginal benefit on per-user private images)
2026-06-07 (S3)	Auth stack: reconcile to clean state + full clean deploy (NOT leave wedged); welcome-fix rides that deploy (resolves OQ#1 for beta + sets the welcome-fix deploy path)	Dan: get all cloud infra correct, no tail deployment tech debt. Root cause now known = CloudFormation cross-stack export-in-use deadlock on the `app.autri.ai` ACM cert (`auth-and-compute` exports `CertsAppCert`; `web` imports it via `cdn.ts` `props.appCert`) — recurring 4× since 2026-05-27. F1.b diagnostic = `cdk diff autri-auth-and-compute`: if the pending change is just the post-confirm Lambda code (most likely) → a clean `cdk deploy` just works and the welcome-copy fix rides it; if `main` actually moved/removed the cert output → do the standard export-severing sequence (break web's import → deploy web → update auth → re-link), carefully (live cert).	Leave wedged + targeted `aws lambda update-function-code` for the welcome fix (rejected — leaves CFN drift / tail tech debt; the explicit thing Dan wants to avoid)

Risks

Risk	Likelihood	Impact	Mitigation
Cross-user data leak under real traffic	Low (enforced in code)	Catastrophic (trust-first beta)	F2 probe matrix = hard launch gate; test before any real data
`/api/cache` image leak (UUID-obscurity only)	Medium	High (user document images)	F2.S2.4 org-scoped presigned URLs
Wedged auth stack blocks/complicates deploy	Low (beta path excludes it)	Medium	F1.a never touches it; F1.b health-check + explicit scope call
The 10 stranded DLQ jobs = silent loss already happening	Medium	Trust erosion / lost data	F3.S3.1 root-cause + S3.2 surfacing fix
Cost columns uncheckable (RDS private, no bastion)	Medium	F4 blocked	App query-path primary; optional read-only in-VPC Lambda
Alarm channel missed (single email)	Medium (already happened)	Failures sit un-actioned	F3.S3.3 confirmed-subscriber assert + higher-signal route
Harness scope creep (F6)	Medium	Eats timeline	Non-blocking, seeded-only, grows per-epic
Cohort brings non-PDF docs	Medium (friends/family)	Dead source pane	F5 cohort-gating + fast-follow

Known Issues / Tech Debt

Issue	Severity	Notes
Welcome notification points beta users at Claude Desktop / MCP	Low (but first impression)	Post-confirm Lambda's `notifications` row says "Connect Autri to Claude Desktop to get started" → `/help/claude-desktop`; MCP is cut from beta (D56). One-line copy fix in `autri-infra/lambda-handlers/post-confirm`.
`addDocumentsToKb` uses a different guard mechanism than the other 8 mutations	Low	Inline org-`WHERE` + worker-mode early-return; probe separately (F2). `decisions.md` L183 annotated 2026-06-07.
CDK `StackDriftStatus = NOT_CHECKED`	Low	Drift narrative is memory-sourced; F1.S1.1 measures it for real.

Open Questions (genuinely still open)

Most prior open questions were resolved in the blue-team pass (commit scope, rollback, support, success metric, cohort shape, cost brake). Two more were resolved in session 3 (2026-06-07); see Decisions Log. One remains open:

Higher-signal alarm route: is email + a watched inbox enough for beta, or wire Slack/SMS for DLQ/error-rate now? (F3.S3.3.) Still open.

Resolved in session 3, kept for the record:

Auth-stack reconciliation (leave wedged or fix?). RESOLVED (S3): reconcile to clean state + full clean deploy (root cause = cert export-in-use deadlock; F1.b diagnostic = cdk diff autri-auth-and-compute). No tail tech debt.
/api/cache fix shape. RESOLVED (S3): authenticated org-scoped presigned-URL endpoint (unifies with existing presigned GET/PUT; touches web + network-and-data CDK only).

BLUE-TEAMED 2026-06-07. Wave-0a decisions locked session 3 (2026-06-07). Ready for Wave-0b code + human-led execution. F2 is the launch gate.
