Epic: Beta Launch (deploy current main, verify it holds in prod, invite the first users)

BLUE-TEAMED 2026-06-07 (drafted, fresh-agent ultracode red-team, then scope-cut with Dan). This is P0 on the roadmap (Horizon 1). It is not a build — the web app is already deployed and feature-complete at app.autri.ai. P0 is deploy current main → verify the trust-critical things hold in prod → invite the first individual users → measure. Mostly ops/verification, sequential. Ladders to North Star B4 (tiered security), B7 (dependable), B2 (know-cost-cold), B1 (trustable retrieval).

Verification stance (the spine of this epic). Every acceptance criterion names a deterministic check and an owner: 🤖 AI-mechanized (I close the loop myself before review) or 🧑 human-judgment (Dan's smoke test / qualitative call). The rule: the mechanized loop closes first; the human smoke-test is the backstop, not the primary. A criterion with no checking mechanism is itself a red-team finding.

Beta shape (locked 2026-06-07). The beta is individual users, not organizations — a couple of people from STEM Racing plus family/friends. Each signup auto-provisions its own org (the Cognito post-confirm Lambda inserts org + user + Personal library; users.organization_id = users.id = Cognito sub, a strict 1:1 self-org). So "an individual user" is a tenant — we build zero org infrastructure for the beta. The Team-tier multi-user-shared-org path (D21: library_access grants, invites, admin roles) exists in schema but the beta does not exercise it (see Non-Goals).
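The 1:1 self-org invariant described above can be written as a small check. This is a minimal sketch, not the project's code: the `UserRow` shape, the `cognito_sub` field, and both function names are hypothetical, and the real schema may store the Cognito sub differently.

```typescript
// Hypothetical row shape: only the fields the invariant mentions.
interface UserRow {
  id: string;
  organization_id: string;
  cognito_sub: string; // assumption: the sub is available alongside the row
}

// True when users.organization_id = users.id = Cognito sub (strict 1:1 self-org).
function isSelfOrg(row: UserRow): boolean {
  return row.id === row.cognito_sub && row.organization_id === row.id;
}

// Returns the ids of rows that break the invariant (empty array = all rows are self-orgs).
function findSelfOrgViolations(rows: UserRow[]): string[] {
  return rows.filter((r) => !isSelfOrg(r)).map((r) => r.id);
}
```

A check like this could run against the two F2 test accounts right after signup, as a cheap confirmation that the post-confirm Lambda provisioned what the beta assumes.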

Features & Stories

Each feature carries a Verification mechanism line (🤖 AI / 🧑 human).

Design

Data Model Changes

None. Deploys migrations 013/014 (additive, already written). The /api/cache fix is authorization logic + presigned-URL issuance, no schema change.


Context

Overview

Goals & Non-Goals

Goals:

  • Deploy current main to prod (web + migrations), shipping cost-observability (D66) + UI fixes. (Not a security ship — see correction below.)
  • Prove the trust-critical invariants hold in prod with ≥2 real individual users: cross-user data isolation across every surface, and no silent data loss.
  • Confirm cost observability works on the real deployed worker (the "measure first" premise).
  • Seed a behavioral regression harness (F6) with the launch-critical checks — non-blocking, future-proofs the work.
  • Invite the first individual users and start collecting cost + quality + behavior data.

Non-Goals:

  • Not a feature build. No new product capability beyond what's on main (the cohort-gated source view F5 is the one conditional exception).
  • Not the Team-tier multi-user-org path (D21) — invites, library_access grants between users, admin roles. Beta = individual self-orgs; F2 tests cross-individual isolation only. Don't over-build org infra.
  • Not per-org / per-user rate or cost caps. Account-level AWS Budgets ($50/$100/$200, email-notified) are the spend brake; per-tenant metering is H2 (D18). Accepted risk for a ~10-user friendly beta: a looping user is alerted-on, not throttled.
  • Not commerce (Stripe/tiers/metering — H2), not MCP (cut, D56), not the Horizon-0 quality cluster, not enterprise (H3).

Problem Statement

The platform is built and deployed, but the beta is not launched because the trust-critical things have never been exercised under real conditions:

  1. Prod is behind main — missing cost-observability. (It is not missing a tenancy fix — see the correction below; that framing was wrong.) Deploying is small but the auth stack is in a failed-deploy state that must be understood first (F1).
  2. Cross-user isolation is enforced in code but never tested with ≥2 real users in prod. A leak between users in a trust-first beta is catastrophic — the single highest real risk. Enforced-in-code ≠ verified.
  3. Failures may not surface. Monitoring exists, but 10 jobs sit in dead-letter queues (3 ingest + 7 extract, in ALARM since 2026-05-31, ~1 week un-actioned) — and we don't know whether those failures surfaced to the uploaders or vanished. The alarm channel is a single un-escalated email, which is why they sat unnoticed.
  4. Cost columns are worker-written — a local CLI ingest doesn't populate them, so D66 is verified only by seed, never by a real deployed run.

What Is This Epic?

The deploy-verify-invite effort that turns a live-but-unlaunched platform into a running beta. Ship current main, run a fixed verification battery against prod (cross-user probe matrix, induced-failure surfacing, real-ingest cost), then invite the first individual users. Output = evidence the beta is safe for real user data, plus a seeded regression harness — not new product code.


Dependents

  • Everything in Horizon 2+ unblocks on real beta usage data this epic produces (pricing B5, API+dogfood B3/B6, enterprise B4).
  • Every future epic extends F6's harness with its own cases.

Dependencies

  • The deployed platform (EPIC-5, 2026-06-01): Cognito + Google auth, email allowlist, per-user self-orgs, multi-tenancy in code, upload→inspect→chat→cost UI. Verified, not built.
  • AWS self-verification access (autri-prod CLI profile, read across all 5 stacks/RDS/CloudWatch/SQS/Budgets — documented in CLAUDE.md). Makes F1/F3/F4 self-checkable. Gap: RDS is private and no SSM/bastion exists — DB-level checks need a new in-VPC read Lambda or the app's query path (F4).
  • deploy.sh / deploy-web.sh: deploy.sh all runs migrate→web→ingestion→monitoring and explicitly excludes the auth stack (it's drift-blocked). Migrations auto-apply via the autri-db-migrate Lambda.
  • CDK stacks: autri-auth-and-compute is in UPDATE_ROLLBACK_COMPLETE (failed 2026-06-01 deploy); the beta payload doesn't touch it (F1.b).
  • Real Cognito signups — the F2 two-user provisioning path (replaces the prod-forbidden AUTRI_DEV_AUTH).
  • Crucible + the existing vitest runner (incl. app/__tests__/scope.test.ts) — the F6 harness substrate.
  • The shipped in-app feedback→GitHub-issues button (D55) — the named beta support channel.

Current State

  • Prod: app.autri.ai, live. Deployed web image 0d22bbb (2026-06-02). Delta to main = 59 commits, but only 5 touch app/ (the deploy-web.sh surface); the other 54 are eval/ingestion/docs that don't deploy here. Migrations 013/014 ship via deploy.sh migrate.
  • Tenancy fix already live (CORRECTION). The query-playground org-scope fix (5371370, D13's "last read-leak surface") is an ancestor of the deployed image — it shipped 2026-05-29. The deploy does not ship it. (decisions.md self-contradicted on this — L184 fixed 2026-06-07. Same stale-doc trap as last session, inverted; caught by a fresh agent checking git against the live Lambda.)
  • Multi-tenancy (D13): read / mutation / query-playground enforcement shipped (org-scoped lookups, not-found on cross-org). 8 of 9 mutations use requireKbAccess; addDocumentsToKb uses an inline org-WHERE + worker-mode early-return (two mechanisms). Never validated by real multi-user traffic — the graduation gate.
  • Auth stack wedged: autri-auth-and-compute = UPDATE_ROLLBACK_COMPLETE, rollback blocked 3× by the cert export. The drift is MCP/AgentCore infra — which D56 cut from beta. CloudFormation StackDriftStatus = NOT_CHECKED (drift is asserted from memory, never measured).
  • Monitoring (autri-monitoring): alarms for DLQ depth, ingestion-degraded, postgres-connections, web/chat error-rate; all route to one SNS topic with a single email subscriber. RDS: single-AZ db.t4g.small (MultiAZ=false), 7-day backups + PITR, deletion-protection on. Restore (RTO) never tested.
  • Cost brakes exist: 3 monthly AWS Budgets ($50/$100/$200, ACTUAL+FORECASTED email alerts). No per-org/rate cap.
  • Cost observability (D66): rate table + per-doc stage-breakdown columns (013) + per-query Sonnet cost (014). Not deployed — rides this deploy; worker-written, so only a real deployed ingest exercises it.

Affected Systems

System / Layer | How It's Affected
deploy.sh / deploy-web.sh | The beta deploy path (web+migrations); never touches the wedged auth stack; Docker prune first (ENOSPC)
Migrations 013/014 | Auto-apply via autri-db-migrate; verified additive (forward-only is safe)
autri-auth-and-compute | Health-checked (logins work post-rollback?), then a deliberate in/out-of-beta-scope call; NOT auto-touched
Multi-tenancy enforcement | Verified, not changed; probe matrix exercises every guarded surface (incl. /api/cache, addDocumentsToKb)
/api/cache/[...path] (page images/figures) | Gets org-authorization (currently UUID-obscurity only); prod fix via org-scoped presigned URLs
Ingestion worker + DLQs | Cost-column writes + failure-surfacing verified; the 10 stranded messages investigated
autri-monitoring / SNS | Alarm→subscription wiring asserted; higher-signal route for DLQ/error-rate
Crucible / vitest runner | The F6 harness substrate (non-visual cases in vitest, visual in Crucible)
Inspector source pane (F5) | Only if the cohort brings non-PDF docs (more likely now: friends/family, not just STEM Racing PDFs)

Approach

Sequential, gated on the deploy. The isolation gate (F2) is decoupled from the harness build (F6) so net-new harness work never blocks launch.

  1. Deploy current main (F1.a) — Docker prune, deploy.sh all (web + migrations), smoke. Separately, health-check the wedged auth stack (F1.b) and decide if beta needs it touched (likely not — MCP is cut).
  2. Verify isolation (F2) — sign up 2 real individual users (real Cognito/Google, real cookies), run the cross-user probe matrix as a minimal scripted check on the real auth path. Hard launch gate — must pass before any real user data goes in. Folding it into F6 is a fast-follow, not a prerequisite.
  3. Verify reliability (F3) — investigate the DLQ backlog, prove failures surface, confirm monitoring is enough.
  4. Verify cost (F4) — one real ingest through the deployed worker; confirm columns populate.
  5. (Conditional) source view (F5) — only if the cohort brings non-PDF docs.
  6. Seed the regression harness (F6) — non-blocking; turns F2/F3/F4 checks into permanent cases.

Beta is "launched" when: F1–F4 green, the first users are invited (Dan), AND ≥1 real user has uploaded + chatted. Green infra alone is not "launched."
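The launch definition above is a conjunction, which a short predicate makes explicit. A sketch only: the `LaunchState` shape and field names are hypothetical and exist purely to illustrate the rule.

```typescript
type FeatureId = "F1" | "F2" | "F3" | "F4";

// Hypothetical snapshot of launch progress.
interface LaunchState {
  green: Record<FeatureId, boolean>;
  firstUsersInvited: boolean;             // Dan has sent the invitations
  realUsersWhoUploadedAndChatted: number; // count of real users who did both
}

const REQUIRED_FEATURES: readonly FeatureId[] = ["F1", "F2", "F3", "F4"];

// "Launched" = F1–F4 green AND first users invited AND >= 1 real user uploaded + chatted.
function isLaunched(state: LaunchState): boolean {
  const infraGreen = REQUIRED_FEATURES.every((f) => state.green[f]);
  return infraGreen && state.firstUsersInvited && state.realUsersWhoUploadedAndChatted >= 1;
}
```

Note that green infrastructure with zero active users evaluates to false, matching the statement that green infra alone is not "launched."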

Onboarding/support (folded in, not a feature): Dan owns invitations. Support = the shipped in-app feedback→GitHub-issues button (D55) — named here, not built; someone watches the issues. First-run sanity is covered by F2.S2.1 (signing up the 2 real users is the first-run check).

This epic is a poor fit for /hl:ship parallel waves: the work is sequential, ops-heavy, and prod-facing. Execute it human-led.

Key methodology — the cross-user isolation probe matrix (F2)

The invariant (D13): for every guarded surface, a request from user A for user B's resource returns "not found" — same shape as a genuinely missing resource — no existence leak, no data. Run as user A against user B's real resources, on the real authenticated path (real Cognito cookies, never mocked — AUTRI_DEV_AUTH is statically compiled out in prod and is a synthetic bypass anyway):

Surface | Probe (as user A, targeting user B's resource) | Expected
KB read (/kb/[kbId]) | open B's kbId | not-found, no leak
Doc read (/docs/[id]) | open B's docId | not-found
Chat (/api/chat) | query against B's kbId | not-found / no cross-user chunks
Query playground (/docs/[id]/query) | run query on B's docId | not-found
Mutations, requireKbAccess (8 of 9) | rename/delete/approve/etc. on B's kbId/docId | not-found, no mutation
Mutation, addDocumentsToKb (separate mechanism) | add docs to B's KB / add B's docId to your KB | not-found (probe its actual prod early-return behavior)
kbId+docId mix-and-match | your kbId, B's docId | not-found (knowledge_base_id = kbId defense)
/api/cache/[...path] (page images/figures) | GET <B-docId>/page-1.png while authed as A | not-found, not image bytes (the new authz; see F2.S2.4)
Retrieval (vector/fts/lookup) | org-blind by design, so no independent runtime probe | verified transitively via the Chat/query guards + a white-box assertion that no caller passes an unvalidated kbId

Also confirm the positive case — user A fully accesses its own resources — so we're testing isolation, not a blanket 404.
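One way to make each matrix cell deterministic is to compare three captured responses per surface: the cross-user request, a request for an id that genuinely does not exist, and the self-access positive control. The sketch below is illustrative only. The types, verdict names, and `judgeCell` are hypothetical; it assumes the probe script records the final status after following redirects, and that not-found bodies do not echo the requested id (otherwise the bodies would need normalizing before comparison).

```typescript
// Hypothetical summary of one HTTP response captured by the probe script.
interface ProbeResponse {
  status: number; // final status, after redirects are followed
  body: string;
}

interface ProbeCell {
  surface: string;
  crossUser: ProbeResponse;  // user A requests user B's resource
  missing: ProbeResponse;    // user A requests an id that does not exist
  selfAccess: ProbeResponse; // user A requests its own resource (positive control)
}

type Verdict = "pass" | "data-leak" | "existence-leak" | "self-access-failed";

const is2xx = (r: ProbeResponse): boolean => r.status >= 200 && r.status < 300;

function judgeCell(cell: ProbeCell): Verdict {
  // Positive control first: a blanket 404 would otherwise look like perfect isolation.
  if (!is2xx(cell.selfAccess)) return "self-access-failed";
  // A success status on the cross-user request means something was served.
  if (is2xx(cell.crossUser)) return "data-leak";
  // The refusal must be indistinguishable from a genuinely missing resource.
  const sameShape =
    cell.crossUser.status === cell.missing.status && cell.crossUser.body === cell.missing.body;
  return sameShape ? "pass" : "existence-leak";
}

// Surfaces whose cell did not pass; an empty list means the matrix is green.
function failingSurfaces(cells: ProbeCell[]): string[] {
  return cells.filter((c) => judgeCell(c) !== "pass").map((c) => c.surface);
}
```

The Chat row ("no cross-user chunks") would likely need its own content-level check, since a chat response can succeed while still being correctly scoped.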

F1 — Deploy current main (+ assess the wedged auth stack) → B2 · S–M · 🧱

F1.a — Routine deploy (ships the beta payload; never touches auth):

Story | Summary | Acceptance
S1.1 | cdk diff / detect-stack-drift of the 5 app/ commits' deploy surface; confirm scope (not "59 commits") | Real diff artifact reviewed; drift measured, not assumed
S1.2 | Prune Docker; deploy.sh all (web + migrations 013/014) | Prod at main's SHA; migrations applied; no ENOSPC
S1.3 | Post-deploy render smoke vs prod (Crucible) | Auth + upload + inspect + chat + cost-display render green
S1.4 | Document + dry-run rollback (re-tag web Lambda to prior ECR image; record that 013/014 are additive → no DB rollback needed) | A tested rollback path exists, not an open question

Verification mechanism: 🤖 cdk diff + prod-SHA check + Crucible render smoke + a rollback dry-run. 🧑 Dan approves the actual deploy.sh run (outward-facing).

F1.b — Auth-stack health (investigate, then decide):

Story | Summary | Acceptance
S1.5 | Confirm the wedged auth stack left logins functionally intact; decide if beta needs it reconciled | Login/signup verified working; an explicit "auth reconciliation is / isn't in beta scope" decision recorded (MCP cut → likely out)

Verification mechanism: 🤖 describe-stacks state + an end-to-end login check. 🧑 Dan's call on whether to do the manual cert-re-pin surgery now or defer (non-launch-gating).

F2 — Verify cross-user isolation in prod with ≥2 real users → B4 · S–M · 🧱 · HIGHEST RISK / LAUNCH GATE

Story | Summary | Acceptance
S2.1 | Sign up 2 real individual users (real Google/Cognito); each gets a self-org + KB + doc | Two real auth contexts; doubles as the first-run onboarding check (fresh signup lands on a usable empty state)
S2.2 | Scripted cross-user probe matrix on the real cookie path | Every cross-user probe → not-found, no leak; positive (self-access) works; documented
S2.3 | White-box assertion: no production caller passes an unvalidated kbId into retrieval | grep/code-review confirms retrieval's org-blindness is always guarded one layer up
S2.4 | Add org-authorization to /api/cache (page images/figures): resolve docId→org against the session; org-scoped presigned URLs in prod | Cross-user image GET returns not-found, not bytes; matrix row green

Verification mechanism: 🤖 the probe matrix (deterministic per cell) + the white-box grep + the /api/cache probe — re-runnable forever. 🧑 the 2-account setup is an honest human task; a one-time review that the suite hits the real path.

F3 — Verify failures surface + monitoring is sufficient → B7 · S · 🧱

Story | Summary | Acceptance
S3.1 | Investigate the live DLQ backlog (3 ingest + 7 extract, in ALARM since 2026-05-31) | Root cause known (note: some are planted e2e test data); determined whether real failures surfaced or vanished
S3.2 | Verify failed-ingest surfacing end-to-end | An induced bad ingest shows a terminal error state to the user; never silently vanishes or hangs in "processing"
S3.3 | Confirm monitoring routing + accept availability posture | Each alarm has an SNS action with a confirmed subscriber (deterministic); a human confirms they read it; single-AZ accepted as a written decision; one backup-restore drill records RTO

Verification mechanism: 🤖 induced-failure harness case, sqs get-queue-attributes, describe-alarms AlarmActions + sns list-subscriptions-by-topic, describe-db-instances backup config. 🧑 someone confirms the alarm channel is watched (consider a higher-signal route for DLQ/error-rate); the restore drill.
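The "each alarm has an SNS action with a confirmed subscriber" assertion can be computed from the JSON that `describe-alarms` and `sns list-subscriptions-by-topic` return. The sketch below works on already-parsed output and makes no AWS calls. The trimmed interfaces and the function name are hypothetical; it relies on SNS reporting an unconfirmed subscription with the placeholder `PendingConfirmation` in place of a real ARN, so only values starting with `arn:` are counted as confirmed.

```typescript
// Trimmed shapes of the CLI output; only the fields this check reads.
interface MetricAlarmSummary {
  AlarmName: string;
  AlarmActions: string[]; // SNS topic ARNs
}

interface SnsSubscription {
  TopicArn: string;
  SubscriptionArn: string; // a real ARN once confirmed, a placeholder string otherwise
  Protocol: string;
  Endpoint: string;
}

// Names of alarms that have no action pointing at a topic with a confirmed subscriber.
function alarmsWithoutConfirmedSubscriber(
  alarms: MetricAlarmSummary[],
  subscriptions: SnsSubscription[],
): string[] {
  const topicsWithConfirmedSub = new Set(
    subscriptions
      .filter((s) => s.SubscriptionArn.startsWith("arn:"))
      .map((s) => s.TopicArn),
  );
  return alarms
    .filter((a) => !a.AlarmActions.some((arn) => topicsWithConfirmedSub.has(arn)))
    .map((a) => a.AlarmName);
}
```

An empty result covers only the deterministic half of S3.3; whether anyone reads the inbox remains the human check.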

F4 — Real worker cost verification → B2 · S · 🔀

Story | Summary | Acceptance
S4.1 | One real ingest of a structured/figure-heavy doc (e.g. an FIA technical PDF) through the deployed worker | Per-doc cost columns + stage breakdown populate; cost is order-of-magnitude plausible for that doc type (not a blanket "S6 range"); figure-vision cost is NON-zero (the D66 regression this must prove fixed)
S4.2 | One real chat query | Per-query Sonnet cost populates; bound checked against pricing.ts directly (S6 has no query baseline); inspector cost line renders

Verification mechanism: 🤖 read the cost columns via the app's in-VPC query path (primary — no SSM/bastion exists) or a tiny read-only in-VPC Lambda (if a DB-level assert is wanted, its own approved story); Crucible asserts the inspector cost line renders. 🧑 sanity-check the dollar figures.
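The S4.1 acceptance criteria reduce to three assertions on one row: the columns are populated, the figure-vision stage cost is non-zero, and the total is within an order of magnitude of an expected figure. A hedged sketch follows. The column names (`total_cost_usd`, `stage_costs_usd`, `figure_vision`) are invented for illustration and are not the real migration-013 column names, and `expectedUsd` is whatever estimate the reviewer derives for that doc type.

```typescript
// Hypothetical projection of a document's cost columns.
interface DocCostRow {
  total_cost_usd: number | null;
  stage_costs_usd: Record<string, number> | null; // per-stage breakdown
}

// Returns a list of problems; an empty list means the row satisfies S4.1.
function checkDocCost(row: DocCostRow, expectedUsd: number): string[] {
  const problems: string[] = [];

  if (row.total_cost_usd === null) {
    problems.push("total cost not populated");
  } else {
    // Order-of-magnitude plausibility: within 10x either way of the estimate.
    const ratio = row.total_cost_usd / expectedUsd;
    if (!(ratio >= 0.1 && ratio <= 10)) problems.push("total cost outside 10x of estimate");
  }

  if (row.stage_costs_usd === null) {
    problems.push("stage breakdown not populated");
  } else {
    const vision = row.stage_costs_usd["figure_vision"] ?? 0;
    if (!(vision > 0)) problems.push("figure-vision cost is zero");
  }

  return problems;
}
```

Returning a list instead of a boolean keeps the failure readable when the same check is later seeded into the F6 harness.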

F5 — (Cohort-gated) Source view for all doc types → B1 · S · 🔀

Story | Summary | Acceptance
S5.0 | Determine first-cohort doc types | More likely to trigger now: friends/family may bring docx/prose, not just STEM Racing PDFs. Non-PDF present → F5 is P0; PDF-only → fast-follow
S5.1 | (If triggered) render source pane for non-PDF doc types | docx/prose/markdown show a usable source view, not a dead pane

Verification mechanism: 🤖 headless Preview snapshot of a non-PDF source pane, assert non-empty. 🧑 visual judgment it's usable.

F6 — Verification & Regression Harness (cross-cutting; non-blocking; seeded) → B7, B1 · M · 🔀

Story | Summary | Acceptance
S6.1 | Non-visual cases (F2 status codes, F3 terminal-state row, F4 cost-column non-null) in the existing vitest runner against a deployed env with real cookies; visual cases (F4 cost line, F5 source pane) in Crucible | The right tool per case type (answers the OQ#7 "is Crucible right?": split it)
S6.2 | Seed with this epic's launch-critical cases | F2 matrix, F3 surfacing, F4 cost all encoded
S6.3 | Wire into the loop + convention | Runnable per-branch / pre-deploy; "add a case when you ship a feature" documented

Verification mechanism: 🤖 harness self-reports green; deterministic cases. 🧑 one-time faithfulness check (real path, not mocks) + Dan's standing smoke-test as the backstop layer.


Stories

Story | Summary | Status | PR
F1.a S1.1–S1.3 | Deploy main + diff + smoke | Not started (Wave 1, human-led) |
F1.a S1.4 | Document + dry-run rollback | Doc MERGED (autri-infra/docs/rollback.md); dry-run is a Wave-1 prod action | autri-infra#4
F1.b S1.5 | Auth-stack health → in/out-of-scope decision | Not started. Root cause now evidenced: cdk diff shows autri-auth-and-compute has a pending [-] removal of the CertsAppCert export that autri-web imports → the export-in-use deadlock behind UPDATE_ROLLBACK_COMPLETE. Wave-1 needs the export-severing sequence. |
F2 S2.1–S2.3 | Sign up 2 real users + cross-user probe matrix + white-box retrieval assertion | Not started (Wave 2, post-deploy) |
F2 S2.4 | /api/cache org-authz | MERGED to main (Wave 0b). Cross-user probe is still Wave 2. | autri#63 + autri-infra#2
F3 S3.1–S3.3 | DLQ backlog + surfacing + monitoring/availability | Not started |
F4 S4.1–S4.2 | Real worker + query cost (figure-vision non-zero) | Not started |
F5 S5.0–S5.1 | (Cohort-gated) non-PDF source view | Not started |
F6 S6.1–S6.3 | Behavioral regression harness (non-blocking, seeded) | Not started (S6.1 scaffold = remaining Wave-0b) |
(Known-issue polish) | Welcome-notification copy + dead /help/claude-desktop link | MERGED to main (rides Wave-1 deploy; lives in post-confirm Lambda) | autri-infra#3

S2.4 realization (session 3): instead of presigning at the inspector + chat server seams (would inject ~500-char signed URLs into the model's chat context every query + cause TTL staleness), the existing /api/cache/[...path] route became the single authz chokepoint — session + doc→org check, then in prod a 307-redirect to a short-lived presigned S3 GET; CloudFront behavior 5 + the cache-bucket OAC grant removed so there is no unauthenticated edge path. Org-authz is prod-only (CACHE_BUCKET-gated); dev still serves local files (CLI keys by filename slug, not UUID). Adversarial security review found no constructable cross-org read; cdk diff confirmed the Web + NetworkAndData changeset with autri-auth-and-compute untouched (decoupled from F1.b). The api-cache app + infra halves must deploy together (Wave 1) — see autri-infra/docs/rollback.md.
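The chokepoint logic described above (session check, then doc→org check, then presign and redirect) can be isolated as a pure decision function. This is an illustrative sketch, not the merged route: `decideCacheAccess`, the `CacheDecision` type, and the choice to answer not-found for a missing session are all assumptions. It also assumes the first path segment is the docId, as in the `<B-docId>/page-1.png` probe. The caller would resolve the two org ids, and on a `presign` decision issue the short-lived presigned S3 GET and respond with a 307.

```typescript
type CacheDecision =
  | { kind: "not-found" }              // same answer for unauthenticated, unknown doc, and cross-org
  | { kind: "presign"; key: string };  // caller presigns this key and 307-redirects to it

function decideCacheAccess(
  sessionOrgId: string | null,  // null when there is no valid session
  docOwnerOrgId: string | null, // null when the docId resolves to no document
  pathSegments: string[],       // e.g. ["<docId>", "page-1.png"]
): CacheDecision {
  const docId = pathSegments[0];
  // Refuse empty paths and anything that tries to climb out of the doc's prefix.
  const unsafe = pathSegments.some((s) => s === "" || s === "." || s === "..");
  if (!docId || unsafe) return { kind: "not-found" };
  if (sessionOrgId === null || docOwnerOrgId === null) return { kind: "not-found" };
  // Cross-org requests get the same shape as a missing resource: no existence leak.
  if (sessionOrgId !== docOwnerOrgId) return { kind: "not-found" };
  return { kind: "presign", key: pathSegments.join("/") };
}
```

Keeping the decision separate from the S3 call means the F2 matrix row for /api/cache has a unit-level twin that needs no AWS access.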


Execution Plan (waves + sequencing)

Locked with Dan 2026-06-07. The organizing principle: separate code-work (parallelizable, pre-deploy) from verification-work (post-deploy, the gate). That split is what makes Wave 0 a /hl:ship candidate while the deploy + verification waves stay human-led and sequential.

Critical-path spine: decide /api/cache fix-shape → code /api/cache + welcome fixes → merge → clean deploy → F2 (S2.1→S2.2) → invite first users. Everything else hangs off this in parallel.

Wave 0 — Pre-deploy prep (parallel; zero prod risk; the /hl:ship-able subset)

Two micro-stages because two tasks need a decision before coding:

0a — Decide, before coding (small, fast):

  • /api/cache fix-shape (resolves OQ#3). In prod, page images/figures are served S3→CloudFront with no Lambda in the read path, so editing route.ts only fixes the dev path. Real org-auth needs a design call: an authenticated endpoint issuing org-scoped presigned URLs vs. a CloudFront function / Lambda@Edge doing the check. This may touch CDK, not just the web image — so the "clean deploy" could carry an infra change. Decide the shape first.
  • F1.b S1.5 — auth-stack health + welcome-fix deploy path (elevated to Wave 0 because it gates the welcome fix). The welcome notification lives in the post-confirm Lambda, which is part of the wedged auth-and-compute stack. Determine whether it can be updated via a targeted aws lambda update-function-code (sidestepping the wedged CloudFormation update) or must wait on auth reconciliation. Also confirm logins work post-rollback.

0b — Code (parallel, fan-out-ready):

  • F2 S2.4 (code) — /api/cache org-authz fix per the 0a shape. The trust-critical one; must land in main to ride the deploy.
  • Welcome-message fix — repoint the notification body + link off the dead /help/claude-desktop (cut MCP feature) to a real first-run action (e.g. upload page). Deploy path per 0a.
  • (Optional polish, non-gating) Mascot 404 page: not-found.tsx using the WIP golden-retriever SVGs in dogs/, with the in-app feedback→GitHub button (D55) that auto-captures the attempted URL + referrer so a 404 report reads "user hit /help/claude-desktop", not a generic "broken". Closes the feedback loop on whatever lands users there. Slate if Wave 0 has room; otherwise fast-follow.
  • F6 S6.1 (scaffold) — stand up the harness (non-visual cases in the existing vitest runner; visual in Crucible). Non-blocking.
  • F1.a S1.4 — document the rollback procedure.

→ Merge Wave-0 code to main. cdk diff (F1.a S1.1) runs after this merge — the deploy now ships more than current main, so the diff must reflect what actually goes out.

Wave 1 — Deploy (sequential, human-led; the gate to prod state)

F1.a S1.2 (deploy.sh all — main + Wave-0 code + migrations 013/014; Docker prune first) → S1.3 (post-deploy Crucible render smoke). Dan approves the actual deploy. The welcome-fix and any /api/cache infra change ride here (or via the targeted Lambda update from 0a).

Wave 2 — Post-deploy verification (3 parallel tracks)

Track | Stories | Launch gate?
F2 isolation | S2.1 (sign up 2 real users) → S2.2 (cross-user probe matrix, incl. now-fixed /api/cache) → S2.3 (white-box retrieval assertion) | YES, hard gate
F3 reliability | S3.1 (DLQ root-cause) ∥ S3.2 (failure surfacing) ∥ S3.3 (monitoring routing + single-AZ decision + restore drill) | no
F4 cost | S4.1 (real ingest, non-zero vision cost) ∥ S4.2 (query cost) | no

F5 S5.0 (determine cohort doc types) runs anytime here; gates the conditional S5.1.

Wave 3 — Harness consolidation (non-blocking fast-follow)

F6 S6.2 (seed the verified F2/F3/F4 checks as permanent cases) → S6.3 (wire into the loop). Depends on Wave-2 checks existing; does not gate launch.

Wave 4 — Launch

Dan invites the first individual users (a couple from STEM Racing + friends/family) → confirm ≥1 real user has uploaded + chatted = "launched." F5 S5.1 only if the cohort brought non-PDF docs.

Parallelism summary

  • Fan-out-ready (Wave 0b code): /api/cache fix · welcome fix · 404 page · harness scaffold · rollback doc — independent; a /hl:ship wave candidate.
  • Sequential / human-led: the deploy (single prod action, Dan-approved); F2 S2.1→S2.2 (need users before probing); launch after F2 is green.
  • Parallel post-deploy: F2 / F3 / F4 tracks run concurrently; only F2 gates the invite.

Decisions Log

Date | Decision | Rationale | Alternatives Considered
2026-06-07 | Beta = individual users (self-orgs), no org infra built | Post-confirm Lambda already 1:1-provisions; cohort is individuals (STEM Racing few + friends/family) | Build/test Team-tier multi-user-org (D21, deferred)
2026-06-07 | F2 verified via 2 real Cognito users on the real cookie path | AUTRI_DEV_AUTH is prod-forbidden + a synthetic bypass; real path is the only honest check | dev-auth seam (code-disabled in prod); staging-parity (contradicts "in prod")
2026-06-07 | /api/cache gets real org-authz (in P0) | Trust-first beta; today it's UUID-obscurity only on user document images | Accept obscurity / defer (rejected for a trust pitch)
2026-06-07 | Deploy ships cost-obs, NOT a tenancy fix | 5371370 already live in prod (verified); the urgency framing was false | None (corrected decisions.md L184)
2026-06-07 | F6 decoupled from the F2 launch gate; non-blocking | Net-new harness build must not gate the highest-risk check; a vitest runner already exists | F6-before-F2 (puts build risk on critical path)
2026-06-07 | Account-level Budgets are the beta spend brake; no per-org/rate cap | $50/$100/$200 budgets exist + alert; per-tenant metering is H2 | Build a per-org quota (overkill for ~10 friendly users)
2026-06-07 | Single-AZ RDS accepted for beta (written) | ~10-user beta; deletion-protection + 7-day PITR; do one restore drill | Multi-AZ (unnecessary cost for beta)
2026-06-07 | Onboarding folded in, not a feature | Dan owns invites; support = shipped feedback→GitHub button (D55); "launched" = invited AND ≥1 real upload+chat | A standalone onboarding feature (over-scoped)
2026-06-07 | Execute human-led + sequential, not /hl:ship parallel | Ops/verification, prod-facing, interdependent | Parallel agent fan-out (mismatched)
2026-06-07 (S3) | /api/cache fix shape = authenticated org-scoped presigned-URL endpoint (resolves OQ#3) | Build-it-right over a temporary workaround (Dan); the app already serves private content via presigned GET (/api/feedback/screenshot-url) + PUT (/api/kb/[kbId]/upload-url), so this unifies the asset-serving model rather than adding a pattern. SDK already a dep. ~1.5–2d, low-risk. Touches CDK in web + network-and-data only, NOT the wedged auth stack (decoupled from the auth reconciliation). | Lambda@Edge viewer-auth (heaviest ops: us-east-1-only, slow global deploy+rollback; edge→RDS org lookup is an anti-pattern); main-Lambda byte passthrough (simplest, but regresses EPIC-4.5 Phase 2's deliberate no-Lambda byte path); CloudFront signed cookies (CDN-edge-cached private images, but needs keypair mgmt + org-prefixed cache keys for marginal benefit on per-user private images)
2026-06-07 (S3) | Auth stack: reconcile to clean state + full clean deploy (NOT leave wedged); welcome-fix rides that deploy (resolves OQ#1 for beta + sets the welcome-fix deploy path) | Dan: get all cloud infra correct, no tail deployment tech debt. Root cause now known = CloudFormation cross-stack export-in-use deadlock on the app.autri.ai ACM cert (auth-and-compute exports CertsAppCert; web imports it via cdn.ts props.appCert), recurring 4× since 2026-05-27. F1.b diagnostic = cdk diff autri-auth-and-compute: if the pending change is just the post-confirm Lambda code (most likely) → a clean cdk deploy just works and the welcome-copy fix rides it; if main actually moved/removed the cert output → do the standard export-severing sequence (break web's import → deploy web → update auth → re-link), carefully (live cert). | Leave wedged + targeted aws lambda update-function-code for the welcome fix (rejected: leaves CFN drift / tail tech debt; the explicit thing Dan wants to avoid)

Risks

Risk | Likelihood | Impact | Mitigation
Cross-user data leak under real traffic | Low (enforced in code) | Catastrophic (trust-first beta) | F2 probe matrix = hard launch gate; test before any real data
/api/cache image leak (UUID-obscurity only) | Medium | High (user document images) | F2.S2.4 org-scoped presigned URLs
Wedged auth stack blocks/complicates deploy | Low (beta path excludes it) | Medium | F1.a never touches it; F1.b health-check + explicit scope call
The 10 stranded DLQ jobs = silent loss already happening | Medium | Trust erosion / lost data | F3.S3.1 root-cause + S3.2 surfacing fix
Cost columns uncheckable (RDS private, no bastion) | Medium | F4 blocked | App query-path primary; optional read-only in-VPC Lambda
Alarm channel missed (single email) | Medium (already happened) | Failures sit un-actioned | F3.S3.3 confirmed-subscriber assert + higher-signal route
Harness scope creep (F6) | Medium | Eats timeline | Non-blocking, seeded-only, grows per-epic
Cohort brings non-PDF docs | Medium (friends/family) | Dead source pane | F5 cohort-gating + fast-follow

Known Issues / Tech Debt

Issue | Severity | Notes
Welcome notification points beta users at Claude Desktop / MCP | Low (but first impression) | Post-confirm Lambda's notifications row says "Connect Autri to Claude Desktop to get started" → /help/claude-desktop; MCP is cut from beta (D56). One-line copy fix in autri-infra/lambda-handlers/post-confirm.
addDocumentsToKb uses a different guard mechanism than the other 8 mutations | Low | Inline org-WHERE + worker-mode early-return; probe separately (F2). decisions.md L183 annotated 2026-06-07.
CDK StackDriftStatus = NOT_CHECKED | Low | Drift narrative is memory-sourced; F1.S1.1 measures it for real.

Open Questions (genuinely still open)

Most prior open questions were resolved in the blue-team pass (commit scope, rollback, support, success metric, cohort shape, cost brake). Two more resolved in session 3 (2026-06-07) — see Decisions Log. Remaining:

  1. Auth-stack reconciliation — leave wedged or fix? RESOLVED (S3): reconcile to clean state + full clean deploy (root cause = cert export-in-use deadlock; F1.b diagnostic = cdk diff autri-auth-and-compute). No tail tech debt.
  2. Higher-signal alarm route — is email + a watched-inbox enough for beta, or wire Slack/SMS for DLQ/error-rate now? (F3.S3.3.) — still open.
  3. /api/cache fix shape. RESOLVED (S3): authenticated org-scoped presigned-URL endpoint (unifies with existing presigned GET/PUT; touches web + network-and-data CDK only).

BLUE-TEAMED 2026-06-07. Wave-0a decisions locked session 3 (2026-06-07). Ready for Wave-0b code + human-led execution. F2 is the launch gate.
