Epic: Beta Launch (deploy current main, verify it holds in prod, invite the first users)
BLUE-TEAMED 2026-06-07 (drafted, fresh-agent ultracode red-team, then scope-cut with Dan). This is P0 on the roadmap (Horizon 1). It is not a build — the web app is already deployed and feature-complete at app.autri.ai. P0 is deploy current main → verify the trust-critical things hold in prod → invite the first individual users → measure. Mostly ops/verification, sequential. Ladders to North Star B4 (tiered security), B7 (dependable), B2 (know-cost-cold), B1 (trustable retrieval).
Verification stance (the spine of this epic). Every acceptance criterion names a deterministic check and an owner: 🤖 AI-mechanized (I close the loop myself before review) or 🧑 human-judgment (Dan's smoke test / qualitative call). The rule: the mechanized loop closes first; the human smoke-test is the backstop, not the primary. A criterion with no checking mechanism is itself a red-team finding.
Beta shape (locked 2026-06-07). The beta is individual users, not organizations: a couple of people from STEM Racing plus family/friends. Each signup auto-provisions its own org (the Cognito post-confirm Lambda inserts org + user + Personal library; users.organization_id = users.id = Cognito sub, a strict 1:1 self-org). So "an individual user" is a tenant, and we build zero org infrastructure for the beta. The Team-tier multi-user-shared-org path (D21: library_access grants, invites, admin roles) exists in schema but the beta does not exercise it (see Non-Goals).
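The self-org invariant above can be checked mechanically. A minimal sketch, assuming a hypothetical UserRow shape that carries only the two columns named above and a list of Cognito subs fetched separately; the function name and inputs are illustrative and not taken from the codebase:

```typescript
// Hypothetical row shape: only the two columns named in the invariant.
interface UserRow {
  id: string;
  organization_id: string;
}

// Returns the ids of users that break the strict 1:1 self-org rule:
// organization_id must equal id, and id must be a known Cognito sub.
function selfOrgViolations(rows: UserRow[], cognitoSubs: string[]): string[] {
  return rows
    .filter((r) => r.organization_id !== r.id || cognitoSubs.indexOf(r.id) === -1)
    .map((r) => r.id);
}
```

An empty result for every beta signup is the condition that lets "an individual user" be treated as a tenant.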
Features & Stories
Each feature carries a Verification mechanism line (🤖 AI / 🧑 human).
Design
Data Model Changes
None. Deploys migrations 013/014 (additive, already written). The /api/cache fix is authorization logic + presigned-URL issuance, no schema change.
Context
Overview
Goals & Non-Goals
Goals:
- Deploy current main to prod (web + migrations), shipping cost-observability (D66) + UI fixes. (Not a security ship; see correction below.)
- Prove the trust-critical invariants hold in prod with ≥2 real individual users: cross-user data isolation across every surface, and no silent data loss.
- Confirm cost observability works on the real deployed worker (the "measure first" premise).
- Seed a behavioral regression harness (F6) with the launch-critical checks — non-blocking, future-proofs the work.
- Invite the first individual users and start collecting cost + quality + behavior data.
Non-Goals:
- Not a feature build. No new product capability beyond what's on main (the cohort-gated source view F5 is the one conditional exception).
- Not the Team-tier multi-user-org path (D21): invites, library_access grants between users, admin roles. Beta = individual self-orgs; F2 tests cross-individual isolation only. Don't over-build org infra.
- Not per-org / per-user rate or cost caps. Account-level AWS Budgets ($50/$100/$200, email-notified) are the spend brake; per-tenant metering is H2 (D18). Accepted risk for a ~10-user friendly beta: a looping user is alerted-on, not throttled.
- Not commerce (Stripe/tiers/metering — H2), not MCP (cut, D56), not the Horizon-0 quality cluster, not enterprise (H3).
Problem Statement
The platform is built and deployed, but the beta is not launched because the trust-critical things have never been exercised under real conditions:
- Prod is behind main: it is missing cost-observability. (It is not missing a tenancy fix; see the correction below. That framing was wrong.) Deploying is small, but the auth stack is in a failed-deploy state that must be understood first (F1).
- Cross-user isolation is enforced in code but never tested with ≥2 real users in prod. A leak between users in a trust-first beta is catastrophic: the single highest real risk. Enforced-in-code ≠ verified.
- Failures may not surface. Monitoring exists, but 10 jobs sit in dead-letter queues (3 ingest + 7 extract, in ALARM since 2026-05-31, ~1 week un-actioned) — and we don't know whether those failures surfaced to the uploaders or vanished. The alarm channel is a single un-escalated email, which is why they sat unnoticed.
- Cost columns are worker-written — a local CLI ingest doesn't populate them, so D66 is verified only by seed, never by a real deployed run.
What Is This Epic?
The deploy-verify-invite effort that turns a live-but-unlaunched platform into a running beta. Ship current main, run a fixed verification battery against prod (cross-user probe matrix, induced-failure surfacing, real-ingest cost), then invite the first individual users. Output = evidence the beta is safe for real user data, plus a seeded regression harness — not new product code.
Dependents
- Everything in Horizon 2+ unblocks on real beta usage data this epic produces (pricing B5, API+dogfood B3/B6, enterprise B4).
- Every future epic extends F6's harness with its own cases.
Dependencies
- The deployed platform (EPIC-5, 2026-06-01): Cognito + Google auth, email allowlist, per-user self-orgs, multi-tenancy in code, upload→inspect→chat→cost UI. Verified, not built.
- AWS self-verification access (autri-prod CLI profile, read across all 5 stacks/RDS/CloudWatch/SQS/Budgets; documented in CLAUDE.md). Makes F1/F3/F4 self-checkable. Gap: RDS is private and no SSM/bastion exists, so DB-level checks need a new in-VPC read Lambda or the app's query path (F4).
- deploy.sh / deploy-web.sh: deploy.sh all runs migrate→web→ingestion→monitoring and explicitly excludes the auth stack (it's drift-blocked). Migrations auto-apply via the autri-db-migrate Lambda.
- CDK stacks: autri-auth-and-compute is in UPDATE_ROLLBACK_COMPLETE (failed 2026-06-01 deploy); the beta payload doesn't touch it (F1.b).
- Real Cognito signups: the F2 two-user provisioning path (replaces the prod-forbidden AUTRI_DEV_AUTH).
- Crucible + the existing vitest runner (incl. app/__tests__/scope.test.ts): the F6 harness substrate.
- The shipped in-app feedback→GitHub-issues button (D55): the named beta support channel.
Current State
- Prod: app.autri.ai, live. Deployed web image 0d22bbb (2026-06-02). Delta to main = 59 commits, but only 5 touch app/ (the deploy-web.sh surface); the other 54 are eval/ingestion/docs that don't deploy here. Migrations 013/014 ship via deploy.sh migrate.
- Tenancy fix already live (CORRECTION). The query-playground org-scope fix (5371370, D13's "last read-leak surface") is an ancestor of the deployed image; it shipped 2026-05-29. The deploy does not ship it. (decisions.md self-contradicted on this; L184 fixed 2026-06-07. Same stale-doc trap as last session, inverted; caught by a fresh agent checking git against the live Lambda.)
- Multi-tenancy (D13): read / mutation / query-playground enforcement shipped (org-scoped lookups, not-found on cross-org). 8 of 9 mutations use requireKbAccess; addDocumentsToKb uses an inline org-WHERE + worker-mode early-return (two mechanisms). Never validated by real multi-user traffic, which is the graduation gate.
- Auth stack wedged: autri-auth-and-compute = UPDATE_ROLLBACK_COMPLETE, rollback blocked 3× by the cert export. The drift is MCP/AgentCore infra, which D56 cut from beta. CloudFormation StackDriftStatus = NOT_CHECKED (drift is asserted from memory, never measured).
- Monitoring (autri-monitoring): alarms for DLQ depth, ingestion-degraded, postgres-connections, web/chat error-rate; all route to one SNS topic with a single email subscriber. RDS: single-AZ db.t4g.small (MultiAZ=false), 7-day backups + PITR, deletion-protection on. Restore (RTO) never tested.
- Cost brakes exist: 3 monthly AWS Budgets ($50/$100/$200, ACTUAL+FORECASTED email alerts). No per-org/rate cap.
- Cost observability (D66): rate table + per-doc stage-breakdown columns (013) + per-query Sonnet cost (014). Not deployed — rides this deploy; worker-written, so only a real deployed ingest exercises it.
Affected Systems
| System / Layer | How It's Affected |
|---|---|
| deploy.sh / deploy-web.sh | The beta deploy path (web+migrations); never touches the wedged auth stack; Docker prune first (ENOSPC) |
| Migrations 013/014 | Auto-apply via autri-db-migrate; verified additive (forward-only is safe) |
| autri-auth-and-compute | Health-checked (logins work post-rollback?), then a deliberate in/out-of-beta-scope call; NOT auto-touched |
| Multi-tenancy enforcement | Verified, not changed; the probe matrix exercises every guarded surface (incl. /api/cache, addDocumentsToKb) |
| /api/cache/[...path] (page images/figures) | Gets org-authorization (currently UUID-obscurity only); prod fix via org-scoped presigned URLs |
| Ingestion worker + DLQs | Cost-column writes + failure-surfacing verified; the 10 stranded messages investigated |
| autri-monitoring / SNS | Alarm→subscription wiring asserted; higher-signal route for DLQ/error-rate |
| Crucible / vitest runner | The F6 harness substrate (non-visual cases in vitest, visual in Crucible) |
| Inspector source pane (F5) | Only if the cohort brings non-PDF docs (more likely now — friends/family, not just STEM Racing PDFs) |
Approach
Sequential, gated on the deploy. The isolation gate (F2) is decoupled from the harness build (F6) so net-new harness work never blocks launch.
- Deploy current main (F1.a): Docker prune, deploy.sh all (web + migrations), smoke. Separately, health-check the wedged auth stack (F1.b) and decide if beta needs it touched (likely not; MCP is cut).
- Verify isolation (F2): sign up 2 real individual users (real Cognito/Google, real cookies), run the cross-user probe matrix as a minimal scripted check on the real auth path. Hard launch gate: must pass before any real user data goes in. Folding it into F6 is a fast-follow, not a prerequisite.
- Verify reliability (F3) — investigate the DLQ backlog, prove failures surface, confirm monitoring is enough.
- Verify cost (F4) — one real ingest through the deployed worker; confirm columns populate.
- (Conditional) source view (F5) — only if the cohort brings non-PDF docs.
- Seed the regression harness (F6) — non-blocking; turns F2/F3/F4 checks into permanent cases.
→ Beta is "launched" when: F1–F4 green, the first users are invited (Dan), AND ≥1 real user has uploaded + chatted. Green infra alone is not "launched."
Onboarding/support (folded in, not a feature): Dan owns invitations. Support = the shipped in-app feedback→GitHub-issues button (D55) — named here, not built; someone watches the issues. First-run sanity is covered by F2.S2.1 (signing up the 2 real users is the first-run check).
Poor /hl:ship parallel-wave fit (sequential, ops, prod-facing). Execute human-led.
Key methodology — the cross-user isolation probe matrix (F2)
The invariant (D13): for every guarded surface, a request from user A for user B's resource returns "not found" — same shape as a genuinely missing resource — no existence leak, no data. Run as user A against user B's real resources, on the real authenticated path (real Cognito cookies, never mocked — AUTRI_DEV_AUTH is statically compiled out in prod and is a synthetic bypass anyway):
| Surface | Probe (as user A, targeting user B's resource) | Expected |
|---|---|---|
| KB read (/kb/[kbId]) | open B's kbId | not-found, no leak |
| Doc read (/docs/[id]) | open B's docId | not-found |
| Chat (/api/chat) | query against B's kbId | not-found / no cross-user chunks |
| Query playground (/docs/[id]/query) | run query on B's docId | not-found |
| Mutations: requireKbAccess (8 of 9) | rename/delete/approve/etc. on B's kbId/docId | not-found, no mutation |
| Mutation: addDocumentsToKb (separate mechanism) | add docs to B's KB / add B's docId to your KB | not-found (probe its actual prod early-return behavior) |
| kbId+docId mix-and-match | your kbId, B's docId | not-found (knowledge_base_id = kbId defense) |
| /api/cache/[...path] (page images/figures) | GET <B-docId>/page-1.png while authed as A | not-found, not image bytes (the new authz; see F2.S2.4) |
| Retrieval (vector/fts/lookup) | n/a | org-blind by design: verified transitively via the Chat/query guards + a white-box assertion that no caller passes an unvalidated kbId (not an independent runtime probe) |
Also confirm the positive case — user A fully accesses its own resources — so we're testing isolation, not a blanket 404.
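One way to make each matrix cell deterministic is to compare three observations made as user A. This is a sketch under stated assumptions, not the harness itself: the ProbeResponse shape, the evaluateCell name, and the use of HTTP 200 as the self-access success code are all illustrative, and real bodies would likely need normalizing (request ids, timestamps) before comparison.

```typescript
// Minimal response shape captured from a real authenticated request.
interface ProbeResponse {
  status: number;
  body: string;
}

interface CellResult {
  surface: string;
  passed: boolean;
  reason: string;
}

// Evaluates one matrix cell from three observations made as user A:
//   own       - A reads A's own resource (the positive case)
//   crossUser - A requests B's real resource
//   missing   - A requests an id that does not exist at all
function evaluateCell(
  surface: string,
  own: ProbeResponse,
  crossUser: ProbeResponse,
  missing: ProbeResponse
): CellResult {
  // Assumption: a successful self-read is HTTP 200.
  if (own.status !== 200) {
    return { surface, passed: false, reason: "positive case failed" };
  }
  // The cross-user response must be indistinguishable from the response
  // for a genuinely missing resource, otherwise existence leaks.
  if (crossUser.status !== missing.status || crossUser.body !== missing.body) {
    return { surface, passed: false, reason: "cross-user response differs from missing-resource response" };
  }
  return { surface, passed: true, reason: "isolated" };
}
```

Comparing against a genuinely missing id, rather than hard-coding 404, is what catches an existence leak such as a 403 on B's resource versus a 404 on a random id.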
F1 — Deploy current main (+ assess the wedged auth stack) → B2 · S–M · 🧱
F1.a — Routine deploy (ships the beta payload; never touches auth):
| Story | Summary | Acceptance |
|---|---|---|
| S1.1 | cdk diff / detect-stack-drift of the 5 app/ commits' deploy surface; confirm scope (not "59 commits") | Real diff artifact reviewed; drift measured not assumed |
| S1.2 | Prune Docker; deploy.sh all (web + migrations 013/014) | Prod at main's SHA; migrations applied; no ENOSPC |
| S1.3 | Post-deploy render smoke vs prod (Crucible) | Auth + upload + inspect + chat + cost-display render green |
| S1.4 | Document + dry-run rollback (re-tag web Lambda to prior ECR image; record that 013/014 are additive → no DB rollback needed) | A tested rollback path exists, not an open question |
Verification mechanism: 🤖 cdk diff + prod-SHA check + Crucible render smoke + a rollback dry-run. 🧑 Dan approves the actual deploy.sh run (outward-facing).
F1.b — Auth-stack health (investigate, then decide):
| Story | Summary | Acceptance |
|---|---|---|
| S1.5 | Confirm the wedged auth stack left logins functionally intact; decide if beta needs it reconciled | Login/signup verified working; an explicit "auth reconciliation is / isn't in beta scope" decision recorded (MCP cut → likely out) |
Verification mechanism: 🤖 describe-stacks state + an end-to-end login check. 🧑 Dan's call on whether to do the manual cert-re-pin surgery now or defer (non-launch-gating).
F2 — Verify cross-user isolation in prod with ≥2 real users → B4 · S–M · 🧱 · HIGHEST RISK / LAUNCH GATE
| Story | Summary | Acceptance |
|---|---|---|
| S2.1 | Sign up 2 real individual users (real Google/Cognito) — each gets a self-org + KB + doc | Two real auth contexts; doubles as the first-run onboarding check (fresh signup lands on a usable empty state) |
| S2.2 | Scripted cross-user probe matrix on the real cookie path | Every cross-user probe → not-found, no leak; positive (self-access) works; documented |
| S2.3 | White-box assertion: no production caller passes an unvalidated kbId into retrieval | grep/code-review confirms retrieval's org-blindness is always guarded one layer up |
| S2.4 | Add org-authorization to /api/cache (page images/figures) — resolve docId→org against the session; org-scoped presigned URLs in prod | Cross-user image GET returns not-found, not bytes; matrix row green |
Verification mechanism: 🤖 the probe matrix (deterministic per cell) + the white-box grep + the /api/cache probe — re-runnable forever. 🧑 the 2-account setup is an honest human task; a one-time review that the suite hits the real path.
F3 — Verify failures surface + monitoring is sufficient → B7 · S · 🧱
| Story | Summary | Acceptance |
|---|---|---|
| S3.1 | Investigate the live DLQ backlog (3 ingest + 7 extract, in ALARM since 2026-05-31) | Root cause known (note: some are planted e2e test data); determined whether real failures surfaced or vanished |
| S3.2 | Verify failed-ingest surfacing end-to-end | An induced bad ingest shows a terminal error state to the user; never silently vanishes or hangs in "processing" |
| S3.3 | Confirm monitoring routing + accept availability posture | Each alarm has an SNS action with a confirmed subscriber (deterministic); a human confirms they read it; single-AZ accepted as a written decision; one backup-restore drill records RTO |
Verification mechanism: 🤖 induced-failure harness case, sqs get-queue-attributes, describe-alarms AlarmActions + sns list-subscriptions-by-topic, describe-db-instances backup config. 🧑 someone confirms the alarm channel is watched (consider a higher-signal route for DLQ/error-rate); the restore drill.
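The alarm-routing assertion in S3.3 reduces to a join between two CLI outputs. A minimal sketch, assuming the JSON from describe-alarms and sns list-subscriptions-by-topic has already been parsed into the shapes below; the field names mirror the AWS CLI output, while the function name is hypothetical. It treats a SubscriptionArn of "PendingConfirmation" as unconfirmed, which is how SNS reports a subscription that was never confirmed.

```typescript
// Subset of a CloudWatch alarm as returned by describe-alarms.
interface Alarm {
  AlarmName: string;
  AlarmActions: string[]; // SNS topic ARNs
}

// Subset of an SNS subscription as returned by list-subscriptions-by-topic.
interface Subscription {
  TopicArn: string;
  SubscriptionArn: string; // "PendingConfirmation" until the subscriber confirms
}

// Returns the names of alarms that have no action pointing at a topic
// with at least one confirmed subscriber.
function unroutedAlarms(alarms: Alarm[], subs: Subscription[]): string[] {
  const confirmedTopics = subs
    .filter((s) => s.SubscriptionArn !== "PendingConfirmation")
    .map((s) => s.TopicArn);
  return alarms
    .filter((a) => !a.AlarmActions.some((arn) => confirmedTopics.indexOf(arn) !== -1))
    .map((a) => a.AlarmName);
}
```

An empty result covers only the deterministic half of S3.3; whether anyone reads the inbox remains the human check.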
F4 — Real worker cost verification → B2 · S · 🔀
| Story | Summary | Acceptance |
|---|---|---|
| S4.1 | One real ingest of a structured/figure-heavy doc (e.g. an FIA technical PDF) through the deployed worker | Per-doc cost columns + stage breakdown populate; cost is order-of-magnitude plausible for that doc type (not a blanket "S6 range"); figure-vision cost is NON-zero (the D66 regression this must prove fixed) |
| S4.2 | One real chat query | Per-query Sonnet cost populates; bound checked against pricing.ts directly (S6 has no query baseline); inspector cost line renders |
Verification mechanism: 🤖 read the cost columns via the app's in-VPC query path (primary — no SSM/bastion exists) or a tiny read-only in-VPC Lambda (if a DB-level assert is wanted, its own approved story); Crucible asserts the inspector cost line renders. 🧑 sanity-check the dollar figures.
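The S4.1 acceptance can be expressed as a small deterministic check on one document's cost row. This is only a sketch: the column names (total_cost_usd, figure_vision_cost_usd) are hypothetical stand-ins for the 013 columns, and "order-of-magnitude plausible" is interpreted here as within 10× of an expected figure supplied by the reviewer.

```typescript
// Hypothetical subset of the per-doc cost columns added by migration 013.
interface DocCostRow {
  total_cost_usd: number | null;
  figure_vision_cost_usd: number | null;
}

// Returns a list of problems; an empty list means the row passes.
function costProblems(row: DocCostRow, expectedUsd: number): string[] {
  const problems: string[] = [];
  if (row.total_cost_usd === null) {
    problems.push("total cost column is null");
  } else if (row.total_cost_usd < expectedUsd / 10 || row.total_cost_usd > expectedUsd * 10) {
    problems.push("total cost is not within 10x of the expected figure");
  }
  // The D66 regression: figure-vision cost must be present and non-zero.
  if (row.figure_vision_cost_usd === null || row.figure_vision_cost_usd <= 0) {
    problems.push("figure-vision cost is null or zero");
  }
  return problems;
}
```

The expected figure is passed in per document type rather than fixed, in line with the "not a blanket S6 range" note in S4.1.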
F5 — (Cohort-gated) Source view for all doc types → B1 · S · 🔀
| Story | Summary | Acceptance |
|---|---|---|
| S5.0 | Determine first-cohort doc types | More likely to trigger now — friends/family may bring docx/prose, not just STEM Racing PDFs. Non-PDF present → F5 is P0; PDF-only → fast-follow |
| S5.1 | (If triggered) render source pane for non-PDF doc types | docx/prose/markdown show a usable source view, not a dead pane |
Verification mechanism: 🤖 headless Preview snapshot of a non-PDF source pane, assert non-empty. 🧑 visual judgment it's usable.
F6 — Verification & Regression Harness (cross-cutting; non-blocking; seeded) → B7, B1 · M · 🔀
| Story | Summary | Acceptance |
|---|---|---|
| S6.1 | Non-visual cases (F2 status codes, F3 terminal-state row, F4 cost-column non-null) in the existing vitest runner against a deployed env with real cookies; visual cases (F4 cost line, F5 source pane) in Crucible | The right tool per case type (answers the OQ#7 "is Crucible right?" — split it) |
| S6.2 | Seed with this epic's launch-critical cases | F2 matrix, F3 surfacing, F4 cost all encoded |
| S6.3 | Wire into the loop + convention | Runnable per-branch / pre-deploy; "add a case when you ship a feature" documented |
Verification mechanism: 🤖 harness self-reports green; deterministic cases. 🧑 one-time faithfulness check (real path, not mocks) + Dan's standing smoke-test as the backstop layer.
Stories
| Story | Summary | Status | PR |
|---|---|---|---|
| F1.a S1.1–S1.3 | Deploy main + diff + smoke | Not started (Wave 1, human-led) | |
| F1.a S1.4 | Document + dry-run rollback | Doc MERGED (autri-infra/docs/rollback.md); dry-run is a Wave-1 prod action | autri-infra#4 |
| F1.b S1.5 | Auth-stack health → in/out-of-scope decision | Not started. Root cause now evidenced: cdk diff shows autri-auth-and-compute has a pending [-] removal of the CertsAppCert export that autri-web imports → the export-in-use deadlock behind UPDATE_ROLLBACK_COMPLETE. Wave-1 needs the export-severing sequence. | |
| F2 S2.1–S2.3 | Sign up 2 real users + cross-user probe matrix + white-box retrieval assertion | Not started (Wave 2, post-deploy) | |
| F2 S2.4 | /api/cache org-authz | MERGED to main (Wave 0b). Cross-user probe is still Wave 2. | autri#63 + autri-infra#2 |
| F3 S3.1–S3.3 | DLQ backlog + surfacing + monitoring/availability | Not started | |
| F4 S4.1–S4.2 | Real worker + query cost (figure-vision non-zero) | Not started | |
| F5 S5.0–S5.1 | (Cohort-gated) non-PDF source view | Not started | |
| F6 S6.1–S6.3 | Behavioral regression harness (non-blocking, seeded) | Not started (S6.1 scaffold = remaining Wave-0b) | |
| (Known-issue polish) | Welcome-notification copy + dead /help/claude-desktop link | MERGED to main (rides Wave-1 deploy; lives in post-confirm Lambda) | autri-infra#3 |
S2.4 realization (session 3): instead of presigning at the inspector + chat server seams (would inject ~500-char signed URLs into the model's chat context every query + cause TTL staleness), the existing /api/cache/[...path] route became the single authz chokepoint — session + doc→org check, then in prod a 307-redirect to a short-lived presigned S3 GET; CloudFront behavior 5 + the cache-bucket OAC grant removed so there is no unauthenticated edge path. Org-authz is prod-only (CACHE_BUCKET-gated); dev still serves local files (CLI keys by filename slug, not UUID). Adversarial security review found no constructable cross-org read; cdk diff confirmed the Web + NetworkAndData changeset with autri-auth-and-compute untouched (decoupled from F1.b). The api-cache app + infra halves must deploy together (Wave 1) — see autri-infra/docs/rollback.md.
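The chokepoint logic described above can be sketched as a pure decision function. Everything named here is an assumption for illustration: the CacheRequest shape, the decideCacheResponse name, the injected synchronous presign stub (a real presigner would be asynchronous), and the choice to answer a missing session with not-found. The merged route in autri#63 may differ on each point.

```typescript
type CacheDecision =
  | { kind: "not-found" }
  | { kind: "redirect"; status: 307; location: string }
  | { kind: "serve-local"; path: string };

interface CacheRequest {
  sessionOrgId: string | null; // null when there is no valid session
  docOrgId: string | null; // org that owns the doc, null when the doc is unknown
  objectKey: string; // e.g. "<docId>/page-1.png"
  cacheBucket: string | undefined; // value of CACHE_BUCKET; undefined in dev
}

function decideCacheResponse(
  req: CacheRequest,
  presign: (bucket: string, key: string) => string
): CacheDecision {
  // Dev: org-authz is CACHE_BUCKET-gated, so local files are served as before.
  if (req.cacheBucket === undefined) {
    return { kind: "serve-local", path: req.objectKey };
  }
  // Prod: no session, unknown doc, and cross-org all collapse to the same
  // not-found answer so that existence does not leak.
  if (req.sessionOrgId === null || req.docOrgId === null || req.sessionOrgId !== req.docOrgId) {
    return { kind: "not-found" };
  }
  // Same org: redirect to a short-lived presigned S3 GET.
  return { kind: "redirect", status: 307, location: presign(req.cacheBucket, req.objectKey) };
}
```

Keeping the decision separate from the S3 call makes the cross-org branch testable in vitest without AWS credentials; the real-cookie probe in F2 still has to confirm the deployed behavior.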
Execution Plan (waves + sequencing)
Locked with Dan 2026-06-07. The organizing principle: separate code-work (parallelizable, pre-deploy) from verification-work (post-deploy, the gate). That split is what makes Wave 0 a /hl:ship candidate while the deploy + verification waves stay human-led and sequential.
Critical-path spine: decide /api/cache fix-shape → code /api/cache + welcome fixes → merge → clean deploy → F2 (S2.1→S2.2) → invite first users. Everything else hangs off this in parallel.
Wave 0 — Pre-deploy prep (parallel; zero prod risk; the /hl:ship-able subset)
Two micro-stages because two tasks need a decision before coding:
0a — Decide, before coding (small, fast):
- /api/cache fix-shape (resolves OQ#3). In prod, page images/figures are served S3→CloudFront with no Lambda in the read path, so editing route.ts only fixes the dev path. Real org-auth needs a design call: an authenticated endpoint issuing org-scoped presigned URLs vs. a CloudFront function / Lambda@Edge doing the check. This may touch CDK, not just the web image, so the "clean deploy" could carry an infra change. Decide the shape first.
- F1.b S1.5: auth-stack health + welcome-fix deploy path (elevated to Wave 0 because it gates the welcome fix). The welcome notification lives in the post-confirm Lambda, which is part of the wedged auth-and-compute stack. Determine whether it can be updated via a targeted aws lambda update-function-code (sidestepping the wedged CloudFormation update) or must wait on auth reconciliation. Also confirm logins work post-rollback.
0b — Code (parallel, fan-out-ready):
- F2 S2.4 (code): /api/cache org-authz fix per the 0a shape. The trust-critical one; must land in main to ride the deploy.
- Welcome-message fix: repoint the notification body + link off the dead /help/claude-desktop (cut MCP feature) to a real first-run action (e.g. upload page). Deploy path per 0a.
- (Optional polish, non-gating) Mascot 404 page: not-found.tsx using the WIP golden-retriever SVGs in dogs/, with the in-app feedback→GitHub button (D55) that auto-captures the attempted URL + referrer so a 404 report reads "user hit /help/claude-desktop", not a generic "broken". Closes the feedback loop on whatever lands users there. Slate if Wave 0 has room; otherwise fast-follow.
- F6 S6.1 (scaffold): stand up the harness (non-visual cases in the existing vitest runner; visual in Crucible). Non-blocking.
- F1.a S1.4: document the rollback procedure.
→ Merge Wave-0 code to main. cdk diff (F1.a S1.1) runs after this merge — the deploy now ships more than current main, so the diff must reflect what actually goes out.
Wave 1 — Deploy (sequential, human-led; the gate to prod state)
F1.a S1.2 (deploy.sh all — main + Wave-0 code + migrations 013/014; Docker prune first) → S1.3 (post-deploy Crucible render smoke). Dan approves the actual deploy. The welcome-fix and any /api/cache infra change ride here (or via the targeted Lambda update from 0a).
Wave 2 — Post-deploy verification (3 parallel tracks)
| Track | Stories | Launch gate? |
|---|---|---|
| F2 isolation | S2.1 (sign up 2 real users) → S2.2 (cross-user probe matrix, incl. now-fixed /api/cache) → S2.3 (white-box retrieval assertion) | YES — hard gate |
| F3 reliability | S3.1 (DLQ root-cause) ∥ S3.2 (failure surfacing) ∥ S3.3 (monitoring routing + single-AZ decision + restore drill) | no |
| F4 cost | S4.1 (real ingest, non-zero vision cost) ∥ S4.2 (query cost) | no |
F5 S5.0 (determine cohort doc types) runs anytime here; gates the conditional S5.1.
Wave 3 — Harness consolidation (non-blocking fast-follow)
F6 S6.2 (seed the verified F2/F3/F4 checks as permanent cases) → S6.3 (wire into the loop). Depends on Wave-2 checks existing; does not gate launch.
Wave 4 — Launch
Dan invites the first individual users (a couple from STEM Racing + friends/family) → confirm ≥1 real user has uploaded + chatted = "launched." F5 S5.1 only if the cohort brought non-PDF docs.
Parallelism summary
- Fan-out-ready (Wave 0b code): /api/cache fix · welcome fix · 404 page · harness scaffold · rollback doc. All independent; a /hl:ship wave candidate.
- Sequential / human-led: the deploy (single prod action, Dan-approved); F2 S2.1→S2.2 (need users before probing); launch after F2 is green.
- Parallel post-deploy: F2 / F3 / F4 tracks run concurrently; only F2 gates the invite.
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-06-07 | Beta = individual users (self-orgs), no org infra built | Post-confirm Lambda already 1:1-provisions; cohort is individuals (STEM Racing few + friends/family) | Build/test Team-tier multi-user-org (D21 — deferred) |
| 2026-06-07 | F2 verified via 2 real Cognito users on the real cookie path | AUTRI_DEV_AUTH is prod-forbidden + a synthetic bypass; real path is the only honest check | dev-auth seam (code-disabled in prod); staging-parity (contradicts "in prod") |
| 2026-06-07 | /api/cache gets real org-authz (in P0) | Trust-first beta; today it's UUID-obscurity only on user document images | Accept obscurity / defer (rejected for a trust pitch) |
| 2026-06-07 | Deploy ships cost-obs, NOT a tenancy fix | 5371370 already live in prod (verified) — the urgency framing was false | — (corrected decisions.md L184) |
| 2026-06-07 | F6 decoupled from the F2 launch gate; non-blocking | Net-new harness build must not gate the highest-risk check; a vitest runner already exists | F6-before-F2 (puts build risk on critical path) |
| 2026-06-07 | Account-level Budgets are the beta spend brake; no per-org/rate cap | $50/$100/$200 budgets exist + alert; per-tenant metering is H2 | Build a per-org quota (overkill for ~10 friendly users) |
| 2026-06-07 | Single-AZ RDS accepted for beta (written) | ~10-user beta; deletion-protection + 7-day PITR; do one restore drill | Multi-AZ (unnecessary cost for beta) |
| 2026-06-07 | Onboarding folded in, not a feature | Dan owns invites; support = shipped feedback→GitHub button (D55); "launched" = invited AND ≥1 real upload+chat | A standalone onboarding feature (over-scoped) |
| 2026-06-07 | Execute human-led + sequential, not /hl:ship parallel | Ops/verification, prod-facing, interdependent | Parallel agent fan-out (mismatched) |
| 2026-06-07 (S3) | /api/cache fix shape = authenticated org-scoped presigned-URL endpoint (resolves OQ#3) | Build-it-right over a temporary workaround (Dan); the app already serves private content via presigned GET (/api/feedback/screenshot-url) + PUT (/api/kb/[kbId]/upload-url), so this unifies the asset-serving model rather than adding a pattern. SDK already a dep. ~1.5–2d, low-risk. Touches CDK in web + network-and-data only — NOT the wedged auth stack (decoupled from the auth reconciliation). | Lambda@Edge viewer-auth (heaviest ops: us-east-1-only, slow global deploy+rollback; edge→RDS org lookup is an anti-pattern); main-Lambda byte passthrough (simplest, but regresses EPIC-4.5 Phase 2's deliberate no-Lambda byte path); CloudFront signed cookies (CDN-edge-cached private images, but needs keypair mgmt + org-prefixed cache keys for marginal benefit on per-user private images) |
| 2026-06-07 (S3) | Auth stack: reconcile to clean state + full clean deploy (NOT leave wedged); welcome-fix rides that deploy (resolves OQ#1 for beta + sets the welcome-fix deploy path) | Dan: get all cloud infra correct, no tail deployment tech debt. Root cause now known = CloudFormation cross-stack export-in-use deadlock on the app.autri.ai ACM cert (auth-and-compute exports CertsAppCert; web imports it via cdn.ts props.appCert) — recurring 4× since 2026-05-27. F1.b diagnostic = cdk diff autri-auth-and-compute: if the pending change is just the post-confirm Lambda code (most likely) → a clean cdk deploy just works and the welcome-copy fix rides it; if main actually moved/removed the cert output → do the standard export-severing sequence (break web's import → deploy web → update auth → re-link), carefully (live cert). | Leave wedged + targeted aws lambda update-function-code for the welcome fix (rejected — leaves CFN drift / tail tech debt; the explicit thing Dan wants to avoid) |
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Cross-user data leak under real traffic | Low (enforced in code) | Catastrophic (trust-first beta) | F2 probe matrix = hard launch gate; test before any real data |
| /api/cache image leak (UUID-obscurity only) | Medium | High (user document images) | F2.S2.4 org-scoped presigned URLs |
| Wedged auth stack blocks/complicates deploy | Low (beta path excludes it) | Medium | F1.a never touches it; F1.b health-check + explicit scope call |
| The 10 stranded DLQ jobs = silent loss already happening | Medium | Trust erosion / lost data | F3.S3.1 root-cause + S3.2 surfacing fix |
| Cost columns uncheckable (RDS private, no bastion) | Medium | F4 blocked | App query-path primary; optional read-only in-VPC Lambda |
| Alarm channel missed (single email) | Medium (already happened) | Failures sit un-actioned | F3.S3.3 confirmed-subscriber assert + higher-signal route |
| Harness scope creep (F6) | Medium | Eats timeline | Non-blocking, seeded-only, grows per-epic |
| Cohort brings non-PDF docs | Medium (friends/family) | Dead source pane | F5 cohort-gating + fast-follow |
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| Welcome notification points beta users at Claude Desktop / MCP | Low (but first impression) | Post-confirm Lambda's notifications row says "Connect Autri to Claude Desktop to get started" → /help/claude-desktop; MCP is cut from beta (D56). One-line copy fix in autri-infra/lambda-handlers/post-confirm. |
| addDocumentsToKb uses a different guard mechanism than the other 8 mutations | Low | Inline org-WHERE + worker-mode early-return; probe separately (F2). decisions.md L183 annotated 2026-06-07. |
| CDK StackDriftStatus = NOT_CHECKED | Low | Drift narrative is memory-sourced; F1.S1.1 measures it for real. |
Open Questions (genuinely still open)
Most prior open questions were resolved in the blue-team pass (commit scope, rollback, support, success metric, cohort shape, cost brake). Two more resolved in session 3 (2026-06-07) — see Decisions Log. Remaining:
- Auth-stack reconciliation (leave wedged or fix?): RESOLVED (S3). Reconcile to clean state + full clean deploy (root cause = cert export-in-use deadlock; F1.b diagnostic = cdk diff autri-auth-and-compute). No tail tech debt.
- Higher-signal alarm route: is email + a watched-inbox enough for beta, or wire Slack/SMS for DLQ/error-rate now? (F3.S3.3.) Still open.
- /api/cache fix shape: RESOLVED (S3). Authenticated org-scoped presigned-URL endpoint (unifies with existing presigned GET/PUT; touches web + network-and-data CDK only).
BLUE-TEAMED 2026-06-07. Wave-0a decisions locked session 3 (2026-06-07). Ready for Wave-0b code + human-led execution. F2 is the launch gate.