
W3 Implementation — Requirements

Scope contract for the W3 web-stack implementation work. Locked 2026-05-26 via /hl:blue-team against projects/autri/sub-systems/web-stack-w3. Companion to the autri-beta requirements doc — this doc is narrower-scope (just the W3 web layer) and inherits the broader beta contract.

Source: 17 locked decisions + 18 known issues + 13 cross-cutting concerns from the sub-system design doc, triaged into Must/Should/Nice/Out-of-Scope. The triage's load-bearing pushback: parallel-domain validation was cut because there are no production users to protect. Direct cutover with fix-forward.


Goal

Replace Amplify Hosting with W3 DIY (CloudFront + S3 + Lambda Function URL) as the runtime for app.autri.ai. End state: Dan + mom + STEM Racing engineer can hit app.autri.ai, log in via Google, chat with their KBs, see inspector page images, and create connectors that install in Claude Desktop. Amplify CDK constructs torn down.

This is the implementation contract for web-stack-w3 per D43. The detailed design lives in the sub-system doc.


Definition of Done

W3 implementation is complete when all of the following hold:

  • app.autri.ai serves the Next.js app via the new W3 CloudFront distribution (DNS swap completed; old Amplify CloudFront receives no traffic)
  • Login via Google federation works end-to-end (browser → Cognito hosted UI → callback → session cookie → DB lookup)
  • /kb and other DB-touching SSR routes return real data (Main Lambda reaches RDS via VPC private subnets)
  • /api/chat streams Anthropic responses correctly through CloudFront → Chat Lambda Function URL (no buffering; AI SDK on client parses chunks live)
  • Inspector page images load (/api/cache/<doc-id>/page-NN.png served by CloudFront directly from S3 cache bucket; no Lambda invocation)
  • D44 connector creation works end-to-end: server action → Connector-Management Lambda → Cognito CreateUserPoolClient → display 3 paste fields (server URL + bearer + client_id/secret)
  • Pasted credentials work in Claude Desktop Advanced settings; MCP queries succeed
  • Three critical CloudWatch alarms armed (Main Lambda error rate, Chat Lambda error rate, RDS active-connections)
  • Amplify CDK constructs (CfnApp + CfnBranch + CfnDomain) deleted from auth-and-compute stack
  • Deploy + rollback procedure documented in autri repo (docs/deploy.md or README)
  • Cloudflare-must-stay-DNS-only constraint documented (CDK comment + project README)

In Scope (MUST-HAVE)

Cleanup (same session, post-swap)

  1. Delete CfnApp + CfnBranch + CfnDomain from auth-and-compute stack
  2. Verify old Amplify CloudFront receives no traffic post-DNS-propagation

DNS / Auth ops

  1. DNS swap — repoint app CNAME from current Amplify CloudFront (d2bkdemcj0sjyg.cloudfront.net) to new W3 CloudFront distribution

Infrastructure (CDK in autri-infra)

  1. New WebStack construct (or repurpose lib/auth-and-compute/amplify.ts into web.ts)
  2. Main Lambda function — VPC config, IAM execution role, env vars, Function URL
  3. Chat Lambda function — separate from Main (Q1'). VPC config, IAM, env, Function URL, second build artifact
  4. Connector-Management Lambda (no VPC attachment; outside the VPC it reaches the Cognito API over the public endpoint, so no VPC endpoint or NAT is needed). Tight IAM: cognito-idp:CreateUserPoolClient + UpdateUserPoolClient + DeleteUserPoolClient, scoped to user pool ARN only
  5. 3 Lambda execution roles + IAM policies
  6. Shared SSR/Chat Lambda security group + RDS SG ingress rule on port 5432
  7. Static S3 bucket + Origin Access Control (OAC)
  8. CloudFront distribution with 5 behaviors:
    • /_next/static/* → static S3 bucket (cached, 1-year TTL)
    • /static/* → static S3 bucket
    • /api/cache/* → existing cache S3 bucket (from NetworkAndData) via OAC
    • /api/chat → Chat Lambda Function URL (origin response timeout = 60s)
    • Default → Main Lambda Function URL
  9. CloudFront Origin Request + Cache Policies pinned (per Q12):
    • Default + /api/chat: AllViewerExceptHostHeader + CachingDisabled
    • Static + cache behaviors: CORS-S3Origin + CachingOptimized
  10. CloudFront Response Headers Policy (HSTS + X-Frame-Options + X-Content-Type-Options + Referrer-Policy + minimal CSP)
  11. Cognito app.autri.ai resource server (NEW); verify mcp.autri.ai resource server exists
  12. CDK stack outputs (function names, alias names, distribution ID, bucket names) for deploy script consumption
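
The five behaviors and their pinned policies (items 8 and 9) can be modeled as a small routing table, which is handy for reasoning about which origin a given path reaches before the distribution exists. This is a sketch, not CDK code: the origin labels, `matchBehavior`, and `patternToRegExp` are hypothetical names, and only the `*` wildcard is handled.

```typescript
// Illustrative model only: origin labels and helper names are hypothetical.
type Origin = "static-s3" | "cache-s3" | "chat-lambda" | "main-lambda";

interface Behavior {
  pathPattern: string;
  origin: Origin;
  originRequestPolicy: string;
  cachePolicy: string;
}

// Policy pairs as pinned in item 9.
const s3Policies = { originRequestPolicy: "CORS-S3Origin", cachePolicy: "CachingOptimized" };
const lambdaPolicies = { originRequestPolicy: "AllViewerExceptHostHeader", cachePolicy: "CachingDisabled" };

// Path-specific behaviors, checked in order; the default behavior catches everything else.
const behaviors: Behavior[] = [
  { pathPattern: "/_next/static/*", origin: "static-s3", ...s3Policies },
  { pathPattern: "/static/*", origin: "static-s3", ...s3Policies },
  { pathPattern: "/api/cache/*", origin: "cache-s3", ...s3Policies },
  { pathPattern: "/api/chat", origin: "chat-lambda", ...lambdaPolicies },
];
const defaultBehavior: Behavior = { pathPattern: "*", origin: "main-lambda", ...lambdaPolicies };

// Convert a path pattern to a RegExp: escape regex metacharacters, then expand "*".
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Return the first matching behavior, falling back to the default.
function matchBehavior(path: string): Behavior {
  return behaviors.find((b) => patternToRegExp(b.pathPattern).test(path)) ?? defaultBehavior;
}
```

Note that `/api/chat` is an exact pattern in this table, so a path such as `/api/chat/history` would fall through to the Main Lambda.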

App code (in autri)

  1. Re-enable output: 'standalone' in next.config.mjs
  2. db/client.ts async init refactor (top-level await + env-var branching: DB_SECRET_ARN → fetch secret; else fallback to DATABASE_URL)
  3. Lambda handler wrapper(s) — translate Function URL events → Next request handler (~30 lines)
  4. Chat Lambda build configuration (separate package; copies /api/chat/route.ts + traced deps)
  5. JWT audience validation in Main Lambda (require aud=app.autri.ai)
  6. JWT audience validation in Chat Lambda (require aud=app.autri.ai)
  7. Connector creation server action → invokes Connector-Mgmt Lambda via lambda:Invoke
  8. Remove /api/cache/[...path]/route.ts from autri (CloudFront serves direct from S3)
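
For items 5 and 6, the audience check itself is small. The sketch below assumes the token's signature, issuer, and expiry have already been verified elsewhere and that the decoded claims are available as a plain object; `JwtClaims` and `hasRequiredAudience` are hypothetical names. Per RFC 7519 the `aud` claim may be a single string or an array of strings, so both shapes are handled.

```typescript
// Decoded JWT payload; only `aud` matters for this check.
interface JwtClaims {
  aud?: string | string[];
  [claim: string]: unknown;
}

const REQUIRED_AUDIENCE = "app.autri.ai";

// True only when the required audience is present in the token's `aud` claim.
// Assumes signature, issuer, and expiry were verified before this call.
function hasRequiredAudience(claims: JwtClaims, required: string = REQUIRED_AUDIENCE): boolean {
  const aud = claims.aud;
  if (typeof aud === "string") return aud === required;
  if (Array.isArray(aud)) return aud.includes(required);
  return false; // a missing audience is rejected
}
```

Sharing one helper between the Main and Chat Lambdas would keep the two checks from drifting apart.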

Build/Deploy (in autri)

  1. pnpm deploy:web script:
    • Build static bundle + Main Lambda artifact + Chat Lambda artifact
    • Sync static bundle to S3 with cache headers
    • Upload Lambda artifacts → new versions
    • Alias-promote: prev ← old current, then current ← new
    • CloudFront cache invalidation for /static/* paths
  2. pnpm rollback:web script — swaps current and prev aliases
  3. Build command pinned per Q9: pnpm install --frozen-lockfile && pnpm --filter @autri/app build && pnpm deploy --filter @autri/app --prod /tmp/standalone-build

Observability — critical alarms

  1. Lambda CloudWatch Log Groups for Main + Chat + Connector-Mgmt, 30-day retention
  2. CloudWatch alarm: Main Lambda error rate > 5% over 5 min → existing AWS Budgets SNS topic
  3. CloudWatch alarm: Chat Lambda error rate > 5% over 5 min → existing AWS Budgets SNS topic
  4. CloudWatch alarm: RDS active-connections > 70% of max_connections → existing SNS topic
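
The thresholds above reduce to two comparisons. The sketch below states them as plain functions to make the boundary behavior explicit (strictly greater than, as written in the alarm definitions); the function names are hypothetical, and real alarms would be expressed as CloudWatch metric math rather than application code.

```typescript
// Lambda alarm: errors / invocations > thresholdPercent over the window.
// Integer arithmetic avoids floating-point surprises at the boundary.
function lambdaErrorRateBreached(errors: number, invocations: number, thresholdPercent = 5): boolean {
  return errors * 100 > invocations * thresholdPercent;
}

// RDS alarm: active connections > thresholdPercent of max_connections.
function rdsConnectionsBreached(active: number, maxConnections: number, thresholdPercent = 70): boolean {
  return active * 100 > maxConnections * thresholdPercent;
}
```

With zero invocations in the window the Lambda check evaluates to false, so an idle function does not trip the alarm in this model.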

Documentation

  1. docs/deploy.md (or README section) — deploy + rollback procedures
  2. CDK comment + project README: "Cloudflare must stay DNS-only (gray cloud); only CloudFront does CDN duty"

Out of Scope (cut from W3; explicitly v1.1+)

These are tracked in the sub-system doc's Known Issues table. Do not pull into W3 scope without amending this requirements doc.

| Item | Why cut | Trigger to revisit |
|---|---|---|
| Parallel app-w3.autri.ai validation domain | No production users to protect; rollback to Amplify isn't a useful escape (Amplify can't reach DB anyway); ~3-4 hours of theater | If real users land before W3 and we need pre-cutover validation |
| Connector "Rotate secret" button (Q16) | Beta users have 1-2 connectors; delete+recreate is workable recovery (~30 sec UX); ~2 hours of UI work | First beta user complains about losing a client_secret |
| 4 additional CloudWatch alarms (CloudFront 5xx, CloudFront 4xx, concurrent executions, ENI count) | Ship 3 critical now; layer rest on in a 1-hour follow-up session | After W3 launch session completes |
| Provisioned concurrency for cold-start hardening | Beta scale doesn't justify the $5-15/mo cost yet | User feedback says cold-start UX is bad |
| RDS Proxy | Beta scale (single-digit concurrent Lambdas) is far below the exhaustion threshold | Concurrent Lambda metric trends past 30 |
| Resumable streams for /api/chat | 60s CloudFront cap is acceptable for most chat turns; ~1-2 day v1.1 lift | User feedback on truncated long responses |
| Tighter CSP (nonce-based script-src) | Beta ships with minimal CSP ('self' 'unsafe-inline'); Next inline-script needs not yet mapped | Security audit pre-paid-customer |
| WebSocket origin via APIGW | Beta notifications use polling | Real-time UX becomes a v1.1 requirement |
| GitHub Actions deploy pipeline | Phase 2 per Q5 (manual-first per D42 pattern) | After W3 has run stable for 1-2 weeks |
| Bundle splitting beyond Main/Chat | Q1' split is the only structural split; further splits are optimization | Cold-start measurements show Main bundle still too heavy |
| Multi-region | us-east-1 only (per existing infra-and-auth-plan constraints) | First paying customer pushes for latency/compliance |
| Bedrock LLM cutover | Beta uses Anthropic API direct (per existing decisions) | v1.1 |
| Multi-tenancy RLS enforcement | Beta uses scope checks at the request layer only (D13 pending) | v1.1 hard blocker per D13 |
| Stripe / paid tier enforcement | Beta is free for allowlisted users | Post-MVP |
| Mobile-responsive polish | Out of beta per existing decisions | Post-beta |

Sequencing — first task

Bundle measurement completed 2026-05-26. Findings + locked decision:

Total .next/standalone/ = 235MB unzipped, broken down:

  • ingestion/cache/ = 197MB: local-dev artifacts traced by Next because of the /api/cache/[...path] route. Evaporates when Q11 removes the route (item 8 under App code in the In Scope list).
  • node_modules/ = 35MB: Next runtime (21MB) + native binaries + small deps. Critical workspace-hoisted deps (AI SDK, Anthropic SDK, drizzle-orm) are missing, because pnpm hoists them and Next's outputFileTracing does not follow workspace symlinks. The Pattern #2 fix via pnpm deploy --prod (Q9) resolves this.
  • app/ = 2.8MB: traced server code
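
The arithmetic behind the breakdown can be checked directly. The sketch below uses only the measured sizes above plus the doc's 50-70MB post-fix estimate; the variable names are hypothetical, and the "implied missing deps" figure is derived from that estimate rather than measured.

```typescript
// Sizes in MB, copied from the measurement above.
const measuredMB: Array<[string, number]> = [
  ["ingestion/cache", 197],
  ["node_modules", 35],
  ["app", 2.8],
];

// Round to one decimal place to keep floating-point noise out of the totals.
const round1 = (n: number): number => Math.round(n * 10) / 10;

// Sum of the three measured directories (the doc rounds this to 235MB).
const totalMB = round1(measuredMB.reduce((sum, [, size]) => sum + size, 0));

// What remains once the cache route, and with it ingestion/cache/, is removed.
const withoutCacheMB = round1(
  measuredMB
    .filter(([dir]) => dir !== "ingestion/cache")
    .reduce((sum, [, size]) => sum + size, 0),
);

// The 50-70MB estimate implies the missing workspace deps add roughly this much.
const estimatedRangeMB: [number, number] = [50, 70];
const impliedMissingDepsMB = estimatedRangeMB.map((mb) => round1(mb - withoutCacheMB));
```

This gives 234.8MB in total and 37.8MB without the cache directory, which implies roughly 12-32MB for the currently missing dependencies.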

Q8 locked: container image (not zip). Three signals:

  1. Even with cache evaporated + missing deps resolved, the bundle lands in the 50-70MB range — on the edge of the 50MB zip cap. Future deps push it over.
  2. The pnpm deploy --prod packaging step we need anyway produces container-image-friendly output. Matches the MCP server pattern.
  3. Cold-start cost of container image (~150ms) is small vs the ~500-800ms cold-start floor we already accepted.

Implementation order from here:

  1. CDK Lambda construct design — Main + Chat + Connector-Management Lambdas, all as container images. ECR repos created for each in auth-and-compute stack (or new web stack).
  2. App code refactors: db/client.ts async init + env-var branching; Lambda handler wrappers; Chat Lambda separate build package; JWT audience validation in both Lambdas; remove /api/cache/[...path] route handler.
  3. Build/deploy script: pnpm deploy:web runs build + Docker build + ECR push + Lambda update + alias promote + CloudFront invalidation.
  4. CloudFront distribution — 5 behaviors, OAC for static + cache buckets, Response Headers Policy, Origin Request + Cache Policies pinned to AWS-managed combos.
  5. Cognito resource server provisioning: app.autri.ai resource server (NEW); verify mcp.autri.ai exists.
  6. Smoke test via aws lambda invoke + curl -k against raw CloudFront URL (pre-cutover validation per blue-team strategy).
  7. DNS swap — Cloudflare app CNAME repointed to new CloudFront distribution.
  8. Post-swap verification — login + chat + inspector + connector creation E2E.
  9. Amplify CDK teardown — same session, post-verification.
  10. Three critical CloudWatch alarms + Log Group retention configured.
  11. Documentation — deploy + rollback procedures; Cloudflare-DNS-only constraint.

Detailed sequencing belongs in the implementation session's working notes, not this requirements doc.
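
As one illustration of step 2, the env-var branching for db/client.ts could be separated from the async fetch so that the decision is easy to test. This is a sketch under assumptions: `pickDbSource`, `lazy`, and `DbSource` are hypothetical names, the environment is passed in rather than read from `process.env`, and the secret fetch itself (Secrets Manager call, secret format) is left out.

```typescript
type Env = Record<string, string | undefined>;

// Where the connection details come from, decided purely from env vars.
type DbSource =
  | { kind: "secret"; arn: string }
  | { kind: "url"; url: string };

// DB_SECRET_ARN wins (deployed Lambda); otherwise fall back to DATABASE_URL (local dev).
function pickDbSource(env: Env): DbSource {
  if (env.DB_SECRET_ARN) return { kind: "secret", arn: env.DB_SECRET_ARN };
  if (env.DATABASE_URL) return { kind: "url", url: env.DATABASE_URL };
  throw new Error("Neither DB_SECRET_ARN nor DATABASE_URL is set");
}

// Memoize an async initializer so concurrent callers share a single in-flight promise.
// This is the shape a lazy-init fallback could take if top-level await proves awkward.
function lazy<T>(init: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) cached = init();
    return cached;
  };
}
```

A call site would wrap its client construction in `lazy(...)` and await the returned getter, so the secret is fetched at most once per Lambda execution environment.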

Pre-Cutover Validation Strategy

With parallel domain cut, validation happens via:

  1. aws lambda invoke against each deployed Lambda directly — verifies cold-start path, secret fetch, DB connection, and route dispatch with synthetic events
  2. curl -k against raw CloudFront distribution URL (https://d12345.cloudfront.net/) — verifies CloudFront routing + static asset behavior + Function URL plumbing (cert errors expected since cert is bound to app.autri.ai)
  3. Local dev (pnpm dev) continues to verify app code correctness independently
  4. DNS swap → immediate browser smoke test on app.autri.ai — real auth E2E. If broken, fix forward (rollback to Amplify is not a working state for DB-touching routes).
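
For step 1, a synthetic event needs to resemble what a Function URL actually delivers (payload format 2.0). The sketch below shows a minimal event and the kind of translation the handler wrapper performs on it. Only a subset of the event fields is modeled; `TranslatedRequest`, `toRequest`, and the example hostname are hypothetical, and base64 body decoding is left to the caller.

```typescript
// Subset of the Lambda Function URL event (payload format 2.0) used by this sketch.
interface FunctionUrlEvent {
  rawPath: string;
  rawQueryString: string;
  headers: Record<string, string | undefined>;
  cookies?: string[];
  body?: string;
  isBase64Encoded: boolean;
  requestContext: { domainName: string; http: { method: string } };
}

// Hypothetical intermediate shape; a real wrapper would hand this to the Next request handler.
interface TranslatedRequest {
  method: string;
  url: string;
  headers: Record<string, string | undefined>;
  body: string | undefined;
  bodyIsBase64: boolean;
}

function toRequest(event: FunctionUrlEvent): TranslatedRequest {
  const host = event.headers["host"] ?? event.requestContext.domainName;
  const query = event.rawQueryString ? "?" + event.rawQueryString : "";
  const headers: Record<string, string | undefined> = { ...event.headers };
  // Format 2.0 carries cookies in a separate array; fold them back into one header.
  if (event.cookies && event.cookies.length > 0) {
    headers["cookie"] = event.cookies.join("; ");
  }
  return {
    method: event.requestContext.http.method,
    url: "https://" + host + event.rawPath + query,
    headers,
    body: event.body,
    bodyIsBase64: event.isBase64Encoded,
  };
}

// A synthetic GET /kb?page=2, suitable as a starting point for an invoke payload.
const syntheticEvent: FunctionUrlEvent = {
  rawPath: "/kb",
  rawQueryString: "page=2",
  headers: { accept: "text/html" },
  cookies: ["session=abc", "theme=dark"],
  isBase64Encoded: false,
  requestContext: { domainName: "example.lambda-url.us-east-1.on.aws", http: { method: "GET" } },
};
```

Serializing `syntheticEvent` to JSON gives a payload file for aws lambda invoke; adding a real session cookie would exercise the authenticated path as well.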

Dependencies

  • EPIC-4 Days 8-11 stack must be live (NetworkAndData, AuthAndCompute, Monitoring) — confirmed live as of 2026-05-26
  • RDS migrations applied through the CDK custom-resource Lambda — confirmed
  • Cognito user pool live at auth.autri.ai — confirmed
  • Cache S3 bucket provisioned in NetworkAndData — confirmed; cache files appear at runtime when users ingest docs (ingestion sub-system writes to S3; verify behavior during W3 implementation, escalate to ingestion sub-system if broken)
  • Cognito allowed callback URLs include https://app.autri.ai/api/auth/callback — confirmed (no change needed since direct cutover)

Risks (load-bearing, from sub-system doc)

Truncated list of the highest-impact risks that warrant active attention during implementation. Full risk table lives in sub-system doc § Risks & Constraints.

  1. Main Lambda bundle exceeds zip cap. Measured 2026-05-26 (see Sequencing); Q8 locks container images as the escape hatch.
  2. db/client.ts async init forces refactor across call sites. Validate top-level await early; lazy proxy is the backup.
  3. CloudFront cache-key trap. Pin AWS-managed AllViewerExceptHostHeader + CachingDisabled for Lambda origins. Verify in CDK code review.
  4. /api/chat truncation at 60s. Documented limit for beta; resumable streams is v1.1 if usage shows truncation.
  5. DNS propagation post-swap. Fix-forward strategy since Amplify rollback isn't useful.
  6. Connector-Management Lambda IAM scope. Must be tightly scoped to user pool ARN with only the 3 named verbs.
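
Risk 6 can be made checkable by writing the intended policy down and asserting its shape. The sketch below builds a plain IAM policy document for the three named actions and tests that nothing broader slipped in. The helper names and the example ARN (placeholder account and pool IDs) are hypothetical; in practice the policy would come from the CDK construct.

```typescript
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// The only three verbs the Connector-Management Lambda should hold.
const ALLOWED_ACTIONS: string[] = [
  "cognito-idp:CreateUserPoolClient",
  "cognito-idp:UpdateUserPoolClient",
  "cognito-idp:DeleteUserPoolClient",
];

// Build the intended policy, scoped to a single user pool ARN.
function connectorMgmtPolicy(userPoolArn: string): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [{ Effect: "Allow", Action: [...ALLOWED_ACTIONS], Resource: [userPoolArn] }],
  };
}

// True only if every statement allows nothing beyond the three verbs on that one pool.
function isTightlyScoped(doc: PolicyDocument, userPoolArn: string): boolean {
  return doc.Statement.every(
    (s) =>
      s.Effect === "Allow" &&
      s.Action.every((a) => ALLOWED_ACTIONS.includes(a)) &&
      s.Resource.every((r) => r === userPoolArn),
  );
}

// Placeholder ARN for illustration only.
const examplePoolArn = "arn:aws:cognito-idp:us-east-1:123456789012:userpool/us-east-1_EXAMPLE";
const examplePolicy = connectorMgmtPolicy(examplePoolArn);
```

A check like `isTightlyScoped` could run in a CDK unit test so that a wildcard action or resource fails the build instead of relying on code review alone.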

Amendments

| Date | Amendment | Reason | Status |
|---|---|---|---|
| 2026-05-26 | Initial requirements lock via /hl:blue-team against web-stack-w3 design doc | Scope contract for the W3 implementation work | Active |

(Future scope changes get rows here. Don't silently expand scope — amend and re-anchor.)


Implementation work covers ~10-15 hours of effort per D43's estimate. Live tracking happens in the next session's working notes + next.md handoff between sessions, not in this requirements doc. Update this doc only when scope itself changes.
