Foundry

EPIC-4: AWS Production Deploy

Drafted 2026-05-19. Beta-sprint epic 4 of 5. Sequencing: Week 2 Days 8-12. Depends on EPIC-1 (Stack B locked), EPIC-2 (schema), EPIC-3 (MCP server working locally).

Goal

Migrate the local-first MVP to AWS, hosted on Stack B. End state: app.autri.ai serves the Next.js app via Amplify, mcp.autri.ai hosts the MCP server on AgentCore Runtime, RDS holds production data, both authenticated via Cognito with Google federation.

Why this epic exists

Local MVP proves the product works. AWS deploy makes it real for beta users. This epic moves us from "Dan's laptop" to "publicly accessible web product" without changing the substance.

Scope (in)

CDK project location: separate autri-infra repo (clean blast-radius separation from the app repo; Amplify only watches autri repo).

CDK stack organization: 3 stacks in autri-infra:

  1. network-and-data — VPC + subnets (2 AZs) + security groups + RDS Postgres 16 + pgvector + S3 buckets
  2. auth-and-compute — Cognito user pool config + AgentCore Runtime endpoint + IAM roles + Amplify wiring + Parameter Store + Secrets Manager seeds + ACM certs (app.autri.ai, mcp.autri.ai, auth.autri.ai)
  3. monitoring — CloudWatch dashboards + log groups (30-day retention) + AWS Budgets + Logs Insights saved queries

Per-stack contents:

  • network-and-data:

    • VPC + subnets (2 AZs) + security groups
    • RDS Postgres 16 + pgvector via custom parameter group with shared_preload_libraries=vector; first migration runs CREATE EXTENSION IF NOT EXISTS vector;
    • db.t4g.small, single-AZ for beta (Multi-AZ when first paying customer)
    • S3 buckets: uploads, page renders, cache, ingestion artifacts, feedback-screenshots (for in-app feedback feature)
    • 7-day automated RDS backups (default; revisit when paying customers)
  • auth-and-compute:

    • Cognito user pool (re-provisioned in CDK; spike pool from EPIC-1 kept alive through Day 9 per Day 8 Step 0, then torn down in Day 9 step 8)
    • Cognito custom domain auth.autri.ai (requires ACM cert in us-east-1 — Cognito custom domain requirement)
    • AgentCore Runtime endpoint configuration (arm64 container, per EPIC-1)
    • IAM roles for Amplify + AgentCore Runtime + Fargate Tasks
    • Amplify app wired to GitHub autri repo, custom domain app.autri.ai
    • Secrets Manager: credentials (DB password, Anthropic API key, OAuth shared secret, GitHub PAT for issues API)
    • Parameter Store: config (DB hostname, region, Cognito IDs, feature flags, ALLOWED_EMAILS allowlist for post-confirm Lambda)
    • ACM certs for app.autri.ai, mcp.autri.ai, auth.autri.ai
  • monitoring:

    • CloudWatch dashboard: per-stack-layer cost, MCP session counts, RDS metrics, chat query counts
    • CloudWatch Log Groups for app + MCP server, 30-day retention (not default never-expire)
    • AWS Budgets alerts armed at $50/$100/$200/mo
    • Logs Insights saved queries: error rate, top errors, MCP tool call distribution
    • (Cost Anomaly Detection cut per requirements blue-team — AWS Budgets thresholds sufficient)

Amplify project:

  • Connected to GitHub autri repo
  • Auto-build on push to main
  • Next.js 14 App Router build config: explicit next build + .next/standalone output; AWS Amplify auto-detection may not handle App Router correctly — verify Day 9 with Amplify build settings explicitly set
  • Environment variables / secrets wired in via Parameter Store references
  • Custom domain: app.autri.ai

Cognito user pool (re-provisioned in CDK):

  • Custom domain auth.autri.ai (replaces ugly <pool-id>.auth.us-east-1.amazoncognito.com)
  • Google federated IdP configured (re-points from EPIC-1 spike Google OAuth app)
  • Resource server config for mcp.autri.ai with custom scopes
  • Hosted UI default for beta (no custom Amplify Auth components in scope; documented as v1.1 polish)

Post-confirmation Lambda (auto-provisioning + allowlist enforcement):

  • On PostConfirmation Cognito event:
    • Step 1: allowlist check. Read ALLOWED_EMAILS from Parameter Store; if user's email not in list, reject signup (AdminDeleteUser to remove the Cognito user record) and log the rejection. Return error to Cognito (user sees "not invited to beta" message).
    • Step 2: auto-provisioning (only if allowlisted). Same logic as EPIC-2 backfill: create personal org + Personal library + library_access. Idempotent.
    • Step 3: welcome notification. Insert a notifications row (type=welcome, title="Welcome to Autri", body=…, link=/help/claude-desktop).
  • Lambda failures alert via CloudWatch metric; manual cleanup script as fallback
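
The three steps above can be sketched as a handler whose side effects are injected, which also makes the allowlist path easy to unit-test. This is a minimal sketch under stated assumptions, not the deployed Lambda: the Deps interface, parseAllowlist, and the comma-separated format of ALLOWED_EMAILS are hypothetical, and the real handler would back each dependency with Parameter Store, Cognito AdminDeleteUser, and the EPIC-2 tables.

```typescript
// Hypothetical dependency surface; each member stands in for a real AWS or DB call.
interface Deps {
  readAllowedEmails(): Promise<string>; // value of the ALLOWED_EMAILS parameter
  deleteCognitoUser(username: string): Promise<void>; // would wrap AdminDeleteUser
  provisionPersonalOrg(email: string): Promise<void>; // must be idempotent
  insertWelcomeNotification(email: string): Promise<void>;
  logRejection(email: string): void;
}

// Assumption: the allowlist is stored as one comma-separated string.
function parseAllowlist(raw: string): Set<string> {
  const emails = raw
    .split(",")
    .map((e) => e.trim().toLowerCase())
    .filter((e) => e.length > 0);
  return new Set(emails);
}

async function handlePostConfirmation(
  user: { username: string; email: string },
  deps: Deps,
): Promise<void> {
  const allowed = parseAllowlist(await deps.readAllowedEmails());
  const email = user.email.trim().toLowerCase();

  // Step 1: allowlist check. Remove the Cognito record, log, and return an error.
  if (!allowed.has(email)) {
    await deps.deleteCognitoUser(user.username);
    deps.logRejection(email);
    throw new Error("not invited to beta");
  }

  // Step 2: auto-provisioning (safe to repeat only because it is idempotent).
  await deps.provisionPersonalOrg(email);

  // Step 3: welcome notification.
  await deps.insertWelcomeNotification(email);
}
```

Comparing emails case-insensitively is a design choice in this sketch, so that an allowlist entry still matches if the identity provider returns different casing.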

DNS + TLS:

  • Cloudflare CNAMEs: app.autri.ai → Amplify CloudFront; mcp.autri.ai → CloudFront distribution in front of AgentCore Runtime (per F3, not the AgentCore endpoint directly); auth.autri.ai → Cognito custom domain
  • ACM cert validation via DNS records (request Day 9 start; add validation CNAMEs immediately for parallel propagation)
  • Verify HTTPS works for all three subdomains
  • Cloudflare's free-tier DDoS protection covers baseline attack volumes
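
The HTTPS verification can be reduced to two small predicates over a probe result, matching the Day 10 smoke test (the app responds, and the MCP endpoint answers 401 with a WWW-Authenticate challenge). This is a sketch only: the ProbeResult shape and function names are hypothetical, and treating any 2xx from app.autri.ai as healthy is an assumption.

```typescript
// Hypothetical shape for the outcome of one HTTPS probe (e.g. parsed from curl -i).
interface ProbeResult {
  status: number;
  headers: Record<string, string>; // header names may arrive in any case
}

// Case-insensitive header lookup, since HTTP header names are not case-sensitive.
function header(res: ProbeResult, name: string): string | undefined {
  const wanted = name.toLowerCase();
  for (const key of Object.keys(res.headers)) {
    if (key.toLowerCase() === wanted) return res.headers[key];
  }
  return undefined;
}

// Assumption: any 2xx from app.autri.ai counts as "returns app".
function appLooksHealthy(res: ProbeResult): boolean {
  return res.status >= 200 && res.status < 300;
}

// An unauthenticated request to /mcp should be challenged, not served or dropped.
function mcpChallengeLooksValid(res: ProbeResult): boolean {
  const challenge = header(res, "WWW-Authenticate");
  return res.status === 401 && challenge !== undefined && challenge.trim().length > 0;
}
```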

Email infrastructure (CUT per requirements blue-team — replaced by in-app notifications in EPIC-5):

  • AWS SES domain identity for autri.ai
  • SPF, DKIM, DMARC records
  • SES sandbox-out request
  • sendEmail() wrapper
  • noreply@autri.ai sending address

AgentCore container deploy pipeline:

  • GitHub Actions on push to mcp-servers/doc-search/** builds arm64 image, pushes to ECR, triggers AgentCore Runtime update
  • ECR repository created in network-and-data stack
  • Workflow file: .github/workflows/deploy-mcp.yml in autri repo

Database migration:

  • pg_dump from local Docker Postgres
  • Restore to RDS (after CREATE EXTENSION vector runs)
  • Verify counts, sample queries match local state
  • Update connection string in Amplify env vars (via Parameter Store reference)
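
The count verification can be expressed as a diff between two table-to-row-count maps, one gathered from local Postgres and one from RDS. A minimal sketch, assuming the counts have already been collected (for example with SELECT count(*) per table); the Counts and Mismatch types and the diffCounts name are hypothetical.

```typescript
// Table name -> row count, gathered separately from each database.
type Counts = Record<string, number>;

interface Mismatch {
  table: string;
  local: number | undefined; // undefined means the table was not counted locally
  rds: number | undefined;   // undefined means the table was not counted on RDS
}

// Lists every table whose count differs or that is present on only one side.
function diffCounts(local: Counts, rds: Counts): Mismatch[] {
  const tables = new Set<string>(Object.keys(local).concat(Object.keys(rds)));
  const mismatches: Mismatch[] = [];
  for (const table of Array.from(tables).sort()) {
    if (local[table] !== rds[table]) {
      mismatches.push({ table, local: local[table], rds: rds[table] });
    }
  }
  return mismatches;
}
```

An empty result is the signal to proceed with re-pointing the app at RDS; anything else names the tables to inspect first.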

Bedrock model-access approval (fire early Day 8):

  • Request access to anthropic.claude-sonnet-4-6-20251001-v1:0 and anthropic.claude-haiku-4-5-20251001-v1:0 (specific model IDs)
  • 24-48h async approval
  • Pre-write use-case description so submission is fast: "Autri is a knowledge-base platform serving 5-10 beta users. We use Sonnet 4.6 for chat routing and tool orchestration, Haiku 4.5 for document extraction. Expected usage: ~10k requests/month."

Cost telemetry + observability:

  • All resources tagged (project=autri env=beta cost-bucket=<layer>)
  • Already covered in monitoring stack above
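
The tagging convention can be captured in one helper so every stack applies the same three keys, plus a check for resources that are missing them. This is a sketch: the function names are hypothetical, and using the three stack names as the cost-bucket values is an assumption about what "<layer>" resolves to.

```typescript
// Assumption: cost-bucket values mirror the three CDK stack names.
type CostLayer = "network-and-data" | "auth-and-compute" | "monitoring";

const REQUIRED_TAG_KEYS = ["project", "env", "cost-bucket"];

// Builds the standard tag set for one layer.
function buildTags(layer: CostLayer): Record<string, string> {
  return { project: "autri", env: "beta", "cost-bucket": layer };
}

// Returns the required keys that are absent or empty on a resource's tags.
function missingTags(actual: Record<string, string>): string[] {
  return REQUIRED_TAG_KEYS.filter((key) => !(key in actual) || actual[key] === "");
}
```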

Out of scope

  • Bedrock cutover for LLM traffic (Anthropic API direct still serves beta; flip to Bedrock in v1.1)
  • Multi-region (us-east-1 only)
  • Custom Cognito hosted UI (use AWS default for v1)
  • WAF / DDoS protection (defer until paying customers)
  • Mobile-responsive app polish (defer)
  • Stripe / paid tier enforcement
  • Auto-scaling tuning (default min=1/max=1 task for Fargate ingestion; default warm pool for AgentCore)
  • OAuth metadata proxy for Cognito (F2 fix) — DEFERRED to v1.1 per D41 (2026-05-24). Beta MCP UX is power-user manual-config in Claude Desktop Advanced settings (paste server URL + client_id + client_secret + token). MCP-spec clients that require RFC 8414 OAuth authorization-server metadata discovery (Claude.ai Custom Connector, mcp-remote) stay broken for beta. The Lambda + API Gateway proxy that serves RFC 8414 + RFC 9728 PRM endpoints, plus any DCR (RFC 7591) shim work, all move to v1.1.

Dependencies

  • EPIC-1 — Stack B validated, AWS account hygiene done, Cognito + Budgets in place
  • EPIC-2 + EPIC-3 — local MVP working end-to-end (so we know what we're deploying)
  • Bedrock model-access approval lead time: 24-48h, request on Day 8 to be safe

Deliverables

  • app.autri.ai serving the Next.js app over HTTPS via Amplify
  • mcp.autri.ai hosting AgentCore MCP server, authenticated via Cognito
  • RDS database with all dev data migrated, integrity verified
  • Cost dashboards live with per-stack-layer breakdown
  • AWS Budgets alerts armed
  • Bedrock model-access approval submitted (approval pending OK; not blocking)
  • CDK code committed to the autri-infra GitHub repo
  • Cognito Google federation working through the connector creation flow

Implementation plan

Day 8 — Spike teardown + CDK scaffold + Bedrock approval

Step 0 (~15 min): Tear down EPIC-1 spike artifacts.

  • Delete (via console): the spike AgentCore endpoint, ECR repo, spike IAM role, M2M test client (autri-spike-m2m-test), and CloudWatch dashboard.
  • Keep through Day 9: the spike Cognito user pool. Day 8.5's local validation of CognitoJwksAuth (already shipped in commit 198ee6a) needs the spike pool's JWKS endpoint until the CDK-provisioned prod pool replaces it.
  • Keep: the ACM cert for mcp.autri.ai and the Cloudflare DNS validation CNAMEs.
  • Update: the Google Cloud Console OAuth app's redirect URI placeholder; it is re-pointed to the CDK-provisioned Cognito custom domain (auth.autri.ai) on Day 9.

  1. Create autri-infra repo (private until v1.1), scaffold CDK project (TypeScript)
  2. Define network-and-data stack: VPC + RDS + custom parameter group with shared_preload_libraries=vector + S3 buckets (including feedback-screenshots)
  3. cdk deploy autri-network-and-data to autri-prod account
  4. Submit Bedrock model-access approval request for anthropic.claude-sonnet-4-6-20251001-v1:0 + anthropic.claude-haiku-4-5-20251001-v1:0 (use pre-written description)
  5. Run first Drizzle migration against RDS: CREATE EXTENSION IF NOT EXISTS vector; then EPIC-2's library/connector/audit/chat_queries/notifications schema

Submit AWS SES production access request — cut per requirements blue-team (email infrastructure entirely removed; in-app notifications replace email)

Day 8.5 — MCP server AgentCore-readiness pass

Code complete in commit 198ee6a (2026-05-24).

The code-side AgentCore-readiness lifts from EPIC-1 spike findings (F1, F4, F8) landed in commit 198ee6a:

  1. CognitoJwksAuth swap: mcp-servers/doc-search/src/auth.ts has the verbatim spike class + buildAuthVerifier() factory keyed off AGENTCORE_AUTH=hs256-dev|cognito. Default hs256-dev preserves local-dev workflow; production AgentCore Runtime sets AGENTCORE_AUTH=cognito.
  2. Dockerfile: multi-stage arm64 with pnpm deploy --prod for workspace bundling; tsx ships as runtime dep (cold-start cost acceptable for AgentCore microVMs). Build validated: docker build --platform=linux/arm64 → 197MB image, entrypoint runs.
  3. Connector-ID-in-JWT migration (F4): /c/:connectorId/mcp route collapsed to /mcp; connector ID travels as a connector_id JWT claim. Dev: pnpm dev:make-token --connector-id <uuid>. Prod: Cognito Pre-Token-Generation Lambda (now in Day 9 below). Tests pass; UI updated.
  4. OAuth metadata proxy for Cognito (F2): DEFERRED TO v1.1 per D41 (2026-05-24). AgentCore Identity eval confirmed it's a credential vault, not an OAuth authorization server, so a Lambda + API Gateway proxy is still the right path, but the work is bespoke (~1-2 days) and doesn't compound. Beta MCP UX is power-user manual-config (Claude Desktop Advanced settings: paste connector URL + client_id + client_secret + token). Claude.ai Custom Connector and mcp-remote stay broken for beta. See D41 for full rationale; see autri-beta Amendments table for scope contract.
  5. PKCE enforcement on Cognito app client: moved to Day 9 infra (CDK config on the CDK-provisioned app client, not the spike pool).
  6. Stdout request logging (F8): mcp-servers/doc-search/src/log.ts; JSON shape parseable via CloudWatch Logs Insights parse @message.
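
The AGENTCORE_AUTH switch described in item 1 can be illustrated with a small mode resolver. This is a sketch of the selection logic only, not the contents of auth.ts or the real buildAuthVerifier(): the resolveAuthMode name is hypothetical, and failing on unknown values (rather than falling back) is a design assumption so that a typo in production config cannot silently select the dev verifier.

```typescript
type AuthMode = "hs256-dev" | "cognito";

// Resolves the verifier mode from an environment-like map.
// Unset or empty falls back to hs256-dev, preserving the local-dev workflow.
function resolveAuthMode(env: Record<string, string | undefined>): AuthMode {
  const raw = env["AGENTCORE_AUTH"];
  if (raw === undefined || raw === "") return "hs256-dev";
  if (raw === "hs256-dev" || raw === "cognito") return raw;
  // Fail loudly on anything else instead of guessing a mode.
  throw new Error("Unsupported AGENTCORE_AUTH value: " + raw);
}
```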

Day 9 — Auth/compute stack + Amplify config

  1. Define auth-and-compute stack: Cognito pool + custom domain (auth.autri.ai) + ACM cert in us-east-1 + AgentCore Runtime + Amplify wiring + Secrets Manager + Parameter Store (including ALLOWED_EMAILS parameter)
  2. Cognito Pre-Token-Generation Lambda (CDK construct) — reads OAuth client_id from the token-mint event, looks up the matching connector_id from the connectors table (via RDS Data API or direct VPC connection), injects connector_id as a custom claim. Required for the migration shipped in Day 8.5 step 3 to work end-to-end in production.
  3. PKCE enforcement on autri-mcp-client app client — CDK config: generateSecret: false, code flow + S256 PKCE required.
  4. cdk deploy autri-auth-and-compute
  5. ACM certs requested for app.autri.ai, mcp.autri.ai, auth.autri.ai; immediately add DNS validation CNAMEs in Cloudflare so they propagate in parallel
  6. Amplify app: connect to GitHub autri repo, explicitly configure Next.js 14 App Router build settings (next build, standalone output), env vars wired via Parameter Store references (including AGENTCORE_AUTH=cognito, COGNITO_ISSUER, COGNITO_REQUIRED_SCOPE, COGNITO_CLIENT_ID for the MCP container's env)
  7. Update Google Cloud Console OAuth app: add redirect URI https://auth.autri.ai/oauth2/idpresponse (from CDK-provisioned Cognito)
  8. Tear down spike Cognito pool (no longer needed once prod pool is live)
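
The Pre-Token-Generation Lambda in step 2 boils down to: read the client ID, look up its connector, and add one claim to the access token. A minimal sketch under stated assumptions: the event type below is a hand-written subset modeled on Cognito's V2-and-later trigger event and should be checked against the Cognito documentation; ConnectorLookup is a hypothetical stand-in for the connectors-table query; and leaving the event untouched when no connector matches (so non-connector clients still get tokens) is a design assumption.

```typescript
// Hand-written subset of the trigger event; field names are assumptions to verify.
type AccessTokenOverrides = { claimsToAddOrOverride?: Record<string, string> };
type OverrideDetails = { accessTokenGeneration?: AccessTokenOverrides };

interface PreTokenGenEvent {
  callerContext: { clientId: string };
  response: { claimsAndScopeOverrideDetails?: OverrideDetails };
}

// Hypothetical lookup; in production this would query the connectors table.
type ConnectorLookup = (clientId: string) => Promise<string | undefined>;

async function injectConnectorId(
  event: PreTokenGenEvent,
  lookup: ConnectorLookup,
): Promise<PreTokenGenEvent> {
  const connectorId = await lookup(event.callerContext.clientId);

  // No matching connector: return the event unchanged.
  if (connectorId === undefined) return event;

  // Preserve any overrides that are already present on the response.
  const details: OverrideDetails = event.response.claimsAndScopeOverrideDetails ?? {};
  const accessToken: AccessTokenOverrides = details.accessTokenGeneration ?? {};
  const existing: Record<string, string> = accessToken.claimsToAddOrOverride ?? {};

  event.response = {
    claimsAndScopeOverrideDetails: {
      ...details,
      accessTokenGeneration: {
        ...accessToken,
        claimsToAddOrOverride: { ...existing, connector_id: connectorId },
      },
    },
  };
  return event;
}
```

Whether Cognito invokes this trigger for the client_credentials grant depends on the trigger event version configured on the pool, so that is worth confirming on Day 9 before relying on the claim in the Day 11 flow.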

Day 10 — Deploy + DNS + monitoring stack

  1. Build + push MCP server arm64 container to ECR (manually first time; GitHub Actions wiring on Day 12)
  2. Deploy to AgentCore Runtime via CDK with requestHeaderConfiguration.requestHeaderAllowlist=["Authorization"] (per F1) and customJWTAuthorizer.discoveryUrl pointing at Cognito's /.well-known/openid-configuration (no proxy in beta per D41)
  3. Provision CloudFront distribution in front of AgentCore Runtime (per F3): origin path policy rewrites / → /runtimes/{encoded-arn}/invocations?qualifier=DEFAULT; uses the mcp.autri.ai ACM cert
  4. Push autri app to GitHub main → Amplify builds + deploys
  5. Cloudflare DNS: CNAME app.autri.ai → Amplify; CNAME mcp.autri.ai → CloudFront (not AgentCore directly); CNAME auth.autri.ai → Cognito custom domain
  6. ACM cert validation completes (may take minutes-hours)
  7. Smoke test: curl https://app.autri.ai returns app; curl -i https://mcp.autri.ai/mcp returns 401 with WWW-Authenticate header (valid auth challenge per OAuth 2.1 + MCP spec)
  8. Define monitoring stack: CloudWatch dashboards + Budgets + 30-day log retention + Logs Insights queries
  9. cdk deploy autri-monitoring
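
The origin path in step 3 embeds the runtime ARN in URL-encoded form, which is easy to get wrong by hand because the ARN contains both ":" and "/". A small sketch of building that path; the function name is hypothetical, and the ARN in the usage line is a made-up placeholder rather than a real resource.

```typescript
// Builds the AgentCore invocation path from step 3 for a given runtime ARN.
// encodeURIComponent escapes ":" as %3A and "/" as %2F, so the whole ARN
// stays inside a single path segment.
function agentCoreInvocationPath(runtimeArn: string, qualifier: string = "DEFAULT"): string {
  const encodedArn = encodeURIComponent(runtimeArn);
  return "/runtimes/" + encodedArn + "/invocations?qualifier=" + encodeURIComponent(qualifier);
}

// Placeholder ARN for illustration only.
const examplePath = agentCoreInvocationPath(
  "arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/example-abc",
);
// examplePath ===
// "/runtimes/arn%3Aaws%3Abedrock-agentcore%3Aus-east-1%3A123456789012%3Aruntime%2Fexample-abc/invocations?qualifier=DEFAULT"
```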

Day 11 — Database migration + Auth wiring + Cross-domain SSO validation + Allowlist Lambda

  1. pg_dump local Postgres → SQL file
  2. Restore to RDS: psql ... -f dump.sql (extension already enabled in Day 8)
  3. Verify: row counts for users, KBs, libraries, connectors, chat_queries, and notifications match local, and the schema matches local
  4. Update Amplify env vars with RDS connection string (via Parameter Store reference)
  5. Post-confirmation Lambda deployed with three-step logic:
    • Allowlist check (reject + delete Cognito user if email not in ALLOWED_EMAILS)
    • Auto-provision personal org + Personal library + library_access (if allowlisted)
    • Insert welcome notification row
  6. Combined Day 11 validation: real Cognito flow + cross-domain SSO + allowlist + power-user manual-config MCP flow:
    • Add Dan's email to ALLOWED_EMAILS Parameter Store
    • Open app.autri.ai → login with Google (validates Cognito custom domain + allowlist)
    • Land on dashboard → confirm welcome notification appears in bell UI
    • Generate a connector → see endpoint URL + client_id + client_secret
    • Mint a bearer token from the connector credentials (Cognito token endpoint, client_credentials grant; Pre-Token-Gen Lambda injects connector_id claim)
    • Configure Claude Desktop Advanced settings manually with: server URL (https://mcp.autri.ai/mcp), client_id, client_secret, bearer token. This is the beta MCP UX per D41 — no RFC 8414 discovery, no DCR.
    • Invoke a tool from Claude Desktop, confirm successful query against the deployed AgentCore Runtime
    • Verify the Cognito-issued JWT validates against mcp.autri.ai's resource server (cross-domain SSO works)
    • Test allowlist rejection: sign in with a non-allowlisted Google account, confirm rejection + Cognito user cleanup
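
When the JWT check in this validation fails, it helps to know which claim is at fault. The sketch below lists claim-level problems for an already signature-verified access token; it deliberately does not verify the signature against JWKS, which is assumed to happen beforehand. The type and function names are hypothetical, and the claim names (iss, scope, token_use, exp) follow the usual Cognito access-token layout, with connector_id being the custom claim from the Pre-Token-Gen Lambda.

```typescript
// Claims of interest from a decoded, signature-verified access token.
interface AccessTokenClaims {
  iss?: string;
  scope?: string;        // space-separated scopes
  token_use?: string;
  exp?: number;          // expiry, seconds since epoch
  connector_id?: string; // custom claim injected at token mint
}

interface ExpectedClaims {
  issuer: string;        // e.g. the COGNITO_ISSUER value
  requiredScope: string; // e.g. the COGNITO_REQUIRED_SCOPE value
}

// Returns a human-readable list of problems; an empty list means the claims pass.
function claimProblems(
  claims: AccessTokenClaims,
  expected: ExpectedClaims,
  nowSeconds: number,
): string[] {
  const problems: string[] = [];
  if (claims.iss !== expected.issuer) problems.push("issuer mismatch");
  if (claims.token_use !== "access") problems.push("not an access token");
  const scopes = (claims.scope ?? "").split(" ").filter((s) => s.length > 0);
  if (scopes.indexOf(expected.requiredScope) === -1) problems.push("missing required scope");
  if (claims.exp === undefined || claims.exp <= nowSeconds) problems.push("expired");
  if (!claims.connector_id) problems.push("missing connector_id claim");
  return problems;
}
```

An "issuer mismatch" here would point at the audience/issuer risk listed under Risks, while a missing connector_id points at the Pre-Token-Gen Lambda instead.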

Day 12 — Cost telemetry verification + CI wiring

  1. CloudWatch dashboard verified: per-stack-layer cost, MCP session counts, RDS metrics, chat query counts, notifications counts
  2. AWS Budgets alerts at $50/$100/$200/mo confirmed firing (small test invocation)
  3. GitHub Actions workflow (.github/workflows/deploy-mcp.yml) deployed: builds arm64 image on push to mcp-servers/doc-search/**, pushes to ECR, triggers AgentCore update
  4. Verify resource tags applied to everything via Cost Explorer breakdown
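
As a sanity check on the Budgets thresholds, the arithmetic behind "which alerts should have fired" is small enough to sketch. Both helpers are hypothetical illustrations: crossedThresholds compares month-to-date spend with the $50/$100/$200 levels, and projectMonthEnd is a naive linear projection, not how AWS Budgets computes its own forecasts.

```typescript
// Monthly alert levels from this epic, in USD.
const BUDGET_THRESHOLDS_USD = [50, 100, 200];

// Thresholds that a given spend figure has reached or passed.
function crossedThresholds(spendUsd: number): number[] {
  return BUDGET_THRESHOLDS_USD.filter((threshold) => spendUsd >= threshold);
}

// Naive straight-line projection of month-end spend from month-to-date spend.
function projectMonthEnd(spendUsd: number, dayOfMonth: number, daysInMonth: number): number {
  return (spendUsd / dayOfMonth) * daysInMonth;
}
```

For example, $42 spent by day 12 of a 30-day month projects to $105, which would reach the $50 and $100 levels by month end.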

Day 12 email infrastructure end-to-end test — cut per requirements blue-team

Risks

  • ACM cert provisioning timing — DNS validation can take 5 min to 24h. Mitigation: request certs at start of Day 9 + immediately add validation CNAMEs in Cloudflare for parallel propagation.
  • pg_dump/restore cutover requires brief downtime. Acceptable for beta (no users yet). Mitigation: communicate in advance if anyone is testing; verify integrity post-restore before re-pointing app at RDS.
  • Amplify build for Next.js 14 App Router may have config quirks first time. Mitigation: explicit build config on Day 9; fallback plan to deploy via S3 + CloudFront manually if Amplify build fails repeatedly (~1 day of fallback work documented but not pre-built).
  • Bedrock model-access approval is asynchronous (24-48h) — don't block deploy on it. Anthropic API direct continues to serve LLM traffic until Bedrock approves. If approval denied or info requested, stay on Anthropic API and resubmit with more detail.
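  • AWS SES production access (out of sandbox) is asynchronous. Email infrastructure was cut per requirements blue-team, so this risk applies only if email is reinstated: if still in sandbox by Day 13, beta users' emails must be pre-verified individually, which is workable for a 3-user beta but should be flagged in EPIC-5.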
  • AgentCore Runtime container architecture locked as arm64 (cheaper). Build pipeline must match (GitHub Actions buildx for arm64).
  • Cognito Google federation config errors — IdP setup has many small fields. Mitigation: validated in EPIC-1 spike; CDK re-provisioning may surface field-mapping bugs that didn't exist in console-clicked spike. Allow buffer.
  • Cognito custom domain auth.autri.ai ACM cert must be in us-east-1 (Cognito requirement, regardless of where the user pool is). Because the beta is us-east-1 only, the auth cert and the app/mcp certs are all requested in the same region; the constraint only matters if the pool ever moves to another region.
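  • pgvector extension version pin. RDS Postgres 16 ships pgvector ≥0.5.0; confirm the exact version offered by the chosen engine minor version (for example via pg_available_extensions), since the parameter group does not control extension versions. If we need newer features, we may need to run ALTER EXTENSION vector UPDATE; post-deploy, which only reaches versions bundled with the current engine version. Also confirm on Day 8 that RDS accepts vector in shared_preload_libraries: pgvector normally needs only CREATE EXTENSION, so the parameter-group entry may be unnecessary or rejected.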
  • Post-confirmation Lambda trigger error handling. If the Lambda fails on a user signup, the user is created in Cognito but has no Autri-side records (broken state). Mitigation: Lambda must be idempotent + alert on failures; manual cleanup script as fallback.
  • Cross-domain Cognito SSO failure. Day 11 validation may surface that the same Cognito JWT doesn't validate cleanly against both subdomains (audience/issuer mismatch). Mitigation: if fails, configure separate OAuth resource servers per subdomain.

Definition of done

  • EPIC-1 spike artifacts torn down (Day 8 Step 0)
  • CDK project committed to autri-infra repo; 3 stacks deployed (network-and-data, auth-and-compute, monitoring)
  • RDS Postgres 16 + pgvector live (custom parameter group + CREATE EXTENSION migration ran successfully)
  • All EPIC-2 tables created on RDS: libraries, library_kbs, library_access, connectors, mcp_audit_log, chat_queries, notifications
  • Amplify deploys Next.js 14 App Router app on push to main; live at app.autri.ai over HTTPS
  • AgentCore Runtime hosts MCP server (arm64); live at mcp.autri.ai over HTTPS
  • Cognito custom domain auth.autri.ai live (replaces ugly default Cognito URL)
  • Cognito Google federation works end-to-end through connector creation flow
  • Cross-domain Cognito SSO verified: same JWT validates against both app.autri.ai and mcp.autri.ai
  • Stub-auth removed from MCP server (AGENTCORE_AUTH=none flag deleted; real Cognito JWKS validation active)
  • Post-confirmation Lambda deployed with allowlist check + auto-provisioning + welcome notification insert
  • ALLOWED_EMAILS Parameter Store populated with Dan + mom + STEM Racing engineer emails
  • Allowlist rejection tested (non-allowlisted Google account is denied + Cognito user record deleted)
  • Data migrated from local Postgres to RDS, integrity verified (counts match)
  • GitHub Actions workflow deploys MCP container on push to mcp-servers/doc-search/**
  • Cost dashboards live with per-stack-layer breakdown
  • AWS Budgets alerts armed at $50/$100/$200/mo
  • CloudWatch Log Groups have 30-day retention (not never-expire default)
  • CDK code committed to autri-infra GitHub
  • Bedrock model-access approval submitted for Sonnet 4.6 + Haiku 4.5 (approval status logged)
  • All resources tagged correctly (project=autri env=beta cost-bucket=<layer>)
  • CloudWatch Logs aggregated; Logs Insights queries saved (error rate, top errors, MCP tool calls)

Notes / open questions

Locked this triage pass (2026-05-19):

  • CDK in separate autri-infra repo (not in app repo)
  • 3 CDK stacks: network-and-data / auth-and-compute / monitoring
  • RDS pgvector via custom parameter group + CREATE EXTENSION in first migration
  • auth.autri.ai = Cognito custom domain (ACM cert in us-east-1)
  • Cognito hosted UI default for beta (no custom Amplify Auth UI; v1.1 polish)
  • Cross-domain SSO test explicit in Day 11 (app.autri.ai ↔ mcp.autri.ai)
  • Stub-auth removal as combined Day 11 step
  • AgentCore container CI via GitHub Actions on push to mcp-servers/doc-search/**
  • Secrets Manager for credentials, Parameter Store for config (including ALLOWED_EMAILS)
  • Bedrock model IDs specified: claude-sonnet-4-6-20251001-v1:0 + claude-haiku-4-5-20251001-v1:0
  • CloudWatch log retention set to 30 days
  • Bedrock approval use-case description pre-written (see Scope)
  • Post-confirmation Lambda does allowlist check + auto-provisioning + welcome notification insert
  • Day 8 Step 0: explicit spike artifact teardown
  • Container architecture: arm64 (locked from EPIC-1)

Cut by requirements blue-team (2026-05-19):

  • AWS SES + DNS verification + sending wrapper — replaced by in-app notifications (EPIC-2 schema + EPIC-5 UI)
  • SES production access request — no email infrastructure at all
  • beta@autri.ai outbound alias — replaced by mailto: link on login page (no infra needed)
  • Cost Anomaly Detection daily email — AWS Budgets thresholds sufficient

Still open (decide during implementation):

  • RDS instance class — t4g.small for beta; watch metrics and scale to t4g.medium during high-traffic ingestion if needed
  • Whether autri-infra repo should be public or private — lean private until v1.1
  • GitHub Personal Access Token vs GitHub App for the feedback-issues integration (EPIC-5) — lean PAT for v1, GitHub App when team grows
  • Amplify SSR Lambda memory/timeout — start at defaults, tune if cold start UX suffers
  • ALLOWED_EMAILS Parameter Store update flow — manual edit + Lambda redeploy for v1; automate when allowlist grows past 10 users
