EPIC-4: AWS Production Deploy
Drafted 2026-05-19. Beta-sprint epic 4 of 5. Sequencing: Week 2 Days 8-12. Depends on EPIC-1 (Stack B locked), EPIC-2 (schema), EPIC-3 (MCP server working locally).
Goal
Migrate the local-first MVP to AWS, hosted on Stack B. End state: app.autri.ai serves the Next.js app via Amplify, mcp.autri.ai hosts the MCP server on AgentCore Runtime, RDS holds production data, both authenticated via Cognito with Google federation.
Why this epic exists
Local MVP proves the product works. AWS deploy makes it real for beta users. This epic moves us from "Dan's laptop" to "publicly accessible web product" without changing the substance.
Scope (in)
CDK project location: separate autri-infra repo (clean blast-radius separation from the app repo; Amplify only watches autri repo).
CDK stack organization: 3 stacks in autri-infra:
- network-and-data: VPC + subnets (2 AZs) + security groups + RDS Postgres 16 + pgvector + S3 buckets
- auth-and-compute: Cognito user pool config + AgentCore Runtime endpoint + IAM roles + Amplify wiring + Parameter Store + Secrets Manager seeds + ACM certs (app.autri.ai, mcp.autri.ai, auth.autri.ai)
- monitoring: CloudWatch dashboards + log groups (30-day retention) + AWS Budgets + Logs Insights saved queries
Per-stack contents:
- network-and-data:
  - VPC + subnets (2 AZs) + security groups
  - RDS Postgres 16 + pgvector via custom parameter group with shared_preload_libraries=vector; first migration runs CREATE EXTENSION IF NOT EXISTS vector;
  - db.t4g.small, single-AZ for beta (Multi-AZ when first paying customer)
  - S3 buckets: uploads, page renders, cache, ingestion artifacts, feedback-screenshots (for in-app feedback feature)
  - 7-day automated RDS backups (default; revisit when paying customers)
- auth-and-compute:
  - Cognito user pool (re-provisioned in CDK; spike pool from EPIC-1 torn down on Day 9 once the prod pool is live)
  - Cognito custom domain auth.autri.ai (requires ACM cert in us-east-1, a Cognito custom domain requirement)
  - AgentCore Runtime endpoint configuration (arm64 container, per EPIC-1)
  - IAM roles for Amplify + AgentCore Runtime + Fargate Tasks
  - Amplify app wired to GitHub autri repo, custom domain app.autri.ai
  - Secrets Manager: credentials (DB password, Anthropic API key, OAuth shared secret, GitHub PAT for issues API)
  - Parameter Store: config (DB hostname, region, Cognito IDs, feature flags, ALLOWED_EMAILS allowlist for post-confirm Lambda)
  - ACM certs for app.autri.ai, mcp.autri.ai, auth.autri.ai
- monitoring:
  - CloudWatch dashboard: per-stack-layer cost, MCP session counts, RDS metrics, chat query counts
  - CloudWatch Log Groups for app + MCP server, 30-day retention (not default never-expire)
  - AWS Budgets alerts armed at $50/$100/$200/mo
  - Logs Insights saved queries: error rate, top errors, MCP tool call distribution
  - (Cost Anomaly Detection cut per requirements blue-team; AWS Budgets thresholds sufficient)
Amplify project:
- Connected to GitHub autri repo
- Auto-build on push to main
- Next.js 14 App Router build config: explicit next build + .next/standalone output; AWS Amplify auto-detection may not handle App Router correctly, so verify on Day 9 with Amplify build settings explicitly set
- Environment variables / secrets wired in via Parameter Store references
- Custom domain: app.autri.ai
Cognito user pool (re-provisioned in CDK):
- Custom domain auth.autri.ai (replaces ugly <pool-id>.auth.us-east-1.amazoncognito.com)
- Google federated IdP configured (re-points from EPIC-1 spike Google OAuth app)
- Resource server config for mcp.autri.ai with custom scopes
- Hosted UI default for beta (no custom Amplify Auth components in scope; documented as v1.1 polish)
Post-confirmation Lambda (auto-provisioning + allowlist enforcement):
- On PostConfirmation Cognito event:
  - Step 1: allowlist check. Read ALLOWED_EMAILS from Parameter Store; if the user's email is not in the list, reject signup (AdminDeleteUser to remove the Cognito user record) and log the rejection. Return error to Cognito (user sees "not invited to beta" message).
  - Step 2: auto-provisioning (only if allowlisted). Same logic as EPIC-2 backfill: create personal org + Personal library + library_access. Idempotent.
  - Step 3: welcome notification. Insert a notifications row (type=welcome, title="Welcome to Autri", body=…, link=/help/claude-desktop).
- Lambda failures alert via CloudWatch metric; manual cleanup script as fallback
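The Step 1 allowlist decision can be sketched as a pure function, kept separate from the Cognito and Parameter Store calls so it is easy to unit-test. This is a minimal sketch, not the Lambda itself: parseAllowlist, isAllowlisted, and decideSignup are hypothetical names, and the comma-separated format for ALLOWED_EMAILS and the case-insensitive comparison are assumptions.

```typescript
// Hypothetical helpers; the comma-separated ALLOWED_EMAILS format is an assumption.
function parseAllowlist(raw: string): string[] {
  return raw
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0);
}

function isAllowlisted(email: string | undefined, raw: string): boolean {
  if (!email) return false; // no email attribute: treat as not invited
  return parseAllowlist(raw).indexOf(email.trim().toLowerCase()) !== -1;
}

type SignupDecision =
  | { action: "provision"; email: string }
  | { action: "reject"; reason: string };

// Step 1 outcome: "provision" continues to Steps 2 and 3,
// "reject" maps to AdminDeleteUser plus an error returned to Cognito.
function decideSignup(email: string | undefined, raw: string): SignupDecision {
  if (email && isAllowlisted(email, raw)) {
    return { action: "provision", email: email.trim().toLowerCase() };
  }
  return { action: "reject", reason: "not invited to beta" };
}
```

Normalizing case and whitespace before comparing avoids rejecting an invited user whose Google account reports the address with different capitalization.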
DNS + TLS:
- Cloudflare CNAMEs: app.autri.ai → Amplify CloudFront; mcp.autri.ai → CloudFront distribution in front of AgentCore Runtime (per F3; see Day 10); auth.autri.ai → Cognito custom domain
- ACM cert validation via DNS records (request Day 9 start; add validation CNAMEs immediately for parallel propagation)
- Verify HTTPS works for all three subdomains
- Cloudflare's free-tier DDoS protection covers baseline attack volumes
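The HTTPS verification can be expressed as a small check over probe results, matching the Day 10 smoke test where mcp.autri.ai/mcp is expected to answer 401 with a WWW-Authenticate header. This is a sketch under assumptions: ProbeResult, isAuthChallenge, and smokeFailures are hypothetical names, the probe itself (curl or fetch) is left out, and treating 200 as the expected status for app.autri.ai is an assumption.

```typescript
// Hypothetical shape for a probe result gathered by curl or fetch.
type ProbeResult = { status: number; headers: { [name: string]: string } };

// A valid auth challenge is a 401 that carries a WWW-Authenticate header
// (header names are case-insensitive).
function isAuthChallenge(res: ProbeResult): boolean {
  const hasChallenge = Object.keys(res.headers).some(
    (name) => name.toLowerCase() === "www-authenticate",
  );
  return res.status === 401 && hasChallenge;
}

// Returns human-readable failures; an empty array means the smoke test passed.
function smokeFailures(app: ProbeResult, mcp: ProbeResult): string[] {
  const failures: string[] = [];
  if (app.status !== 200) {
    failures.push(`app.autri.ai returned ${app.status}, expected 200 (assumed)`);
  }
  if (!isAuthChallenge(mcp)) {
    failures.push("mcp.autri.ai/mcp did not return 401 with WWW-Authenticate");
  }
  return failures;
}
```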
Email infrastructure (CUT per requirements blue-team — replaced by in-app notifications in EPIC-5):
- AWS SES domain identity for autri.ai
- SPF, DKIM, DMARC records
- SES sandbox-out request
- sendEmail() wrapper
- noreply@autri.ai sending address
AgentCore container deploy pipeline:
- GitHub Actions on push to mcp-servers/doc-search/** builds arm64 image, pushes to ECR, triggers AgentCore Runtime update
- ECR repository created in network-and-data stack
- Workflow file: .github/workflows/deploy-mcp.yml in autri repo
Database migration:
- pg_dump from local Docker Postgres
- Restore to RDS (after CREATE EXTENSION vector runs)
- Verify counts, sample queries match local state
- Update connection string in Amplify env vars (via Parameter Store reference)
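The count verification can be reduced to a diff over per-table row counts gathered from both databases. This is a minimal sketch: diffRowCounts is a hypothetical helper, and the counts are assumed to come from running SELECT count(*) per table against local Postgres and RDS (the querying is not shown).

```typescript
// Per-table row counts; a table absent on one side is reported as "missing".
type RowCounts = { [table: string]: number | undefined };

function diffRowCounts(local: RowCounts, rds: RowCounts): string[] {
  const tables = Object.keys(local)
    .concat(Object.keys(rds).filter((table) => !(table in local)))
    .sort();
  const mismatches: string[] = [];
  for (const table of tables) {
    const before = local[table];
    const after = rds[table];
    if (before !== after) {
      mismatches.push(
        `${table}: local=${before === undefined ? "missing" : before} rds=${after === undefined ? "missing" : after}`,
      );
    }
  }
  return mismatches; // empty array means counts match
}
```

An empty result is a necessary check, not a sufficient one, which is why the plan pairs it with sample queries.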
Bedrock model-access approval (fire early Day 8):
- Request access to anthropic.claude-sonnet-4-6-20251001-v1:0 and anthropic.claude-haiku-4-5-20251001-v1:0 (specific model IDs)
- 24-48h async approval
- Pre-write use-case description so submission is fast: "Autri is a knowledge-base platform serving 5-10 beta users. We use Sonnet 4.6 for chat routing and tool orchestration, Haiku 4.5 for document extraction. Expected usage: ~10k requests/month."
Cost telemetry + observability:
- All resources tagged (project=autri env=beta cost-bucket=<layer>)
- Already covered in monitoring stack above
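The tagging convention can be captured in one helper so every stack applies the same keys. This is a sketch under an assumption: the doc only says cost-bucket=<layer>, so using the three stack names as the layer values is hypothetical, as are the names resourceTags and missingTags.

```typescript
// Assumption: cost-bucket layers mirror the three CDK stack names.
type CostBucket = "network-and-data" | "auth-and-compute" | "monitoring";

// Tag set from the plan: project=autri env=beta cost-bucket=<layer>.
function resourceTags(layer: CostBucket): { [key: string]: string } {
  return { project: "autri", env: "beta", "cost-bucket": layer };
}

// Lists tag keys that are absent or wrong on a resource, for the Day 12 tag audit.
function missingTags(actual: { [key: string]: string }, layer: CostBucket): string[] {
  const expected = resourceTags(layer);
  return Object.keys(expected).filter((key) => actual[key] !== expected[key]);
}
```

In a CDK app, a map like this would typically be applied once per stack rather than per resource.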
Out of scope
- Bedrock cutover for LLM traffic (Anthropic API direct still serves beta; flip to Bedrock in v1.1)
- Multi-region (us-east-1 only)
- Custom Cognito hosted UI (use AWS default for v1)
- WAF / DDoS protection (defer until paying customers)
- Mobile-responsive app polish (defer)
- Stripe / paid tier enforcement
- Auto-scaling tuning (default min=1/max=1 task for Fargate ingestion; default warm pool for AgentCore)
- OAuth metadata proxy for Cognito (F2 fix): DEFERRED to v1.1 per D41 (2026-05-24). Beta MCP UX is power-user manual-config in Claude Desktop Advanced settings (paste server URL + client_id + client_secret + token). MCP-spec clients that require RFC 8414 OAuth authorization-server metadata discovery (Claude.ai Custom Connector, mcp-remote) stay broken for beta. The Lambda + API Gateway proxy that serves RFC 8414 + RFC 9728 PRM endpoints, plus any DCR (RFC 7591) shim work, all move to v1.1.
Dependencies
- EPIC-1 — Stack B validated, AWS account hygiene done, Cognito + Budgets in place
- EPIC-2 + EPIC-3 — local MVP working end-to-end (so we know what we're deploying)
- Bedrock model-access approval lead time: 24-48h, request on Day 8 to be safe
Deliverables
- app.autri.ai serving the Next.js app over HTTPS via Amplify
- mcp.autri.ai hosting AgentCore MCP server, authenticated via Cognito
- RDS database with all dev data migrated, integrity verified
- Cost dashboards live with per-stack-layer breakdown
- AWS Budgets alerts armed
- Bedrock model-access approval submitted (approval pending OK; not blocking)
- CDK code committed to the autri-infra GitHub repo
- Cognito Google federation working through the connector creation flow
Implementation plan
Day 8 — Spike teardown + CDK scaffold + Bedrock approval
Step 0 (~15 min): Tear down EPIC-1 spike artifacts. Console-delete the spike AgentCore endpoint + ECR repo + spike IAM role + M2M test client (autri-spike-m2m-test) + CloudWatch dashboard. Keep the spike Cognito user pool alive through Day 9 — Day 8.5's local validation of CognitoJwksAuth (already shipped in commit 198ee6a) needs the spike pool's JWKS endpoint until the CDK-provisioned prod pool replaces it. Update the Google Cloud Console OAuth app's redirect URI placeholder — will re-point to the CDK-provisioned Cognito custom domain (auth.autri.ai) in Day 9. Keep: ACM cert for mcp.autri.ai, Cloudflare DNS validation CNAMEs.
- Create autri-infra repo (private until v1.1), scaffold CDK project (TypeScript)
- Define network-and-data stack: VPC + RDS + custom parameter group with shared_preload_libraries=vector + S3 buckets (including feedback-screenshots)
- cdk deploy autri-network-and-data to autri-prod account
- Submit Bedrock model-access approval request for anthropic.claude-sonnet-4-6-20251001-v1:0 + anthropic.claude-haiku-4-5-20251001-v1:0 (use pre-written description)
- Run first Drizzle migration against RDS: CREATE EXTENSION IF NOT EXISTS vector; then EPIC-2's library/connector/audit/chat_queries/notifications schema
Submit AWS SES production access request — cut per requirements blue-team (email infrastructure entirely removed; in-app notifications replace email)
Day 8.5 — MCP server AgentCore-readiness pass ✅ Code complete in commit 198ee6a (2026-05-24).
The code-side AgentCore-readiness lifts from EPIC-1 spike findings (F1, F4, F8) landed in commit 198ee6a:
- ✅ CognitoJwksAuth swap: mcp-servers/doc-search/src/auth.ts has the verbatim spike class + buildAuthVerifier() factory keyed off AGENTCORE_AUTH=hs256-dev|cognito. Default hs256-dev preserves local-dev workflow; production AgentCore Runtime sets AGENTCORE_AUTH=cognito.
- ✅ Dockerfile: multi-stage arm64 with pnpm deploy --prod for workspace bundling; tsx ships as runtime dep (cold-start cost acceptable for AgentCore microVMs). Build validated: docker build --platform=linux/arm64 → 197MB image, entrypoint runs.
- ✅ Connector-ID-in-JWT migration (F4): /c/:connectorId/mcp route collapsed to /mcp; connector ID travels as a connector_id JWT claim. Dev: pnpm dev:make-token --connector-id <uuid>. Prod: Cognito Pre-Token-Generation Lambda (now in Day 9 below). Tests pass; UI updated.
- ❌ OAuth metadata proxy for Cognito (F2): DEFERRED TO v1.1 per D41 (2026-05-24). AgentCore Identity eval confirmed it's a credential vault, not an OAuth authorization server. Lambda + API Gateway proxy is still the right path, but the work is bespoke (~1-2 days) and doesn't compound. Beta MCP UX is power-user manual-config (Claude Desktop Advanced settings: paste connector URL + client_id + client_secret + token). Claude.ai Custom Connector and mcp-remote stay broken for beta. See D41 for full rationale; see autri-beta Amendments table for scope contract.
- → PKCE enforcement on Cognito app client: moved to Day 9 infra (CDK config on the CDK-provisioned app client, not the spike pool).
- ✅ Stdout request logging (F8): mcp-servers/doc-search/src/log.ts; JSON shape parseable via CloudWatch Logs Insights parse @message.
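The AGENTCORE_AUTH switch described in the first item can be illustrated with a small mode resolver. This is not the code in auth.ts: resolveAuthMode is a hypothetical name, and failing fast on an unrecognized value (rather than silently falling back to hs256-dev) is an assumed design choice.

```typescript
type AuthMode = "hs256-dev" | "cognito";

// Resolves the verifier mode from environment variables.
// Unset or empty defaults to hs256-dev, preserving the local-dev workflow.
function resolveAuthMode(env: { [key: string]: string | undefined }): AuthMode {
  const raw = env["AGENTCORE_AUTH"];
  if (raw === undefined || raw === "") return "hs256-dev";
  if (raw === "hs256-dev" || raw === "cognito") return raw;
  // Assumed behavior: a typo in production should stop startup,
  // not quietly downgrade to the dev verifier.
  throw new Error(`Unsupported AGENTCORE_AUTH value: ${raw}`);
}
```

A factory such as buildAuthVerifier() could branch on the returned mode; in a Node process the argument would be process.env.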
Day 9 — Auth/compute stack + Amplify config
- Define auth-and-compute stack: Cognito pool + custom domain (auth.autri.ai) + ACM cert in us-east-1 + AgentCore Runtime + Amplify wiring + Secrets Manager + Parameter Store (including ALLOWED_EMAILS parameter)
- Cognito Pre-Token-Generation Lambda (CDK construct): reads OAuth client_id from the token-mint event, looks up the matching connector_id from the connectors table (via direct VPC connection; RDS Data API is Aurora-only, so it does not apply to this RDS Postgres instance), injects connector_id as a custom claim. Required for the migration shipped in Day 8.5 step 3 to work end-to-end in production.
- PKCE enforcement on autri-mcp-client app client: CDK config with generateSecret: false, code flow + S256 PKCE required.
- cdk deploy autri-auth-and-compute
- ACM certs requested for app.autri.ai, mcp.autri.ai, auth.autri.ai: immediately add DNS validation CNAMEs in Cloudflare so they propagate in parallel
- Amplify app: connect to GitHub autri repo, explicitly configure Next.js 14 App Router build settings (next build, standalone output), env vars wired via Parameter Store references (including AGENTCORE_AUTH=cognito, COGNITO_ISSUER, COGNITO_REQUIRED_SCOPE, COGNITO_CLIENT_ID for the MCP container's env)
- Update Google Cloud Console OAuth app: add redirect URI https://auth.autri.ai/oauth2/idpresponse (from CDK-provisioned Cognito)
- Tear down spike Cognito pool (no longer needed once prod pool is live)
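The claim-injection step of the Pre-Token-Generation Lambda can be sketched as a pure transform on the trigger event. This is a sketch under assumptions: it assumes the V2-or-later trigger event shape (claimsAndScopeOverrideDetails.accessTokenGeneration), defines only the fields it touches, and leaves out the connectors-table lookup that maps callerContext.clientId to a connector ID. withConnectorClaim is a hypothetical name.

```typescript
// Minimal local types covering only the fields this sketch reads or writes.
type ClaimMap = { [claim: string]: string };
type AccessOverrides = { claimsToAddOrOverride?: ClaimMap };
type Overrides = { accessTokenGeneration?: AccessOverrides };

interface PreTokenGenEvent {
  callerContext: { clientId: string };
  response: { claimsAndScopeOverrideDetails?: Overrides };
}

// Returns a copy of the event with connector_id added to the access-token claims.
// Existing overrides are preserved; the input event is not mutated.
function withConnectorClaim(event: PreTokenGenEvent, connectorId: string): PreTokenGenEvent {
  const details: Overrides = event.response.claimsAndScopeOverrideDetails || {};
  const access: AccessOverrides = details.accessTokenGeneration || {};
  const claims: ClaimMap = access.claimsToAddOrOverride || {};
  return {
    ...event,
    response: {
      ...event.response,
      claimsAndScopeOverrideDetails: {
        ...details,
        accessTokenGeneration: {
          ...access,
          claimsToAddOrOverride: { ...claims, connector_id: connectorId },
        },
      },
    },
  };
}
```

The handler would look up the connector for event.callerContext.clientId, fail the token mint if none exists, and return the transformed event.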
Day 10 — Deploy + DNS + monitoring stack
- Build + push MCP server arm64 container to ECR (manually first time; GitHub Actions wiring on Day 12)
- Deploy to AgentCore Runtime via CDK with requestHeaderConfiguration.requestHeaderAllowlist=["Authorization"] (per F1) and customJWTAuthorizer.discoveryUrl pointing at Cognito's /.well-known/openid-configuration (no proxy in beta per D41)
- Provision CloudFront distribution in front of AgentCore Runtime (per F3): origin path policy rewrites / → /runtimes/{encoded-arn}/invocations?qualifier=DEFAULT; uses the mcp.autri.ai ACM cert
- Push autri app to GitHub main → Amplify builds + deploys
- Cloudflare DNS: CNAME app.autri.ai → Amplify; CNAME mcp.autri.ai → CloudFront (not AgentCore directly); CNAME auth.autri.ai → Cognito custom domain
- ACM cert validation completes (may take minutes-hours)
- Smoke test: curl https://app.autri.ai returns app; curl -i https://mcp.autri.ai/mcp returns 401 with WWW-Authenticate header (valid auth challenge per OAuth 2.1 + MCP spec)
- Define monitoring stack: CloudWatch dashboards + Budgets + 30-day log retention + Logs Insights queries
- cdk deploy autri-monitoring
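The CloudFront rewrite target from the second item, /runtimes/{encoded-arn}/invocations?qualifier=DEFAULT, can be built with a small helper. This is a sketch: agentCoreInvocationPath is a hypothetical name, the ARN in the usage example is a made-up placeholder, and using encodeURIComponent for the {encoded-arn} segment is an assumption about the required encoding.

```typescript
// Builds the origin path that CloudFront rewrites incoming requests to.
// The ARN is percent-encoded so its ":" and "/" survive as one path segment.
function agentCoreInvocationPath(runtimeArn: string, qualifier: string = "DEFAULT"): string {
  const encodedArn = encodeURIComponent(runtimeArn);
  return `/runtimes/${encodedArn}/invocations?qualifier=${encodeURIComponent(qualifier)}`;
}

// Usage with a placeholder ARN (not a real resource):
const examplePath = agentCoreInvocationPath(
  "arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/example-runtime",
);
// examplePath ===
// "/runtimes/arn%3Aaws%3Abedrock-agentcore%3Aus-east-1%3A123456789012%3Aruntime%2Fexample-runtime/invocations?qualifier=DEFAULT"
```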
Day 11 — Database migration + Auth wiring + Cross-domain SSO validation + Allowlist Lambda
- pg_dump local Postgres → SQL file
- Restore to RDS: psql ... -f dump.sql (extension already enabled in Day 8)
- Verify: counts of users, KBs, libraries, connectors, chat_queries, notifications schema match local
- Update Amplify env vars with RDS connection string (via Parameter Store reference)
- Post-confirmation Lambda deployed with three-step logic:
  - Allowlist check (reject + delete Cognito user if email not in ALLOWED_EMAILS)
  - Auto-provision personal org + Personal library + library_access (if allowlisted)
  - Insert welcome notification row
- Combined Day 11 validation: real Cognito flow + cross-domain SSO + allowlist + power-user manual-config MCP flow:
  - Add Dan's email to ALLOWED_EMAILS Parameter Store
  - Open app.autri.ai → login with Google (validates Cognito custom domain + allowlist)
  - Land on dashboard → confirm welcome notification appears in bell UI
  - Generate a connector → see endpoint URL + client_id + client_secret
  - Mint a bearer token from the connector credentials (Cognito token endpoint, client_credentials grant; Pre-Token-Gen Lambda injects connector_id claim)
  - Configure Claude Desktop Advanced settings manually with: server URL (https://mcp.autri.ai/mcp), client_id, client_secret, bearer token. This is the beta MCP UX per D41: no RFC 8414 discovery, no DCR.
  - Invoke a tool from Claude Desktop, confirm successful query against the deployed AgentCore Runtime
  - Verify the Cognito-issued JWT validates against mcp.autri.ai's resource server (cross-domain SSO works)
  - Test allowlist rejection: sign in with a non-allowlisted Google account, confirm rejection + Cognito user cleanup
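The token-mint step in the validation flow uses the client_credentials grant against the Cognito token endpoint on the custom domain. The sketch below only builds the request so it can be inspected or passed to fetch; buildClientCredentialsRequest is a hypothetical name, the scope string in the usage note is a placeholder (the real custom scopes are defined on the mcp.autri.ai resource server), and sending credentials via HTTP Basic auth is one of the client authentication methods the endpoint accepts.

```typescript
interface TokenRequest {
  url: string;
  method: "POST";
  headers: { [name: string]: string };
  body: string;
}

// Builds a client_credentials token request for the Cognito /oauth2/token endpoint.
function buildClientCredentialsRequest(
  clientId: string,
  clientSecret: string,
  scope: string,
): TokenRequest {
  const basic = btoa(`${clientId}:${clientSecret}`); // client_id:client_secret, base64
  return {
    url: "https://auth.autri.ai/oauth2/token",
    method: "POST",
    headers: {
      Authorization: `Basic ${basic}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: `grant_type=client_credentials&scope=${encodeURIComponent(scope)}`,
  };
}
```

For example, buildClientCredentialsRequest(id, secret, "mcp.autri.ai/read") (placeholder scope) yields a request whose JSON response carries the access_token to paste into Claude Desktop as the bearer token.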
Day 12 — Cost telemetry verification + CI wiring
- CloudWatch dashboard verified: per-stack-layer cost, MCP session counts, RDS metrics, chat query counts, notifications counts
- AWS Budgets alerts at $50/$100/$200/mo confirmed firing (small test invocation)
- GitHub Actions workflow (.github/workflows/deploy-mcp.yml) deployed: builds arm64 image on push to mcp-servers/doc-search/**, pushes to ECR, triggers AgentCore update
- Verify resource tags applied to everything via Cost Explorer breakdown
Day 12 email infrastructure end-to-end test — cut per requirements blue-team
Risks
- ACM cert provisioning timing — DNS validation can take 5 min to 24h. Mitigation: request certs at start of Day 9 + immediately add validation CNAMEs in Cloudflare for parallel propagation.
- pg_dump/restore cutover requires brief downtime. Acceptable for beta (no users yet). Mitigation: communicate in advance if anyone is testing; verify integrity post-restore before re-pointing app at RDS.
- Amplify build for Next.js 14 App Router may have config quirks first time. Mitigation: explicit build config on Day 9; fallback plan to deploy via S3 + CloudFront manually if Amplify build fails repeatedly (~1 day of fallback work documented but not pre-built).
- Bedrock model-access approval is asynchronous (24-48h) — don't block deploy on it. Anthropic API direct continues to serve LLM traffic until Bedrock approves. If approval denied or info requested, stay on Anthropic API and resubmit with more detail.
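- AWS SES production access (out of sandbox) is asynchronous. Moot for this epic now that email infrastructure is cut per requirements blue-team (in-app notifications replace email); revisit only if email is reinstated, in which case sandboxed sending would require pre-verifying each beta user's address.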
- AgentCore Runtime container architecture locked as arm64 (cheaper). Build pipeline must match (GitHub Actions buildx for arm64).
- Cognito Google federation config errors — IdP setup has many small fields. Mitigation: validated in EPIC-1 spike; CDK re-provisioning may surface field-mapping bugs that didn't exist in console-clicked spike. Allow buffer.
- Cognito custom domain auth.autri.ai ACM cert must be in us-east-1 (Cognito requirement, regardless of where the user pool is). Two cert requests in us-east-1 (the custom-domain one + the app/mcp ones if pool is also us-east-1).
- pgvector extension version pin. RDS Postgres 16 ships pgvector ≥0.5.0 (verify in parameter group selection). If we need newer features, may need to upgrade extension via ALTER EXTENSION vector UPDATE; post-deploy.
- Post-confirmation Lambda trigger error handling. If the Lambda fails on a user signup, the user is created in Cognito but has no Autri-side records (broken state). Mitigation: Lambda must be idempotent + alert on failures; manual cleanup script as fallback.
- Cross-domain Cognito SSO failure. Day 11 validation may surface that the same Cognito JWT doesn't validate cleanly against both subdomains (audience/issuer mismatch). Mitigation: if fails, configure separate OAuth resource servers per subdomain.
Definition of done
- EPIC-1 spike artifacts torn down (Day 8 Step 0)
- CDK project committed to autri-infra repo; 3 stacks deployed (network-and-data, auth-and-compute, monitoring)
- RDS Postgres 16 + pgvector live (custom parameter group + CREATE EXTENSION migration ran successfully)
- All EPIC-2 tables created on RDS: libraries, library_kbs, library_access, connectors, mcp_audit_log, chat_queries, notifications
- Amplify deploys Next.js 14 App Router app on push to main; live at app.autri.ai over HTTPS
- AgentCore Runtime hosts MCP server (arm64); live at mcp.autri.ai over HTTPS
- Cognito custom domain auth.autri.ai live (replaces ugly default Cognito URL)
- Cognito Google federation works end-to-end through connector creation flow
- Cross-domain Cognito SSO verified: same JWT validates against both app.autri.ai and mcp.autri.ai
- Stub-auth removed from MCP server (AGENTCORE_AUTH=none flag deleted; real Cognito JWKS validation active)
- Post-confirmation Lambda deployed with allowlist check + auto-provisioning + welcome notification insert
- ALLOWED_EMAILS Parameter Store populated with Dan + mom + STEM Racing engineer emails
- Allowlist rejection tested (non-allowlisted Google account is denied + Cognito user record deleted)
- Data migrated from local Postgres to RDS, integrity verified (counts match)
- GitHub Actions workflow deploys MCP container on push to mcp-servers/doc-search/**
- Cost dashboards live with per-stack-layer breakdown
- AWS Budgets alerts armed at $50/$100/$200/mo
- CloudWatch Log Groups have 30-day retention (not never-expire default)
- CDK code committed to autri-infra GitHub repo
- Bedrock model-access approval submitted for Sonnet 4.6 + Haiku 4.5 (approval status logged)
- All resources tagged correctly (project=autri env=beta cost-bucket=<layer>)
- CloudWatch Logs aggregated; Logs Insights queries saved (error rate, top errors, MCP tool calls)
Notes / open questions
Locked this triage pass (2026-05-19):
- CDK in separate autri-infra repo (not in app repo)
- 3 CDK stacks: network-and-data / auth-and-compute / monitoring
- RDS pgvector via custom parameter group + CREATE EXTENSION in first migration
- auth.autri.ai = Cognito custom domain (ACM cert in us-east-1)
- Cognito hosted UI default for beta (no custom Amplify Auth UI; v1.1 polish)
- Cross-domain SSO test explicit in Day 11 (app.autri.ai ↔ mcp.autri.ai)
- Stub-auth removal as combined Day 11 step
- AgentCore container CI via GitHub Actions on push to mcp-servers/doc-search/**
- Secrets Manager for credentials, Parameter Store for config (including ALLOWED_EMAILS)
- Bedrock model IDs specified: claude-sonnet-4-6-20251001-v1:0 + claude-haiku-4-5-20251001-v1:0
- CloudWatch log retention set to 30 days
- Bedrock approval use-case description pre-written (see Scope)
- Post-confirmation Lambda does allowlist check + auto-provisioning + welcome notification insert
- Day 8 Step 0: explicit spike artifact teardown
- Container architecture: arm64 (locked from EPIC-1)
Cut by requirements blue-team (2026-05-19):
- AWS SES + DNS verification + sending wrapper: replaced by in-app notifications (EPIC-2 schema + EPIC-5 UI)
- SES production access request: no email infrastructure at all
- beta@autri.ai outbound alias: replaced by mailto: link on login page (no infra needed)
- Cost Anomaly Detection daily email: AWS Budgets thresholds sufficient
Still open (decide during implementation):
- RDS instance class — t4g.small for beta; watch metrics and scale to t4g.medium during high-traffic ingestion if needed
- Whether autri-infra repo should be public or private: lean private until v1.1
- GitHub Personal Access Token vs GitHub App for the feedback-issues integration (EPIC-5): lean PAT for v1, GitHub App when team grows
- Amplify SSR Lambda memory/timeout: start at defaults, tune if cold start UX suffers
- ALLOWED_EMAILS Parameter Store update flow: manual edit + Lambda redeploy for v1; automate when allowlist grows past 10 users