EPIC-1: AgentCore Runtime POC Spike

Drafted 2026-05-19. Beta-sprint epic 1 of 5. Sequencing: Week 0 prep (half day). Can run in parallel with EPIC-2 + EPIC-3.

Goal

Validate Stack B (AgentCore Runtime + Amplify + RDS + Cognito) end-to-end with minimum-viable scope. Decision criteria at the end: lock Stack B OR fall back to Stack A (all-Fargate from D33). This is the de-risking step before committing AWS production-deploy work to AgentCore.

Why this epic exists

AgentCore Runtime is 2 months old at decision time and we don't have community-validated patterns yet. Subagent research (see aws-infra-options.md § Open questions thread) confirmed the documented behavior — Streamable HTTP transport required, sub-second cold-start from warm pool, 8h compute lifecycle with logical session persistence — but documented behavior is not always real behavior. This spike validates the docs against our actual workload before we lock the architecture.

Scope (in)

AWS account hygiene: AWS Organization (Hannah Labs root) + autri-prod sub-account, IAM admin role + MFA, resource tagging strategy (project=autri env=beta cost-bucket=<layer>)
AWS account opened under the Hannah Labs entity (Hannah Labs Google Workspace admin email, Hannah Labs account name); Dan's personal card as the temporary billing method, to be replaced by the Mercury card in ~1 week
AWS Activate Founders application submitted in parallel (Hannah Labs account; $1k credits)
Google Cloud Console OAuth app registered under Hannah Labs Google Workspace (for Cognito Google federation)
Cognito user pool with Google federated identity provider
Hello-world MCP server (arm64 container per AgentCore packaging spec; listens on 0.0.0.0:8000/mcp) deployed to AgentCore Runtime — single tool returning "hello world"
Stub-auth fallback prepared (env var AGENTCORE_AUTH=none skips JWT validation) — used if Cognito integration debugging threatens the half-day budget; swap to real Cognito once container + transport validated
Streamable HTTP transport validated end-to-end via Claude Desktop
Cognito OAuth flow: app authenticates user → token issued → token validates against MCP endpoint (against mcp.autri.ai only; cross-subdomain SSO with app.autri.ai deferred to EPIC-4)
AgentCore endpoint URL stability verified (custom-domain mapping OR stable ARN-based endpoint) before pointing Cloudflare DNS
Cost telemetry setup: tags, CloudWatch dashboard, AWS Budgets alerts at $50/$100/$200/mo thresholds, Cost Anomaly Detection daily email
Bedrock model-access approval submitted (Sonnet + Haiku) — fire early so the 24-48h clock starts
Claude Desktop reconnect behavior at the 8h compute boundary — tested via forced microVM swap (redeploy AgentCore container mid-session), NOT by waiting 8h

Out of scope

Production MCP server logic (just hello-world; real server is EPIC-3)
Library/connector schema (EPIC-2)
Amplify deploy of the Next.js app (EPIC-4)
Beta user onboarding (EPIC-5)

Dependencies

Dan creates the Hannah Labs AWS account if it does not already exist (estimated 15 min)
mcp.autri.ai subdomain available in Cloudflare DNS (already controlled)
Local dev environment functional (already in place)

Deliverables

Hello-world MCP server live at mcp.autri.ai on AgentCore Runtime
Working Cognito user pool with Google federation
Spike findings written up at projects/autri/epics/epic-1-spike-findings.md (~half page)
Explicit go/no-go decision: "Stack B locked" OR "Fallback to Stack A — blocker is X"
Budgets alerts armed in production AWS account
Bedrock model-access approval request submitted

Implementation plan (half day)

Step 0 — Pre-spike (do before sitting down for the half-day; can be done async over a few days)

Open AWS account under Hannah Labs (Workspace admin email, account name = "Hannah Labs"); enable MFA on root
Submit AWS Activate Founders application (Hannah Labs account; ~5 min)
Register Google Cloud Console OAuth app under Hannah Labs Workspace (~15 min)
Start Mercury business account application (~10 min; physical/virtual card arrives in ~3-5 business days; update AWS billing card when it does)

Step 1 — AWS account hygiene (30 min)

Enable AWS Organization at Hannah Labs root account
Create autri-prod sub-account
IAM admin role for Dan in autri-prod, MFA enforced
Resource tagging policy documented (project=autri env=beta cost-bucket=<layer>)
AWS Budgets at $50/$100/$200/mo thresholds
Request ACM cert for mcp.autri.ai FIRST — the DNS validation CNAME needs to propagate through Cloudflare. Add the validation CNAME to Cloudflare immediately so the cert validates in the background while later steps proceed.
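
The tagging policy and budget thresholds above could be captured as data before any console clicks. The following is a minimal sketch, assuming the alerts are expressed as absolute-dollar notifications on a single monthly cost budget; the budget name, the "mcp" layer value, and the example email are hypothetical placeholders, not decisions made in this epic. The dictionary follows the request shape of the AWS Budgets CreateBudget API.

```python
from typing import Any


def resource_tags(layer: str) -> dict[str, str]:
    """Tags from the policy above; `layer` fills the <layer> placeholder."""
    return {"project": "autri", "env": "beta", "cost-bucket": layer}


def budget_request(
    account_id: str,
    email: str,
    thresholds_usd: tuple[float, ...] = (50.0, 100.0, 200.0),
) -> dict[str, Any]:
    """Build CreateBudget keyword arguments with one email alert per threshold."""
    notifications = [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": threshold,
                "ThresholdType": "ABSOLUTE_VALUE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for threshold in thresholds_usd
    ]
    return {
        "AccountId": account_id,
        "Budget": {
            # Hypothetical name; the budget limit is set to the highest threshold.
            "BudgetName": "autri-beta-monthly",
            "BudgetLimit": {"Amount": f"{max(thresholds_usd):.2f}", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": notifications,
    }
```

If the budget is later scripted rather than clicked, the result could be passed as `boto3.client("budgets").create_budget(**budget_request(account_id, email))`.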

Step 2 — Cognito setup (30 min)

User pool in us-east-1
Google federated IdP configured (using OAuth app registered in Step 0)
OAuth resource server: mcp.autri.ai with custom scopes
Hosted UI URL noted
Prepare stub-auth fallback: one env var (AGENTCORE_AUTH=none) in the hello-world server that skips JWT validation — used only if Cognito integration debugging blows the budget in Step 4
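
The stub-auth switch can be as small as one function in the hello-world server. This is a sketch under assumptions: the function names are hypothetical, and the choice to validate by default (so that only an explicit AGENTCORE_AUTH=none disables validation) is a suggestion rather than something this epic specifies. The issuer and JWKS URL helpers follow the standard Cognito user pool URL format; the pool ID shown in the usage note is a placeholder.

```python
import os
from typing import Mapping, Optional


def jwt_validation_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only when AGENTCORE_AUTH is explicitly set to 'none'."""
    env = os.environ if env is None else env
    return env.get("AGENTCORE_AUTH", "").strip().lower() != "none"


def cognito_issuer(region: str, user_pool_id: str) -> str:
    """Issuer ('iss' claim) that tokens from a Cognito user pool carry."""
    return f"https://cognito-idp.{region}.amazonaws.com/{user_pool_id}"


def cognito_jwks_url(region: str, user_pool_id: str) -> str:
    """Location of the pool's public signing keys for JWT verification."""
    return cognito_issuer(region, user_pool_id) + "/.well-known/jwks.json"
```

For example, `cognito_jwks_url("us-east-1", "us-east-1_EXAMPLE")` gives the key set a JWT library would fetch once the stub is swapped for real Cognito. Failing closed means a missing or mistyped variable leaves validation on.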

Step 3 — AgentCore Runtime deploy (1-2 hours)

Containerize hello-world MCP server: arm64, Node or Python, Streamable HTTP per AgentCore protocol contract, listens on 0.0.0.0:8000/mcp, single tool returning "hello world"
Push container to ECR
Configure AgentCore Runtime: container image, Cognito as OAuth issuer (or stub-auth if needed), ACM cert for mcp.autri.ai
Verify AgentCore endpoint URL stability (custom-domain mapping OR stable ARN endpoint) before next step — if endpoint changes per redeploy, surface as blocker
Cloudflare DNS: CNAME mcp.autri.ai → AgentCore endpoint
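
For orientation, the message handling behind a single "hello world" tool is sketched below. It covers only the JSON-RPC 2.0 dispatch for the MCP `tools/list` and `tools/call` methods; it is not a full server. The initialize handshake, session headers, and the Streamable HTTP listener on 0.0.0.0:8000/mcp would normally come from an MCP SDK. The tool name `hello` and the helper names are hypothetical.

```python
from typing import Any, Optional

# Hypothetical tool definition; the spike only requires a single tool.
HELLO_TOOL: dict[str, Any] = {
    "name": "hello",
    "description": "Returns a fixed greeting.",
    "inputSchema": {"type": "object", "properties": {}},
}


def _error(msg_id: Any, code: int, message: str) -> dict[str, Any]:
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": msg_id, "error": {"code": code, "message": message}}


def handle_message(msg: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Handle one JSON-RPC message; notifications (no 'id') get no reply."""
    if "id" not in msg:
        return None
    method = msg.get("method")
    params = msg.get("params") or {}
    if method == "tools/list":
        result: dict[str, Any] = {"tools": [HELLO_TOOL]}
    elif method == "tools/call":
        if params.get("name") != HELLO_TOOL["name"]:
            return _error(msg["id"], -32602, "Unknown tool")
        result = {"content": [{"type": "text", "text": "hello world"}], "isError": False}
    else:
        return _error(msg["id"], -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": msg["id"], "result": result}
```

Keeping the dispatch as a pure function makes it easy to unit-test the container logic separately from the AgentCore deploy, which helps isolate transport problems from tool problems.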

Step 4 — Claude Desktop validation (30 min)

Add MCP connector config to Claude Desktop pointing at https://mcp.autri.ai
Authenticate via Cognito OAuth flow (or skip auth if running stub fallback)
Invoke the hello-world tool
Verify Streamable HTTP transport (inspect response headers; verify text/event-stream content-type if streaming)
If running stub fallback: swap AGENTCORE_AUTH=none → real Cognito, repeat Claude Desktop test, debug auth layer in isolation
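
The header check in this step can be made mechanical. The sketch below assumes the response headers and body have already been captured by whatever HTTP client is used for the test; the function names are hypothetical. It classifies a response by media type and, for a streamed reply, extracts the JSON payloads carried in the server-sent events.

```python
import json
from typing import Any, Mapping


def transport_kind(headers: Mapping[str, str]) -> str:
    """Classify a response as 'sse', 'json', or 'unexpected' by Content-Type."""
    content_type = ""
    for key, value in headers.items():
        if key.lower() == "content-type":
            content_type = value
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/event-stream":
        return "sse"
    if media_type == "application/json":
        return "json"
    return "unexpected"


def sse_json_payloads(body: str) -> list[Any]:
    """Parse the JSON carried in the 'data:' fields of a server-sent event stream."""
    payloads: list[Any] = []
    data_lines: list[str] = []
    # The trailing empty string flushes a final event that lacks a blank line.
    for line in body.splitlines() + [""]:
        if line.startswith("data:"):
            value = line[len("data:"):]
            data_lines.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_lines:
            payloads.append(json.loads("\n".join(data_lines)))
            data_lines = []
    return payloads
```

A result of "unexpected" (for example an HTML error page from a proxy in front of the endpoint) is worth recording in the spike findings, since it points at the DNS or endpoint layer rather than the MCP server.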

Step 5 — Cost telemetry setup (30 min)

CloudWatch dashboard with AgentCore session metrics (vCPU-seconds, GB-hours)
Cost Explorer view filtered by project=autri
Budgets alerts verified firing on a test trigger
Cost Anomaly Detection email subscription confirmed
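
The Cost Explorer view filtered by project=autri can also be expressed as an API query, which makes it repeatable after the console-click spike is torn down. This is a sketch, assuming a 7-day daily window and a breakdown by the cost-bucket tag; the function name and defaults are hypothetical. The dictionary matches the request shape of the Cost Explorer GetCostAndUsage API, where the end date is exclusive. Both tags would need to be activated as cost allocation tags before they appear in results.

```python
from datetime import date, timedelta
from typing import Any


def cost_query(end: date, days: int = 7) -> dict[str, Any]:
    """Build GetCostAndUsage arguments for project=autri, grouped by cost-bucket."""
    start = end - timedelta(days=days)
    return {
        # 'End' is exclusive in Cost Explorer, so this covers `days` full days.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Tags": {"Key": "project", "Values": ["autri"]}},
        "GroupBy": [{"Type": "TAG", "Key": "cost-bucket"}],
    }
```

With credentials in place, this could be run as `boto3.client("ce").get_cost_and_usage(**cost_query(date.today()))`.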

Step 6 — 8h reconnect test via forced microVM swap (30 min)

Open a Claude Desktop session against AgentCore, confirm a tool call works
Redeploy the AgentCore container (forces a new microVM with the same Mcp-Session-Id)
Observe Claude Desktop behavior: does it reconnect transparently, or see a hard disconnect requiring a fresh handshake?
Record finding (this is the most uncertain client-behavior question per the spike's premise — don't skip)

Step 7 — Documentation (30 min)

Write spike-findings doc at projects/autri/epics/epic-1-spike-findings.md
Record the lock/fallback decision
Note IaC strategy for EPIC-4: console-click spike artifacts will be torn down and rebuilt in CDK during EPIC-4; spike artifacts are throwaway by design

Risks

AgentCore Runtime + Cognito OAuth config first-time setup may consume the half-day budget. Mitigation: stub-auth fallback (AGENTCORE_AUTH=none) prepared in Step 2 so AgentCore deploy + transport validation can proceed independently of Cognito. Real Cognito swap happens after stub-validated baseline.
ACM cert DNS validation through Cloudflare is the slow part, not the cert request itself. Mitigation: request cert in Step 1 + immediately add the validation CNAME to Cloudflare DNS so propagation happens in parallel with later work.
AgentCore endpoint URL stability — if the AgentCore endpoint changes per redeploy, we can't point Cloudflare at it cleanly. Mitigation: verify in Step 3 BEFORE configuring DNS; if endpoint isn't stable, surface as a blocker or use AgentCore custom-domain mapping if available.
Bedrock model-access approval is async (24-48h). Submit the request at the start of Step 0 so the clock starts early; approval does not block the spike itself.
Mercury card arrival (~3-5 business days) doesn't block EPIC-1; AWS billing card updates at any time. Dan's personal card on AWS during the bridge period is acceptable since account ownership = Hannah Labs from day one.
Stack A fallback path: if AgentCore reveals an unrecoverable blocker, fall back to all-Fargate (Stack A from aws-infra-options.md) without architectural re-design.

Definition of done

Pre-spike (Step 0):

Hannah Labs AWS account created (entity ownership, not personal)
AWS Activate Founders application submitted
Google Cloud Console OAuth app registered under Hannah Labs Workspace
Mercury business account application started

Spike (Steps 1-7):

Notes / open questions

Locked:

Container architecture: arm64 (required by the AgentCore packaging spec; see Scope)
IaC strategy: console-click spike, tear down + rebuild in CDK for EPIC-4
Cross-subdomain Cognito SSO test (app.autri.ai ↔ mcp.autri.ai) moved to EPIC-4 (since app.autri.ai doesn't exist until then)
AWS root-user MFA: 1Password authenticator for v1 (Dan already uses 1Password); hardware key deferred to post-beta when revenue justifies

Still open (to validate empirically during spike):

Does AgentCore Runtime offer a stable custom-domain mapping, or does endpoint URL change per redeploy? (Verify in Step 3 before DNS)
If Cognito hosted UI feels too ugly for the spike, document as a v1.1 fix-up; not a blocker.
AWS account billing card swap to Mercury — likely happens ~5 days after spike, parallel to EPIC-2/EPIC-3 work
