EPIC-1: AgentCore Runtime POC Spike
Drafted 2026-05-19. Beta-sprint epic 1 of 5. Sequencing: Week 0 prep (half day). Can run in parallel with EPIC-2 + EPIC-3.
Goal
Validate Stack B (AgentCore Runtime + Amplify + RDS + Cognito) end-to-end with minimum-viable scope. Decision criteria at the end: lock Stack B OR fall back to Stack A (all-Fargate from D33). This is the de-risking step before committing AWS production-deploy work to AgentCore.
Why this epic exists
AgentCore Runtime is 2 months old at decision time and we don't have community-validated patterns yet. Subagent research (see aws-infra-options.md § Open questions thread) confirmed the documented behavior — Streamable HTTP transport required, sub-second cold-start from warm pool, 8h compute lifecycle with logical session persistence — but documented behavior is not always real behavior. This spike validates the docs against our actual workload before we lock the architecture.
Scope (in)
- AWS account hygiene: AWS Organization (Hannah Labs root) + autri-prod sub-account, IAM admin role + MFA, resource tagging strategy (project=autri env=beta cost-bucket=<layer>)
- AWS account opened under Hannah Labs entity (Hannah Labs Google Workspace admin email, Hannah Labs account name); Dan's personal card as temporary billing method, Mercury card replaces in ~1 week
- AWS Activate Founders application submitted in parallel (Hannah Labs account; $1k credits)
- Google Cloud Console OAuth app registered under Hannah Labs Google Workspace (for Cognito Google federation)
- Cognito user pool with Google federated identity provider
- Hello-world MCP server (arm64 container per AgentCore packaging spec; listens on 0.0.0.0:8000/mcp) deployed to AgentCore Runtime, with a single tool returning "hello world"
- Stub-auth fallback prepared (env var AGENTCORE_AUTH=none skips JWT validation), used if Cognito integration debugging threatens the half-day budget; swap to real Cognito once container + transport validated
- Streamable HTTP transport validated end-to-end via Claude Desktop
- Cognito OAuth flow: app authenticates user → token issued → token validates against MCP endpoint (against mcp.autri.ai only; cross-subdomain SSO with app.autri.ai deferred to EPIC-4)
- AgentCore endpoint URL stability verified (custom-domain mapping OR stable ARN-based endpoint) before pointing Cloudflare DNS
- Cost telemetry setup: tags, CloudWatch dashboard, AWS Budgets alerts at $50/$100/$200/mo thresholds, Cost Anomaly Detection daily email
- Bedrock model-access approval submitted (Sonnet + Haiku) — fire early so the 24-48h clock starts
- Claude Desktop reconnect behavior at the 8h compute boundary — tested via forced microVM swap (redeploy AgentCore container mid-session), NOT by waiting 8h
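The "token validates against MCP endpoint" item above can be sketched as a claims check that runs after signature verification. This is a minimal sketch, not the spike's actual validator: the function name, the pool ID `us-east-1_EXAMPLE`, and the scope name `mcp.autri.ai/invoke` are all hypothetical, and signature verification against the pool's JWKS is assumed to have happened elsewhere.

```python
import time


def check_claims(claims, issuer, required_scope, now=None):
    """Return a list of problems in an already signature-verified JWT payload.

    Assumes a Cognito-style access token: 'iss' is the user pool URL,
    'token_use' is "access", and 'scope' is a space-separated string.
    """
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer:
        problems.append("issuer mismatch")
    if claims.get("token_use") != "access":
        problems.append("not an access token")
    if claims.get("exp", 0) <= now:
        problems.append("token expired")
    if required_scope not in claims.get("scope", "").split():
        problems.append("missing scope")
    return problems


# Hypothetical values for illustration only.
ISSUER = "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE"
SCOPE = "mcp.autri.ai/invoke"
```

An empty list means the claims passed. Returning every problem at once, rather than failing on the first, may make the half-day debugging loop shorter.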
Out of scope
- Production MCP server logic (just hello-world; real server is EPIC-3)
- Library/connector schema (EPIC-2)
- Amplify deploy of the Next.js app (EPIC-4)
- Beta user onboarding (EPIC-5)
Dependencies
- Dan creates the Hannah Labs AWS account if not already (estimated 15 min)
- mcp.autri.ai subdomain available in Cloudflare DNS (already controlled)
- Local dev environment functional (already in place)
Deliverables
- Hello-world MCP server live at mcp.autri.ai on AgentCore Runtime
- Working Cognito user pool with Google federation
- Spike findings written up at projects/autri/epics/epic-1-spike-findings.md (~half page)
- Explicit go/no-go decision: "Stack B locked" OR "Fallback to Stack A — blocker is X"
- Budgets alerts armed in production AWS account
- Bedrock model-access approval request submitted
Implementation plan (half day)
Step 0 — Pre-spike (do before sitting down for the half-day; can be done async over a few days)
- Open AWS account under Hannah Labs (Workspace admin email, account name = "Hannah Labs"); enable MFA on root
- Submit AWS Activate Founders application (Hannah Labs account; ~5 min)
- Register Google Cloud Console OAuth app under Hannah Labs Workspace (~15 min)
- Start Mercury business account application (~10 min; physical/virtual card arrives in ~3-5 business days; update AWS billing card when it does)
Step 1 — AWS account hygiene (30 min)
- Enable AWS Organization at Hannah Labs root account
- Create autri-prod sub-account
- IAM admin role for Dan in autri-prod, MFA enforced
- Resource tagging policy documented (project=autri env=beta cost-bucket=<layer>)
- AWS Budgets at $50/$100/$200/mo thresholds
- Request ACM cert for mcp.autri.ai FIRST: the DNS validation CNAME needs to propagate through Cloudflare. Add the validation CNAME to Cloudflare immediately so the cert validates in the background while later steps proceed.
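The tagging policy from this step could be captured in a small helper so every resource gets the same three keys. This is a sketch under assumptions: the function name and the example layer value `mcp` are hypothetical, and the Key/Value list is the shape many (not all) AWS tagging APIs accept.

```python
def build_tags(layer, env="beta", project="autri"):
    """Return the project/env/cost-bucket tags as a list of Key/Value dicts."""
    if not layer:
        raise ValueError("cost-bucket layer must be non-empty")
    tags = {"project": project, "env": env, "cost-bucket": layer}
    return [{"Key": key, "Value": value} for key, value in tags.items()]


# Example: tags for a hypothetical "mcp" cost bucket.
MCP_TAGS = build_tags("mcp")
```

Rejecting an empty layer keeps untagged spend from silently landing outside any cost bucket.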
Step 2 — Cognito setup (30 min)
- User pool in us-east-1
- Google federated IdP configured (using OAuth app registered in Step 0)
- OAuth resource server: mcp.autri.ai with custom scopes
- Hosted UI URL noted
- Prepare stub-auth fallback: one env var (AGENTCORE_AUTH=none) in the hello-world server that skips JWT validation; used only if Cognito integration debugging blows the budget in Step 4
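The stub-auth switch described above might look like the following. This is a minimal sketch assuming the server reads the flag once per request; the function name is hypothetical, and only the AGENTCORE_AUTH=none value comes from this plan.

```python
import os


def jwt_validation_enabled(env=None):
    """Return False only when AGENTCORE_AUTH is explicitly set to "none".

    Any other value, or an unset variable, keeps JWT validation on,
    so a missing or misspelled setting fails closed.
    """
    env = os.environ if env is None else env
    return env.get("AGENTCORE_AUTH", "").strip().lower() != "none"
```

Passing a plain dict as `env` makes the switch easy to exercise without touching the real process environment.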
Step 3 — AgentCore Runtime deploy (1-2 hours)
- Containerize hello-world MCP server: arm64, Node or Python, Streamable HTTP per AgentCore protocol contract, listens on 0.0.0.0:8000/mcp, single tool returning "hello world"
- Push container to ECR
- Configure AgentCore Runtime: container image, Cognito as OAuth issuer (or stub-auth if needed), ACM cert for mcp.autri.ai
- Verify AgentCore endpoint URL stability (custom-domain mapping OR stable ARN endpoint) before next step; if endpoint changes per redeploy, surface as blocker
- Cloudflare DNS: CNAME mcp.autri.ai → AgentCore endpoint
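For orientation, the single-tool behavior of the hello-world server can be sketched as a JSON-RPC dispatcher. This is an illustration of the message shapes only, not the spike's server: it omits the initialize handshake, session headers, and the HTTP layer, and a real build would more likely use an MCP SDK. The tool name `hello` and the function name are hypothetical.

```python
HELLO_TOOL = {
    "name": "hello",  # hypothetical tool name
    "description": "Returns a fixed greeting.",
    "inputSchema": {"type": "object", "properties": {}},
}


def _error(request_id, code, message):
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": code, "message": message}}


def handle_rpc(request):
    """Dispatch one parsed JSON-RPC request dict and return the response dict."""
    request_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": [HELLO_TOOL]}
    elif method == "tools/call":
        name = request.get("params", {}).get("name")
        if name != HELLO_TOOL["name"]:
            return _error(request_id, -32602, "Unknown tool")
        result = {"content": [{"type": "text", "text": "hello world"}],
                  "isError": False}
    else:
        return _error(request_id, -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request_id, "result": result}
```

Keeping the dispatcher free of any HTTP code means it can be checked locally before the container is built.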
Step 4 — Claude Desktop validation (30 min)
- Add MCP connector config to Claude Desktop pointing at https://mcp.autri.ai
- Authenticate via Cognito OAuth flow (or skip auth if running stub fallback)
- Invoke the hello-world tool
- Verify Streamable HTTP transport (inspect response headers; verify text/event-stream content-type if streaming)
- If running stub fallback: swap AGENTCORE_AUTH=none → real Cognito, repeat Claude Desktop test, debug auth layer in isolation
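The header inspection in this step can be reduced to a small classifier. This is a sketch with a hypothetical function name; it assumes the server answers a POST with either a single JSON body or an SSE stream, the two response types Streamable HTTP allows.

```python
def classify_response(content_type):
    """Map a Content-Type header value to 'json', 'sse', or 'unexpected'."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json":
        return "json"
    if media_type == "text/event-stream":
        return "sse"
    return "unexpected"
```

An "unexpected" result (for example an HTML error page from a proxy) is worth recording in the findings doc, since it usually points at DNS or auth rather than the MCP server itself.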
Step 5 — Cost telemetry setup (30 min)
- CloudWatch dashboard with AgentCore session metrics (vCPU-seconds, GB-hours)
- Cost Explorer view filtered by project=autri
- Budgets alerts verified firing on a test trigger
- Cost Anomaly Detection email subscription confirmed
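When verifying the Budgets alerts on a test trigger, it may help to write down which alerts should fire for a given month-to-date spend. A minimal sketch, assuming each alert fires once actual spend reaches its threshold; the names are hypothetical and only the $50/$100/$200 values come from this plan.

```python
THRESHOLDS_USD = (50, 100, 200)


def crossed_thresholds(month_to_date_usd, thresholds=THRESHOLDS_USD):
    """Return the thresholds that the given month-to-date spend has reached."""
    return [limit for limit in thresholds if month_to_date_usd >= limit]
```

For example, a test spend of 120 USD would be expected to trigger the first two alerts and leave the third silent.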
Step 6 — 8h reconnect test via forced microVM swap (30 min)
- Open a Claude Desktop session against AgentCore, confirm a tool call works
- Redeploy the AgentCore container (forces a new microVM with the same Mcp-Session-Id)
- Observe Claude Desktop behavior: does it reconnect transparently, or see a hard disconnect requiring a fresh handshake?
- Record finding (this is the most uncertain client-behavior question per the spike's premise — don't skip)
Step 7 — Documentation (30 min)
- Write spike-findings doc at projects/autri/epics/epic-1-spike-findings.md
- Record the lock/fallback decision
- Note IaC strategy for EPIC-4: console-click spike artifacts will be torn down and rebuilt in CDK during EPIC-4; spike artifacts are throwaway by design
Risks
- AgentCore Runtime + Cognito OAuth config first-time setup may consume the half-day budget. Mitigation: stub-auth fallback (AGENTCORE_AUTH=none) prepared in Step 2 so AgentCore deploy + transport validation can proceed independently of Cognito. Real Cognito swap happens after stub-validated baseline.
- ACM cert DNS validation through Cloudflare is the slow part, not the cert request itself. Mitigation: request cert in Step 1 + immediately add the validation CNAME to Cloudflare DNS so propagation happens in parallel with later work.
- AgentCore endpoint URL stability — if the AgentCore endpoint changes per redeploy, we can't point Cloudflare at it cleanly. Mitigation: verify in Step 3 BEFORE configuring DNS; if endpoint isn't stable, surface as a blocker or use AgentCore custom-domain mapping if available.
- Bedrock model-access approval is async (24-48h). Submit at start of Step 0 to start the clock; not blocking the spike itself.
- Mercury card arrival (~3-5 business days) doesn't block EPIC-1; AWS billing card updates at any time. Dan's personal card on AWS during the bridge period is acceptable since account ownership = Hannah Labs from day one.
- Stack A fallback path: if AgentCore reveals an unrecoverable blocker, fall back to all-Fargate (Stack A from aws-infra-options.md) without architectural re-design.
Definition of done
Pre-spike (Step 0):
- Hannah Labs AWS account created (entity ownership, not personal)
- AWS Activate Founders application submitted
- Google Cloud Console OAuth app registered under Hannah Labs Workspace
- Mercury business account application started
Spike (Steps 1-7):
- AWS Organization + autri-prod sub-account
- IAM admin role with MFA configured
- Resource tagging strategy applied
- ACM cert for mcp.autri.ai requested + Cloudflare DNS validation CNAME added
- Cognito user pool live with Google federation working
- Stub-auth fallback prepared (AGENTCORE_AUTH=none)
- Hello-world MCP server (arm64, Streamable HTTP, 0.0.0.0:8000/mcp) deployed to AgentCore Runtime
- AgentCore endpoint URL stability verified before Cloudflare DNS pointed
- Claude Desktop connects successfully via Streamable HTTP
- Cognito OAuth flow validated end-to-end against mcp.autri.ai (cross-subdomain SSO test deferred to EPIC-4)
- CloudWatch dashboard shows AgentCore session metrics
- AWS Budgets alerts armed at $50/$100/$200/mo
- Cost Anomaly Detection email subscription active
- Bedrock model-access approval submitted
- 8h reconnect behavior tested via forced microVM swap
- Spike findings documented
- Explicit Stack B / Stack A decision recorded
- IaC strategy for EPIC-4 noted in findings doc (rebuild in CDK)
Notes / open questions
Locked:
- Container architecture: arm64 (cheaper, AgentCore supports both)
- IaC strategy: console-click spike, tear down + rebuild in CDK for EPIC-4
- Cross-subdomain Cognito SSO test (app.autri.ai ↔ mcp.autri.ai) moved to EPIC-4 (since app.autri.ai doesn't exist until then)
- AWS root-user MFA: 1Password authenticator for v1 (Dan already uses 1Password); hardware key deferred to post-beta when revenue justifies
Still open (to validate empirically during spike):
- Does AgentCore Runtime offer a stable custom-domain mapping, or does endpoint URL change per redeploy? (Verify in Step 3 before DNS)
- If Cognito hosted UI feels too ugly for the spike, document as a v1.1 fix-up; not a blocker.
- AWS account billing card swap to Mercury — likely happens ~5 days after spike, parallel to EPIC-2/EPIC-3 work