EPIC-1: AgentCore Runtime POC Spike

Drafted 2026-05-19. Beta-sprint epic 1 of 5. Sequencing: Week 0 prep (half day). Can run in parallel with EPIC-2 + EPIC-3.

Goal

Validate Stack B (AgentCore Runtime + Amplify + RDS + Cognito) end-to-end with minimum-viable scope. Decision criteria at the end: lock Stack B OR fall back to Stack A (all-Fargate from D33). This is the de-risking step before committing AWS production-deploy work to AgentCore.

Why this epic exists

AgentCore Runtime is 2 months old at decision time and we don't have community-validated patterns yet. Subagent research (see aws-infra-options.md § Open questions thread) confirmed the documented behavior — Streamable HTTP transport required, sub-second cold-start from warm pool, 8h compute lifecycle with logical session persistence — but documented behavior is not always real behavior. This spike validates the docs against our actual workload before we lock the architecture.

Scope (in)

  • AWS account hygiene: AWS Organization (Hannah Labs root) + autri-prod sub-account, IAM admin role + MFA, resource tagging strategy (project=autri env=beta cost-bucket=<layer>)
  • AWS account opened under Hannah Labs entity (Hannah Labs Google Workspace admin email, Hannah Labs account name); Dan's personal card as temporary billing method, Mercury card replaces in ~1 week
  • AWS Activate Founders application submitted in parallel (Hannah Labs account; $1k credits)
  • Google Cloud Console OAuth app registered under Hannah Labs Google Workspace (for Cognito Google federation)
  • Cognito user pool with Google federated identity provider
  • Hello-world MCP server (arm64 container per AgentCore packaging spec; listens on 0.0.0.0:8000/mcp) deployed to AgentCore Runtime — single tool returning "hello world"
  • Stub-auth fallback prepared (env var AGENTCORE_AUTH=none skips JWT validation) — used if Cognito integration debugging threatens the half-day budget; swap to real Cognito once container + transport validated
  • Streamable HTTP transport validated end-to-end via Claude Desktop
  • Cognito OAuth flow: app authenticates user → token issued → token validates against MCP endpoint (against mcp.autri.ai only; cross-subdomain SSO with app.autri.ai deferred to EPIC-4)
  • AgentCore endpoint URL stability verified (custom-domain mapping OR stable ARN-based endpoint) before pointing Cloudflare DNS
  • Cost telemetry setup: tags, CloudWatch dashboard, AWS Budgets alerts at $50/$100/$200/mo thresholds, Cost Anomaly Detection daily email
  • Bedrock model-access approval submitted (Sonnet + Haiku) — fire early so the 24-48h clock starts
  • Claude Desktop reconnect behavior at the 8h compute boundary — tested via forced microVM swap (redeploy AgentCore container mid-session), NOT by waiting 8h

Out of scope

  • Production MCP server logic (just hello-world; real server is EPIC-3)
  • Library/connector schema (EPIC-2)
  • Amplify deploy of the Next.js app (EPIC-4)
  • Beta user onboarding (EPIC-5)

Dependencies

  • Dan creates the Hannah Labs AWS account if it does not already exist (estimated 15 min)
  • mcp.autri.ai subdomain available in Cloudflare DNS (already controlled)
  • Local dev environment functional (already in place)

Deliverables

  • Hello-world MCP server live at mcp.autri.ai on AgentCore Runtime
  • Working Cognito user pool with Google federation
  • Spike findings written up at projects/autri/epics/epic-1-spike-findings.md (~half page)
  • Explicit go/no-go decision: "Stack B locked" OR "Fallback to Stack A — blocker is X"
  • Budgets alerts armed in production AWS account
  • Bedrock model-access approval request submitted

Implementation plan (half day)

Step 0 — Pre-spike (do before sitting down for the half-day; can be done async over a few days)

  • Open AWS account under Hannah Labs (Workspace admin email, account name = "Hannah Labs"); enable MFA on root
  • Submit AWS Activate Founders application (Hannah Labs account; ~5 min)
  • Register Google Cloud Console OAuth app under Hannah Labs Workspace (~15 min)
  • Start Mercury business account application (~10 min; physical/virtual card arrives in ~3-5 business days; update AWS billing card when it does)

Step 1 — AWS account hygiene (30 min)

  • Enable AWS Organization at Hannah Labs root account
  • Create autri-prod sub-account
  • IAM admin role for Dan in autri-prod, MFA enforced
  • Resource tagging policy documented (project=autri env=beta cost-bucket=<layer>)
  • AWS Budgets at $50/$100/$200/mo thresholds
  • Request ACM cert for mcp.autri.ai FIRST — the DNS validation CNAME needs to propagate through Cloudflare. Add the validation CNAME to Cloudflare immediately so the cert validates in the background while later steps proceed.
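
The Budgets bullet above could be captured as a request payload, so the thresholds survive the later CDK rebuild. This is a minimal sketch, assuming a single monthly cost budget with absolute-dollar alerts; the budget name, account ID, and email address are hypothetical placeholders, and the dict is shaped for boto3's `budgets.create_budget` call without actually invoking it.

```python
from typing import Any

THRESHOLDS_USD = (50, 100, 200)  # alert levels taken from the plan


def build_budget_request(account_id: str, email: str) -> dict[str, Any]:
    """Build keyword arguments for boto3's budgets.create_budget (not called here)."""
    notifications = [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": float(threshold),
                "ThresholdType": "ABSOLUTE_VALUE",  # dollars, not percent of limit
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for threshold in THRESHOLDS_USD
    ]
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": "autri-beta-monthly",  # hypothetical name
            "BudgetLimit": {"Amount": str(max(THRESHOLDS_USD)), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": notifications,
    }
```

Passing the result as `client.create_budget(**build_budget_request(...))` would be the intended use; the limit is set to the highest threshold so the last alert coincides with the budget itself.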

Step 2 — Cognito setup (30 min)

  • User pool in us-east-1
  • Google federated IdP configured (using OAuth app registered in Step 0)
  • OAuth resource server: mcp.autri.ai with custom scopes
  • Hosted UI URL noted
  • Prepare stub-auth fallback: one env var (AGENTCORE_AUTH=none) in the hello-world server that skips JWT validation — used only if Cognito integration debugging blows the budget in Step 4
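
The stub-auth toggle and the claim checks it bypasses might look roughly like the sketch below. It is illustrative only: the user pool ID and the custom scope name are hypothetical, and the token payload is decoded WITHOUT signature verification, which a real deployment would have to add by checking against the pool's JWKS.

```python
import base64
import json
import os
import time
from typing import Mapping, Optional

REGION = "us-east-1"                    # from Step 2
USER_POOL_ID = "us-east-1_EXAMPLE"      # hypothetical pool ID
REQUIRED_SCOPE = "mcp.autri.ai/invoke"  # hypothetical custom scope


def auth_disabled(env: Mapping[str, str] = os.environ) -> bool:
    """True when the stub fallback (AGENTCORE_AUTH=none) is active."""
    return env.get("AGENTCORE_AUTH", "").lower() == "none"


def decode_claims(token: str) -> dict:
    """Decode the JWT payload segment. Does NOT verify the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def claims_problem(claims: dict, now: Optional[float] = None) -> Optional[str]:
    """Return a reason string if the claims are unacceptable, else None."""
    now = time.time() if now is None else now
    expected_iss = f"https://cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}"
    if claims.get("iss") != expected_iss:
        return "unexpected issuer"
    if claims.get("token_use") != "access":
        return "not an access token"
    if REQUIRED_SCOPE not in claims.get("scope", "").split():
        return "missing scope"
    if claims.get("exp", 0) <= now:
        return "token expired"
    return None


def authorize(token: Optional[str], env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return None when the request may proceed, else the rejection reason."""
    if auth_disabled(env):
        return None  # stub fallback: skip all validation
    if not token:
        return "missing bearer token"
    return claims_problem(decode_claims(token))
```

Keeping the toggle in one function means the swap back to real Cognito in Step 4 is a single environment change rather than a code edit.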

Step 3 — AgentCore Runtime deploy (1-2 hours)

  • Containerize hello-world MCP server: arm64, Node or Python, Streamable HTTP per AgentCore protocol contract, listens on 0.0.0.0:8000/mcp, single tool returning "hello world"
  • Push container to ECR
  • Configure AgentCore Runtime: container image, Cognito as OAuth issuer (or stub-auth if needed), ACM cert for mcp.autri.ai
  • Verify AgentCore endpoint URL stability (custom-domain mapping OR stable ARN endpoint) before next step — if endpoint changes per redeploy, surface as blocker
  • Cloudflare DNS: CNAME mcp.autri.ai → AgentCore endpoint
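
For orientation, the hello-world server described in this step could be sketched with only the Python standard library, as below. This is not a complete MCP implementation and not necessarily how the spike will build it (an MCP SDK would normally handle the protocol); the tool name `hello`, the server name, and the protocol version string are assumptions. It answers plain JSON only and never opens an SSE stream.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional

PROTOCOL_VERSION = "2025-03-26"  # assumption: an MCP revision defining Streamable HTTP
HELLO_TOOL = {
    "name": "hello",  # hypothetical tool name
    "description": "Returns a fixed greeting.",
    "inputSchema": {"type": "object", "properties": {}},
}


def _error(msg_id, code: int, message: str) -> dict:
    return {"jsonrpc": "2.0", "id": msg_id, "error": {"code": code, "message": message}}


def handle_rpc(msg: dict) -> Optional[dict]:
    """Map one JSON-RPC message to its reply; notifications (no id) get None."""
    if "id" not in msg:
        return None
    method, params = msg.get("method"), msg.get("params") or {}
    if method == "initialize":
        result = {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {"tools": {}},
            "serverInfo": {"name": "hello-world", "version": "0.0.1"},
        }
    elif method == "tools/list":
        result = {"tools": [HELLO_TOOL]}
    elif method == "tools/call":
        if params.get("name") != HELLO_TOOL["name"]:
            return _error(msg["id"], -32602, "Unknown tool")
        result = {"content": [{"type": "text", "text": "hello world"}], "isError": False}
    else:
        return _error(msg["id"], -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": msg["id"], "result": result}


class McpHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/mcp":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        reply = handle_rpc(json.loads(self.rfile.read(length)))
        if reply is None:  # notification: acknowledge with no body
            self.send_response(202)
            self.end_headers()
            return
        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Listen on 0.0.0.0:8000, the address named in the plan."""
    HTTPServer(("0.0.0.0", 8000), McpHandler).serve_forever()
```

Calling `serve()` as the container entrypoint would start the listener; keeping `handle_rpc` separate from the HTTP handler lets the tool logic be checked without a network round trip.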

Step 4 — Claude Desktop validation (30 min)

  • Add MCP connector config to Claude Desktop pointing at https://mcp.autri.ai
  • Authenticate via Cognito OAuth flow (or skip auth if running stub fallback)
  • Invoke the hello-world tool
  • Verify Streamable HTTP transport (inspect response headers; verify text/event-stream content-type if streaming)
  • If running stub fallback: swap AGENTCORE_AUTH=none → real Cognito, repeat Claude Desktop test, debug auth layer in isolation
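
The header inspection in this step can be made repeatable with a small helper. The sketch below assumes the response headers and body have already been captured (for example from a proxy log or a manual POST); the function names are illustrative. Under Streamable HTTP a POST may be answered with either a single JSON body or an SSE stream, so both count as valid outcomes.

```python
from typing import Mapping


def classify_transport(headers: Mapping[str, str]) -> str:
    """Classify a response by its Content-Type (header names are case-insensitive)."""
    ctype = ""
    for name, value in headers.items():
        if name.lower() == "content-type":
            ctype = value.split(";")[0].strip().lower()
    if ctype == "text/event-stream":
        return "sse-stream"
    if ctype == "application/json":
        return "single-json"
    return "unexpected"


def sse_data_events(body: str) -> list[str]:
    """Extract the data payload of each complete SSE event in a captured body."""
    events: list[str] = []
    data_lines: list[str] = []
    for line in body.splitlines():
        if line.startswith("data:"):
            value = line[5:]
            # SSE strips at most one leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_lines:
            events.append("\n".join(data_lines))  # blank line ends the event
            data_lines = []
    # An event with no terminating blank line is incomplete and is dropped.
    return events
```

Each string returned by `sse_data_events` should parse as one JSON-RPC message, which gives a quick check that the stream carries the tool result rather than an error page.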

Step 5 — Cost telemetry setup (30 min)

  • CloudWatch dashboard with AgentCore session metrics (vCPU-seconds, GB-hours)
  • Cost Explorer view filtered by project=autri
  • Budgets alerts verified firing on a test trigger
  • Cost Anomaly Detection email subscription confirmed
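
The tagging strategy and the Cost Explorer view filtered by project=autri could be expressed as follows. This is a sketch under assumptions: the cost-bucket layer names are hypothetical (the plan leaves `<layer>` open), and the query dict is shaped for boto3's `ce.get_cost_and_usage` without calling it. Tags generally need to be activated as cost allocation tags in Billing before Cost Explorer can filter on them.

```python
from datetime import date
from typing import Any

# Hypothetical layer names; the plan only specifies cost-bucket=<layer>.
ALLOWED_COST_BUCKETS = {"runtime", "auth", "data", "web"}


def resource_tags(cost_bucket: str) -> dict[str, str]:
    """Return the three tags every spike resource should carry."""
    if cost_bucket not in ALLOWED_COST_BUCKETS:
        raise ValueError(f"unknown cost bucket: {cost_bucket!r}")
    return {"project": "autri", "env": "beta", "cost-bucket": cost_bucket}


def cost_query(start: date, end: date) -> dict[str, Any]:
    """Build keyword arguments for boto3's ce.get_cost_and_usage (not called here)."""
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Tags": {"Key": "project", "Values": ["autri"]}},
        "GroupBy": [{"Type": "TAG", "Key": "cost-bucket"}],
    }
```

Grouping by `cost-bucket` is a design choice that makes the per-layer split visible in the same query that isolates the project.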

Step 6 — 8h reconnect test via forced microVM swap (30 min)

  • Open a Claude Desktop session against AgentCore, confirm a tool call works
  • Redeploy the AgentCore container (forces a new microVM with the same Mcp-Session-Id)
  • Observe Claude Desktop behavior: does it reconnect transparently, or see a hard disconnect requiring a fresh handshake?
  • Record finding (this is the most uncertain client-behavior question per the spike's premise — don't skip)
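
To keep the finding unambiguous, the observation could be recorded in a fixed shape and mapped to one of three outcomes. The sketch below is one possible classification, not an established taxonomy; the field names and outcome labels are assumptions. For context, the MCP Streamable HTTP transport allows a server to answer 404 for a session ID it no longer recognizes, after which the client is expected to start a new session with a fresh initialize request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    status: Optional[int]        # HTTP status of the first request after redeploy; None if the connection failed
    client_reinitialized: bool   # a new initialize request was observed afterwards
    user_action_needed: bool     # the user had to reconnect or re-authenticate manually


def classify_reconnect(probe: ProbeResult) -> str:
    """Map an observed redeploy probe to an outcome label for the findings doc."""
    succeeded = probe.status is not None and 200 <= probe.status < 300
    if succeeded and not probe.client_reinitialized and not probe.user_action_needed:
        return "transparent"
    if probe.client_reinitialized and not probe.user_action_needed:
        return "automatic re-handshake"
    return "hard disconnect"
```

Only the last outcome would count against Stack B in the go/no-go decision; the first two differ mainly in whether session state held by the client survives the swap.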

Step 7 — Documentation (30 min)

  • Write spike-findings doc at projects/autri/epics/epic-1-spike-findings.md
  • Record the lock/fallback decision
  • Note IaC strategy for EPIC-4: console-click spike artifacts will be torn down and rebuilt in CDK during EPIC-4; spike artifacts are throwaway by design

Risks

  • AgentCore Runtime + Cognito OAuth config first-time setup may consume the half-day budget. Mitigation: stub-auth fallback (AGENTCORE_AUTH=none) prepared in Step 2 so AgentCore deploy + transport validation can proceed independently of Cognito. Real Cognito swap happens after stub-validated baseline.
  • ACM cert DNS validation through Cloudflare is the slow part, not the cert request itself. Mitigation: request cert in Step 1 + immediately add the validation CNAME to Cloudflare DNS so propagation happens in parallel with later work.
  • AgentCore endpoint URL stability — if the AgentCore endpoint changes per redeploy, we can't point Cloudflare at it cleanly. Mitigation: verify in Step 3 BEFORE configuring DNS; if endpoint isn't stable, surface as a blocker or use AgentCore custom-domain mapping if available.
  • Bedrock model-access approval is async (24-48h). Submit at start of Step 0 to start the clock; not blocking the spike itself.
  • Mercury card arrival (~3-5 business days) doesn't block EPIC-1; the AWS billing card can be updated at any time. Dan's personal card on AWS during the bridge period is acceptable since account ownership = Hannah Labs from day one.
  • Stack A fallback path: if AgentCore reveals an unrecoverable blocker, fall back to all-Fargate (Stack A from aws-infra-options.md) without architectural re-design.
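
The endpoint-stability risk above reduces to comparing the endpoint URL recorded after each deploy. A minimal sketch, assuming the URLs are copied by hand from the console after each redeploy; the function names are illustrative.

```python
from urllib.parse import urlsplit


def endpoint_identity(url: str) -> tuple[str, str, str]:
    """Normalize a URL to the parts that matter for DNS and client config."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), (parts.hostname or "").lower(), parts.path.rstrip("/"))


def endpoint_is_stable(observed: list[str]) -> bool:
    """True only if at least one URL was recorded and all normalize to the same identity."""
    return len({endpoint_identity(url) for url in observed}) == 1
```

An empty list returns False on purpose: no observations is not evidence of stability, so the DNS step should wait until at least two deploys have been compared.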

Definition of done

Pre-spike (Step 0):

  • Hannah Labs AWS account created (entity ownership, not personal)
  • AWS Activate Founders application submitted
  • Google Cloud Console OAuth app registered under Hannah Labs Workspace
  • Mercury business account application started

Spike (Steps 1-7):

  • AWS Organization + autri-prod sub-account
  • IAM admin role with MFA configured
  • Resource tagging strategy applied
  • ACM cert for mcp.autri.ai requested + Cloudflare DNS validation CNAME added
  • Cognito user pool live with Google federation working
  • Stub-auth fallback prepared (AGENTCORE_AUTH=none)
  • Hello-world MCP server (arm64, Streamable HTTP, 0.0.0.0:8000/mcp) deployed to AgentCore Runtime
  • AgentCore endpoint URL stability verified before Cloudflare DNS pointed
  • Claude Desktop connects successfully via Streamable HTTP
  • Cognito OAuth flow validated end-to-end against mcp.autri.ai (cross-subdomain SSO test deferred to EPIC-4)
  • CloudWatch dashboard shows AgentCore session metrics
  • AWS Budgets alerts armed at $50/$100/$200/mo
  • Cost Anomaly Detection email subscription active
  • Bedrock model-access approval submitted
  • 8h reconnect behavior tested via forced microVM swap
  • Spike findings documented
  • Explicit Stack B / Stack A decision recorded
  • IaC strategy for EPIC-4 noted in findings doc (rebuild in CDK)

Notes / open questions

Locked:

  • Container architecture: arm64 (per the AgentCore packaging spec; also the cheaper option)
  • IaC strategy: console-click spike, tear down + rebuild in CDK for EPIC-4
  • Cross-subdomain Cognito SSO test (app.autri.ai ↔ mcp.autri.ai) moved to EPIC-4 (since app.autri.ai doesn't exist until then)
  • AWS root-user MFA: 1Password authenticator for v1 (Dan already uses 1Password); hardware key deferred to post-beta when revenue justifies

Still open (to validate empirically during spike):

  • Does AgentCore Runtime offer a stable custom-domain mapping, or does endpoint URL change per redeploy? (Verify in Step 3 before DNS)
  • If Cognito hosted UI feels too ugly for the spike, document as a v1.1 fix-up; not a blocker.
  • AWS account billing card swap to Mercury — likely happens ~5 days after spike, parallel to EPIC-2/EPIC-3 work
