EPIC-1 Spike Findings: AgentCore Runtime POC

Written 2026-05-21 end-of-spike. Decision: Stack B LOCKED. AgentCore Runtime validated end-to-end against our actual production stack (Node/TS + @modelcontextprotocol/sdk@1.29 + Cognito JWKS).


Spike artifacts (resource references)

Operational findings (real surprises worth capturing)

Decision

Stack B is locked. Proceed with EPIC-4 production deploy targeting AgentCore Runtime + Amplify + Fargate Tasks + RDS + Cognito.

Stack A (all-Fargate fallback) is no longer needed. AgentCore Runtime works for our use case with documented operational findings (below).


What was validated (the wedge gate, empirically)

Components validated:

  • AWS Organization + autri-prod sub-account with centralized root access
  • IAM admin user + MFA (1Password TOTP) in both management and autri-prod
  • ACM cert for mcp.autri.ai (Cloudflare DNS validation)
  • AWS Budgets ($50/$100/$200/forecast $220) at management account
  • Cost Anomaly Detection ($10 daily threshold)
  • Cognito user pool + Google federated IdP
  • Cognito hosted UI (autri-auth.auth.us-east-1.amazoncognito.com)
  • OAuth Resource Server (mcp.autri.ai with mcp.invoke scope)
  • App client autri-mcp-client (public, code flow + PKCE)
  • @modelcontextprotocol/sdk@1.29 Streamable HTTP transport on Node 22 arm64
  • Defense-in-depth CognitoJwksAuth pattern (server re-validates the JWT that AgentCore already validated)
  • AgentCore Runtime deployment (MCP serverProtocol)
  • Cognito JWT → AgentCore customJWTAuthorizer validation
  • Tool call (hello) end-to-end with real Cognito JWT
  • CloudWatch logging + metrics dashboard
  • MicroVM swap test: zero downtime across rollover

End-to-end proof:

$ curl -X POST <agentcore-url> -H "Authorization: Bearer <cognito-jwt>" -d '{tools/call hello}'
→ "Hello, EPIC-1 spike end-to-end! Autri spike MCP server is alive on AgentCore Runtime.
   Authenticated sub: 6u1mitro1km7h6qjt4l0t23fpd. Scope: mcp.autri.ai/mcp.invoke."
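
The curl call above abbreviates the request body as {tools/call hello}. A minimal sketch of the full JSON-RPC 2.0 envelope and headers such a call would carry over Streamable HTTP follows. The helper names (buildToolCall, buildInvokeHeaders) are hypothetical, and the arguments the spike actually passed to hello are not recorded here, so they default to an empty object.

```typescript
// JSON-RPC 2.0 envelope for an MCP tools/call request.
interface ToolCallEnvelope {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Build the request body for calling a named tool.
// toolArgs is an assumption: the spike's real arguments to `hello` are unknown.
function buildToolCall(
  toolName: string,
  toolArgs: Record<string, unknown> = {},
  id = 1,
): ToolCallEnvelope {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: toolArgs },
  };
}

// Headers for a Streamable HTTP POST carrying a bearer token.
function buildInvokeHeaders(jwt: string): Record<string, string> {
  return {
    Authorization: `Bearer ${jwt}`,
    "Content-Type": "application/json",
    // Streamable HTTP clients accept either a JSON body or an SSE stream.
    Accept: "application/json, text/event-stream",
  };
}
```

JSON.stringify(buildToolCall("hello")) gives the string to pass to curl's -d flag.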

F1 — AgentCore strips Authorization header by default

Out of the box, AgentCore Runtime's customJWTAuthorizer consumes the Authorization header and does NOT forward it to the container. Our container's defense-in-depth Cognito JWKS validation fails with 401 "Missing Authorization header" until you explicitly add Authorization to requestHeaderConfiguration.requestHeaderAllowlist.

Implication for EPIC-3: Defense-in-depth pattern requires explicit header-forwarding config. Trade-off: per-request JWT validation cost in the container (~1ms via cached JWKS) for finer-grained authz logic (scope checks, custom claim handling).

Recommendation: Keep defense-in-depth on for production. AgentCore's gateway-level validation is coarse (issuer + audience + scope + clients); container-level validation adds scope/claim flexibility for the connector-ID-in-JWT pattern (see F4).
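
To make the failure mode concrete, here is a minimal sketch of the container-side checks that sit around JWKS validation: pulling the bearer token from the forwarded headers, and a scope check on already-verified claims. The function names are hypothetical and this is not the spike's CognitoJwksAuth code; signature verification itself is out of scope here.

```typescript
// Result of the container-side header check: the raw token or a 401-style error.
type BearerResult =
  | { ok: true; token: string }
  | { ok: false; status: 401; message: string };

// Pull the bearer token out of the forwarded headers (case-insensitive lookup).
function extractBearer(headers: Record<string, string | undefined>): BearerResult {
  let value: string | undefined;
  for (const key of Object.keys(headers)) {
    if (key.toLowerCase() === "authorization") value = headers[key];
  }
  if (!value) {
    // This is what the container sees when Authorization is not on the allowlist.
    return { ok: false, status: 401, message: "Missing Authorization header" };
  }
  const prefix = "bearer ";
  const token = value.slice(prefix.length).trim();
  if (value.slice(0, prefix.length).toLowerCase() !== prefix || token === "") {
    return { ok: false, status: 401, message: "Malformed Authorization header" };
  }
  return { ok: true, token };
}

// Cognito access tokens carry scopes as one space-delimited string in `scope`.
function hasScope(claims: { scope?: string }, required: string): boolean {
  return (claims.scope ?? "").split(" ").indexOf(required) !== -1;
}
```

With the allowlist missing, extractBearer({}) reproduces the 401 described above; once the header is forwarded, hasScope(claims, "mcp.autri.ai/mcp.invoke") is the kind of finer-grained check the container can add.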

F2 — Cognito doesn't expose RFC 8414 OAuth metadata

https://cognito-idp.us-east-1.amazonaws.com/{user-pool-id}/.well-known/oauth-authorization-server returns HTTP 400 (BadRequest). Cognito only exposes /.well-known/openid-configuration (OIDC discovery).

The MCP spec OAuth flow (2025-06+ and current Claude.ai Custom Connector implementation) requires RFC 8414 metadata. Both mcp-remote and Claude.ai's backend Custom Connector failed at the OAuth discovery step:

  • mcp-remote: HTTP 404: Invalid OAuth error response: ... Invalid api path
  • Claude.ai Custom Connector: Authorization with the MCP server failed. ofid_87082e1930838e50

Implication for EPIC-3: Need an OAuth metadata proxy or alternative pattern. Options ranked:

Option | Effort | Trade-off
Lambda + API Gateway proxy that transforms OIDC → RFC 8414 metadata | ~1-2 days | Stays with Cognito; minimal infra
Migrate to AWS AgentCore Identity (newer service) | unknown / spike-needed | Purpose-built but new
Replace Cognito hosted UI with custom Express+jose OAuth server | ~3-5 days | Most flexible, most ops surface
Drop Cognito for Auth0/Okta | ~1-2 days | Vendor switch; cost adds up

Recommendation: RFC 8414 proxy via Lambda. Cheapest, preserves existing Cognito investment, fits CDK rebuild for EPIC-4.
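
The core of the recommended proxy is a small transformation from Cognito's OIDC discovery document to RFC 8414 authorization server metadata. A minimal sketch, assuming the Lambda has already fetched /.well-known/openid-configuration: the function name is hypothetical, and the grant types and PKCE method it advertises are assumptions about what the production app client will allow, not values observed in the spike.

```typescript
// Subset of the OIDC discovery document that the transformation reads.
interface OidcDiscoveryDoc {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  jwks_uri: string;
  response_types_supported?: string[];
  scopes_supported?: string[];
}

// RFC 8414 metadata fields emitted by the proxy.
interface AuthServerMetadata {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  jwks_uri: string;
  response_types_supported: string[];
  grant_types_supported: string[];
  code_challenge_methods_supported: string[];
  scopes_supported: string[];
}

function toAuthServerMetadata(
  oidc: OidcDiscoveryDoc,
  extraScopes: string[] = [],
): AuthServerMetadata {
  // Merge discovered scopes with custom resource-server scopes, dropping duplicates.
  const merged = (oidc.scopes_supported ?? []).concat(extraScopes);
  const scopes = merged.filter((s, i) => merged.indexOf(s) === i);
  return {
    issuer: oidc.issuer,
    authorization_endpoint: oidc.authorization_endpoint,
    token_endpoint: oidc.token_endpoint,
    jwks_uri: oidc.jwks_uri,
    response_types_supported: oidc.response_types_supported ?? ["code"],
    // Assumption: code flow with refresh tokens, PKCE using S256.
    grant_types_supported: ["authorization_code", "refresh_token"],
    code_challenge_methods_supported: ["S256"],
    scopes_supported: scopes,
  };
}
```

Passing ["mcp.autri.ai/mcp.invoke"] as extraScopes would advertise the custom scope alongside the standard OIDC ones. The proxy is also the natural place for the structured logging mentioned in F7.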

F3 — AgentCore Runtime IDs are auto-suffixed; URL not stable across recreations

The runtime ARN is arn:aws:bedrock-agentcore:us-east-1:878013574001:runtime/autri_spike_agentcore-knKc5zFWhk — the -knKc5zFWhk is an auto-generated suffix. Delete + recreate gives a different suffix → different URL. The path-encoded ARN in the invocation URL means standard CNAMEs can't proxy it.

Implication for EPIC-4: Custom domain (mcp.autri.ai) requires a path-rewriting proxy. CloudFront with origin path policy is the natural fit (ACM cert in us-east-1 we provisioned plugs in directly). Alternative: Cloudflare Worker.

Validated within a single runtime lifetime: the URL is stable across container image updates and env-var changes. Only creating a new runtime generates a fresh ID.
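
To show why a plain CNAME cannot front this endpoint, here is a sketch of how the invocation URL is derived from the runtime ARN. It assumes the <encoded-arn> segment is the percent-encoded ARN; the function names are hypothetical. Whatever proxy serves mcp.autri.ai needs to produce this path and query for each incoming request.

```typescript
// Path and query that the proxy must target for a given runtime ARN.
// Assumption: the ARN is percent-encoded into a single path segment.
function invocationPath(runtimeArn: string, qualifier = "DEFAULT"): string {
  const arnSegment = encodeURIComponent(runtimeArn);
  return `/runtimes/${arnSegment}/invocations?qualifier=${encodeURIComponent(qualifier)}`;
}

// Full regional endpoint URL for the runtime.
function invocationUrl(region: string, runtimeArn: string, qualifier = "DEFAULT"): string {
  return `https://bedrock-agentcore.${region}.amazonaws.com${invocationPath(runtimeArn, qualifier)}`;
}
```

Because the suffix is baked into the ARN, recreating the runtime changes the encoded segment, so the proxy's target should come from configuration rather than being hard-coded.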

F4 — Single AgentCore URL ≠ existing path-based connector scheme

Autri's current MCP server uses /c/:connectorId/mcp (path-based per-connector routing — see autri/mcp-servers/doc-search/src/server.ts). AgentCore Runtime exposes ONE URL per runtime (no path param support).

Implication for EPIC-3: Need a new routing pattern. Options ranked:

Option | Notes
Connector ID as JWT custom claim | Cleanest. Mint-token flow adds connector_id claim; server reads from token. Preserves per-request scope binding.
Connector ID as custom HTTP header (X-Autri-Connector-Id) | Works but separates auth from connector identity. Allowlist required.
One AgentCore runtime per connector | Operationally heavy at scale.
Encode connector ID in OAuth scope (mcp.autri.ai/c-{uuid}/invoke) | Weird, doesn't scale, breaks scope semantics.

Recommendation: Custom JWT claim. Aligns with Cognito's app-client / pre-token-generation Lambda trigger pattern.
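
On the server side, the recommended option replaces the :connectorId path parameter with a read from the verified token. A minimal sketch, assuming the claim is named connector_id (as in the table above) and that connector IDs are UUIDs; the function name and the UUID check are illustrative, not taken from server.ts.

```typescript
// Claims the server cares about after signature verification.
interface VerifiedAccessClaims {
  sub: string;
  scope?: string;
  connector_id?: unknown;
}

// Assumption: connector IDs are UUIDs.
const CONNECTOR_UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Return the connector ID bound to this token, or throw if it is absent or malformed.
function connectorIdFromClaims(claims: VerifiedAccessClaims): string {
  const raw = claims.connector_id;
  if (typeof raw !== "string" || raw === "") {
    throw new Error("Token has no connector_id claim");
  }
  if (!CONNECTOR_UUID.test(raw)) {
    throw new Error("connector_id claim is not a valid UUID");
  }
  return raw.toLowerCase();
}
```

Rejecting tokens without the claim, rather than falling back to a default connector, keeps the per-request binding between identity and connector explicit.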

F5 — MicroVM swaps are transparent for stateless workloads

5 PRE-swap + 5 DURING-swap + 5 POST-swap curl calls all returned HTTP 200. AgentCore drains old microVMs while new ones spin up; the client sees zero observable disruption. This validates D34's "stateless + JSON responses" choice — no client-side reconnect logic needed for our pattern.

Implication: EPIC-3 can confidently stay stateless + JSON for v1. Re-evaluate stateful mode only if a real use case surfaces (multi-step tool sessions, streaming progress notifications mid-tool).

F6 — AgentCore can serve traffic during UPDATING state

Contrary to expectation, AgentCore continued serving requests via existing microVMs while the runtime's status was UPDATING. Blue/green rollover is built in; no maintenance window is needed for env var or image updates.

F7 — Cognito user pool isn't tracked by CloudTrail for /oauth2/token calls

CloudTrail captures management-plane API calls (CreateUserPool, etc.) but NOT user-facing OAuth endpoint calls (/oauth2/authorize, /oauth2/token). Debugging OAuth issues requires either direct response inspection or enabling Cognito Advanced Security (paid feature).

Implication: For EPIC-3 OAuth debugging, the metadata proxy (F2) gives us a chokepoint where we can add structured logging.

F8 — Container-side per-call logging gap

Our spike server didn't log per-request events (only startup messages). This made debugging the F1 Authorization-strip issue harder than necessary.

Implication for EPIC-3: Add a request log line (method, sub, tool name, result status, latency) to the MCP server. Autri's existing server.ts already records this via audit log writes, but stdout logging should be added as well for CloudWatch visibility.
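
A minimal sketch of such a log line, emitted as one JSON object per request so CloudWatch Logs can filter on fields. The field names and the formatRequestLog helper are hypothetical; only the set of fields (method, sub, tool, status, latency) comes from the finding above.

```typescript
// One structured entry per MCP request.
interface RequestLogEntry {
  method: string;          // JSON-RPC method, e.g. "tools/call"
  sub: string;             // authenticated subject from the JWT
  tool: string | null;     // tool name, or null for non-tool methods
  status: "ok" | "error";  // result status
  latencyMs: number;       // wall-clock handling time
}

// Serialize an entry as a single JSON line with a timestamp prepended.
function formatRequestLog(entry: RequestLogEntry, now: Date = new Date()): string {
  return JSON.stringify({ ts: now.toISOString(), ...entry });
}
```

The server would write the returned string to stdout once per request, after the handler resolves or rejects.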


EPIC-3 follow-up tasks (compounded from spike)

These are concrete EPIC-3 tasks the spike surfaced. EPIC-3's local-wedge-gate goal was already met last session; this list is the AgentCore-readiness pass that bridges EPIC-3's local server to EPIC-4's production deploy.

  1. Swap HS256DevAuth → CognitoJwksAuth in autri/mcp-servers/doc-search/src/auth.ts. Spike's CognitoJwksAuth class drops in via the existing AuthVerifier interface. Add buildAuthVerifier() switching on AGENTCORE_AUTH=cognito|hs256-dev env var. (~30 min)
  2. Add Dockerfile to autri/mcp-servers/doc-search/. Lift spike's multi-stage arm64 pattern; adjust for pnpm workspace deps (@autri/retrieval, @autri/db). (~30 min)
  3. Implement connector-ID-in-JWT pattern (F4). Update dev:make-token to include connector_id claim; update server to read from token instead of path. Add Cognito Pre-Token-Generation Lambda for production. (~1-2 hrs design + implementation)
  4. OAuth metadata proxy for Cognito (F2). Lambda + API Gateway that exposes RFC 8414 metadata fronting Cognito. Done when it passes against mcp-remote and the Claude.ai Custom Connector. (~1-2 days)
  5. Add stdout request logging to MCP server (F8). One line per request: method, sub, tool, status, latency. (~30 min)
  6. PKCE enforcement on production app client. Current autri-mcp-client allows code flow without PKCE; lock down for production. (~10 min)

Total: ~2-3 days of focused work to make the existing MCP server AgentCore-deployable.
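
Task 1's buildAuthVerifier() could look roughly like the sketch below. The AuthVerifier shape shown here is a guess at the existing interface, and the two verifier implementations are injected as factories because their constructors are not described in this document. Failing on an unset or unknown AGENTCORE_AUTH value, instead of defaulting to a mode, is a design choice made for this sketch.

```typescript
type AuthMode = "cognito" | "hs256-dev";

// Hypothetical shape of the existing AuthVerifier interface.
interface AuthVerifier {
  readonly mode: AuthMode;
  verify(token: string): Promise<{ sub: string; scope: string }>;
}

// One factory per mode, e.g. wrapping CognitoJwksAuth and HS256DevAuth.
type VerifierFactories = Record<AuthMode, () => AuthVerifier>;

// Select the verifier from the AGENTCORE_AUTH env var; fail closed on anything else.
function buildAuthVerifier(
  env: Record<string, string | undefined>,
  factories: VerifierFactories,
): AuthVerifier {
  const raw = env["AGENTCORE_AUTH"];
  if (raw !== "cognito" && raw !== "hs256-dev") {
    throw new Error(`AGENTCORE_AUTH must be "cognito" or "hs256-dev", got ${String(raw)}`);
  }
  return factories[raw]();
}
```

In the server entry point this would be called once at startup with process.env, so a misconfigured container fails on boot, not on the first request.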

Open question: how to slice this work — amendment to EPIC-3, new EPIC-3.5, or folded into EPIC-4 since it's all production-deploy prep? Lean: fold into EPIC-4 since EPIC-3 already met its locally-defined goal (wedge gate passed last session) and the new work is fundamentally production-deploy.


EPIC-4 follow-up tasks (new from spike)

  1. CloudFront in front of AgentCore Runtime for mcp.autri.ai custom domain (F3). Origin path policy rewrites requests to /runtimes/{encoded-arn}/invocations?qualifier=DEFAULT. ACM cert in us-east-1 already provisioned.
  2. CDK module for AgentCore Runtime with our config pattern: requestHeaderConfiguration.requestHeaderAllowlist=["Authorization"], customJWTAuthorizer.allowedClients + allowedScopes, env vars for auth mode + Cognito issuer.
  3. Tear down spike artifacts before EPIC-4 starts: AgentCore Runtime, ECR repo, IAM role, M2M test client (autri-spike-m2m-test). Keep: user pool, hosted UI domain, primary app client, ACM cert.
  4. Migrate dev secrets to Parameter Store SecureString (Anthropic API key, JWT signing keys for non-Cognito paths). Cognito-stored Google OAuth secret stays in Cognito.
  5. Production Cognito callback URLs: replace the spike's localhost entries with production-only callbacks; keep https://claude.ai/api/mcp/auth_callback etc.
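
The config pattern in task 2 can be captured as a plain object that the CDK module takes as input. This is a sketch under assumptions: the nesting is illustrative and may not match the AgentCore API or the eventual construct props, discoveryUrl is included because the authorizer needs to locate Cognito's OIDC configuration, and COGNITO_ISSUER is a hypothetical env var name. Only requestHeaderAllowlist, allowedClients, allowedScopes, and AGENTCORE_AUTH come from this document.

```typescript
// Illustrative shape of the per-runtime auth config (not the literal API schema).
interface RuntimeAuthConfig {
  requestHeaderConfiguration: { requestHeaderAllowlist: string[] };
  customJWTAuthorizer: {
    discoveryUrl: string;
    allowedClients: string[];
    allowedScopes: string[];
  };
  environment: Record<string, string>;
}

function buildRuntimeAuthConfig(opts: {
  region: string;
  userPoolId: string;
  clientIds: string[];
  scopes: string[];
}): RuntimeAuthConfig {
  // Cognito user pool issuer; OIDC discovery lives under it.
  const issuer = `https://cognito-idp.${opts.region}.amazonaws.com/${opts.userPoolId}`;
  return {
    // Forward Authorization so the container can re-validate the JWT (F1).
    requestHeaderConfiguration: { requestHeaderAllowlist: ["Authorization"] },
    customJWTAuthorizer: {
      discoveryUrl: `${issuer}/.well-known/openid-configuration`,
      allowedClients: opts.clientIds,
      allowedScopes: opts.scopes,
    },
    // COGNITO_ISSUER is a hypothetical name for the issuer env var.
    environment: { AGENTCORE_AUTH: "cognito", COGNITO_ISSUER: issuer },
  };
}
```

Keeping the issuer derivation in one place means the gateway-level authorizer and the container-level verifier cannot drift apart on which user pool they trust.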

What's deployed in autri-prod (878013574001)

Resource | Identifier
Cognito User Pool | us-east-1_7YgaDlZlB
Cognito Hosted UI Domain | autri-auth.auth.us-east-1.amazoncognito.com
OAuth Resource Server | mcp.autri.ai
OAuth Custom Scope | mcp.autri.ai/mcp.invoke
Cognito App Client (user-facing, public) | autri-mcp-client / 7o6ieurh03iccad2qncmuqt6qk
Cognito App Client (M2M test, confidential) | autri-spike-m2m-test / 6u1mitro1km7h6qjt4l0t23fpd
Google IdP | Google (federated, mapped attributes: email/sub/given_name/family_name/picture)
ACM Cert | mcp.autri.ai + *.mcp.autri.ai (us-east-1)
ECR Repo | 878013574001.dkr.ecr.us-east-1.amazonaws.com/autri-spike-agentcore
AgentCore Runtime | autri_spike_agentcore-knKc5zFWhk (v5, READY, MCP protocol)
AgentCore Runtime ARN | arn:aws:bedrock-agentcore:us-east-1:878013574001:runtime/autri_spike_agentcore-knKc5zFWhk
AgentCore Runtime Endpoint URL | https://bedrock-agentcore.us-east-1.amazonaws.com/runtimes/<encoded-arn>/invocations?qualifier=DEFAULT
IAM Execution Role | AgentCoreRuntime-AutriSpike
CloudWatch Dashboard | autri-spike-agentcore (us-east-1)
CloudWatch Log Group | /aws/bedrock-agentcore/runtimes/autri_spike_agentcore-knKc5zFWhk-DEFAULT

What's deployed in management account (248094863498)

Resource | Identifier
AWS Organization | o-m23dfb4u9r
Root OU | r-lqwd
autri-prod sub-account | 878013574001 (email: dan+autri-prod@hannahlabs.ai)
AWS Budget | autri-monthly-cost ($200/mo + thresholds $50/$100/$200/forecast $220)
Cost Anomaly Detection | autri-anomaly-daily ($10 threshold)
Centralized Root Access | Enabled (member account root creds removable)

Spike code

Item | Path
Hello-world MCP server (Node/TS) | ~/Documents/Code/autri-spike-agentcore/
Container image | 878013574001.dkr.ecr.us-east-1.amazonaws.com/autri-spike-agentcore:v0.0.1
Auth pattern (transfers to EPIC-3) | ~/Documents/Code/autri-spike-agentcore/src/auth.ts
Dockerfile (transfers to EPIC-3) | ~/Documents/Code/autri-spike-agentcore/Dockerfile

Files to clean up before EPIC-4

  • All spike-only resources in the "What's deployed in autri-prod" table (keep user pool, hosted UI domain, primary app client, ACM cert; remove the rest)
  • ~/Documents/Code/autri-spike-agentcore/ directory (after extracting auth.ts + Dockerfile into autri repo)

IaC strategy for EPIC-4

Spike artifacts are console-clicked. EPIC-4 rebuilds in AWS CDK (TypeScript). Mapping:

Spike resource | EPIC-4 CDK construct
Cognito User Pool + Hosted UI | aws-cdk-lib/aws-cognito.UserPool + UserPoolDomain
Google IdP federation | UserPoolIdentityProviderGoogle
OAuth Resource Server | UserPool.addResourceServer
App Client | UserPool.addClient
ACM cert | aws-cdk-lib/aws-certificatemanager.Certificate + DNS validation
ECR Repo | aws-cdk-lib/aws-ecr.Repository
AgentCore Runtime IAM role | aws-cdk-lib/aws-iam.Role with bedrock-agentcore trust policy
AgentCore Runtime | L1 CfnAgentRuntime construct (or custom resource if no high-level construct yet)
CloudFront + custom domain | aws-cdk-lib/aws-cloudfront.Distribution with origin path rewrite
CloudWatch Dashboard | aws-cdk-lib/aws-cloudwatch.Dashboard
Budgets + Anomaly Detection | aws-cdk-lib/aws-budgets.CfnBudget + aws-cdk-lib/aws-ce.CfnAnomalyMonitor

OAuth metadata proxy (F2) is a new EPIC-4 module: API Gateway HTTP API + Lambda function with the discovery transformation logic.


Carrying forward to next.md

  • ✅ Stack B LOCKED — EPIC-4 starts with confidence
  • ⏳ EPIC-3 AgentCore-readiness pass (~2-3 days) before EPIC-4 deploy work
  • ⏳ Decide: fold into EPIC-4 vs new EPIC-3.5 (lean: fold into EPIC-4)
  • ⏳ OAuth metadata proxy design — Foundry-refine before implementing
  • ⏳ Connector-ID-in-JWT pattern — design + Cognito Pre-Token-Generation Lambda
  • 🗑️ Spike teardown plan (whitelist of resources to delete after EPIC-4 deploy is up)
