EPIC-1 Spike Findings: AgentCore Runtime POC
Written 2026-05-21 end-of-spike. Decision: Stack B LOCKED. AgentCore Runtime validated end-to-end against our actual production stack (Node/TS + @modelcontextprotocol/sdk@1.29 + Cognito JWKS).
Spike artifacts (resource references)
Operational findings (real surprises worth capturing)
Decision
Stack B is locked. Proceed with EPIC-4 production deploy targeting AgentCore Runtime + Amplify + Fargate Tasks + RDS + Cognito.
Stack A (all-Fargate fallback) is no longer needed. AgentCore Runtime works for our use case with documented operational findings (below).
What was validated (the wedge gate, empirically)
| Component | Status |
|---|---|
| AWS Organization + autri-prod sub-account with centralized root access | ✅ |
| IAM admin user + MFA (1Password TOTP) in both management and autri-prod | ✅ |
| ACM cert for mcp.autri.ai (Cloudflare DNS validation) | ✅ |
| AWS Budgets ($50/$100/$200/forecast $220) at management account | ✅ |
| Cost Anomaly Detection ($10 daily threshold) | ✅ |
| Cognito user pool + Google federated IdP | ✅ |
| Cognito hosted UI (autri-auth.auth.us-east-1.amazoncognito.com) | ✅ |
| OAuth Resource Server (mcp.autri.ai with mcp.invoke scope) | ✅ |
| App client autri-mcp-client (public, code flow + PKCE) | ✅ |
@modelcontextprotocol/sdk@1.29 Streamable HTTP transport on Node 22 arm64 | ✅ |
| Defense-in-depth CognitoJwksAuth pattern (server validates JWT AgentCore already validated) | ✅ |
| AgentCore Runtime deployment (MCP serverProtocol) | ✅ |
| Cognito JWT → AgentCore customJWTAuthorizer validation | ✅ |
| Tool call (hello) end-to-end with real Cognito JWT | ✅ |
| CloudWatch logging + metrics dashboard | ✅ |
| MicroVM swap test — zero downtime across rollover | ✅ |
End-to-end proof:
$ curl -X POST <agentcore-url> -H "Authorization: Bearer <cognito-jwt>" -d '{tools/call hello}'
→ "Hello, EPIC-1 spike end-to-end! Autri spike MCP server is alive on AgentCore Runtime.
Authenticated sub: 6u1mitro1km7h6qjt4l0t23fpd. Scope: mcp.autri.ai/mcp.invoke."
F1 — AgentCore strips Authorization header by default
Out of the box, AgentCore Runtime's customJWTAuthorizer consumes the Authorization header — it does NOT forward to the container. Our container's defense-in-depth Cognito JWKS validation fails with 401 "Missing Authorization header" until you explicitly add Authorization to requestHeaderConfiguration.requestHeaderAllowlist.
Implication for EPIC-3: Defense-in-depth pattern requires explicit header-forwarding config. Trade-off: per-request JWT validation cost in the container (~1ms via cached JWKS) for finer-grained authz logic (scope checks, custom claim handling).
Recommendation: Keep defense-in-depth on for production. AgentCore's gateway-level validation is coarse (issuer + audience + scope + clients); container-level adds scope/claim flexibility for connector-id-in-JWT pattern (see F4).
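A minimal sketch of the container-side checks that F1 affects, assuming the Authorization header has been allowlisted and reaches the container. The helper names (extractBearerToken, hasScope) and the "Malformed" error text are hypothetical; only the header name, the "Missing Authorization header" message, and the scope value come from the spike. Signature verification against the Cognito JWKS is deliberately left out here and would still be required before trusting any claims.

```typescript
// Hypothetical helpers; signature verification against JWKS is NOT shown here.
type HeaderMap = Record<string, string | string[] | undefined>;

type BearerResult =
  | { ok: true; token: string }
  | { ok: false; status: 401; error: string };

// Pull the bearer token out of the forwarded Authorization header.
// Node lower-cases incoming header names, so we look up "authorization".
function extractBearerToken(headers: HeaderMap): BearerResult {
  const raw = headers["authorization"];
  const value = Array.isArray(raw) ? raw[0] : raw;
  if (!value) {
    // This is the failure seen in the spike before the allowlist was configured.
    return { ok: false, status: 401, error: "Missing Authorization header" };
  }
  const match = /^Bearer\s+(\S+)$/i.exec(value.trim());
  if (!match) {
    return { ok: false, status: 401, error: "Malformed Authorization header" };
  }
  return { ok: true, token: match[1] };
}

// Check the space-delimited OAuth "scope" claim on ALREADY VERIFIED claims.
function hasScope(claims: { scope?: string }, required: string): boolean {
  return (claims.scope ?? "").split(/\s+/).includes(required);
}
```

Returning a result object instead of throwing keeps the 401 path explicit, which should make a missing-header case easy to spot in request logs.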
F2 — Cognito doesn't expose RFC 8414 OAuth metadata
https://cognito-idp.us-east-1.amazonaws.com/{user-pool-id}/.well-known/oauth-authorization-server returns HTTP 400 (BadRequest). Cognito only exposes /.well-known/openid-configuration (OIDC discovery).
The MCP spec OAuth flow (2025-06+ and current Claude.ai Custom Connector implementation) requires RFC 8414 metadata. Both mcp-remote and Claude.ai's backend Custom Connector failed at the OAuth discovery step:
- mcp-remote: HTTP 404: Invalid OAuth error response: ... Invalid api path
- Claude.ai Custom Connector: Authorization with the MCP server failed. ofid_87082e1930838e50
Implication for EPIC-3: Need an OAuth metadata proxy or alternative pattern. Options ranked:
| Option | Effort | Trade-off |
|---|---|---|
| Lambda + API Gateway proxy that transforms OIDC → RFC 8414 metadata | ~1-2 days | Stays with Cognito; minimal infra |
| Migrate to AWS AgentCore Identity (newer service) | unknown / spike-needed | Purpose-built but new |
| Replace Cognito hosted UI with custom Express+jose OAuth server | ~3-5 days | Most flexible, most ops surface |
| Drop Cognito for Auth0/Okta | ~1-2 days | Vendor switch; cost adds up |
Recommendation: RFC 8414 proxy via Lambda. Cheapest, preserves existing Cognito investment, fits CDK rebuild for EPIC-4.
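The transformation at the heart of the recommended proxy could look roughly like the sketch below: a pure function that maps an OIDC discovery document onto RFC 8414 authorization server metadata. The function and interface names are hypothetical, and the grant_types_supported and code_challenge_methods_supported values are assumptions added on the proxy side (they would need to match how the app client is actually configured). Whether issuer should stay as Cognito's issuer or become the proxy's host is an open design question this sketch does not settle.

```typescript
// Subset of the OIDC discovery document that the proxy would read.
interface OidcDiscovery {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  jwks_uri: string;
  response_types_supported?: string[];
  scopes_supported?: string[];
  token_endpoint_auth_methods_supported?: string[];
}

// Subset of RFC 8414 metadata that the proxy would serve.
interface AuthServerMetadata {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  jwks_uri: string;
  response_types_supported: string[];
  scopes_supported?: string[];
  token_endpoint_auth_methods_supported?: string[];
  grant_types_supported: string[];
  code_challenge_methods_supported: string[];
}

function toAuthServerMetadata(oidc: OidcDiscovery): AuthServerMetadata {
  return {
    // Endpoints are passed through unchanged from the OIDC document.
    issuer: oidc.issuer,
    authorization_endpoint: oidc.authorization_endpoint,
    token_endpoint: oidc.token_endpoint,
    jwks_uri: oidc.jwks_uri,
    response_types_supported: oidc.response_types_supported ?? ["code"],
    scopes_supported: oidc.scopes_supported,
    token_endpoint_auth_methods_supported: oidc.token_endpoint_auth_methods_supported,
    // ASSUMPTION: values supplied by the proxy, not read from the source document.
    grant_types_supported: ["authorization_code", "refresh_token"],
    code_challenge_methods_supported: ["S256"],
  };
}
```

Keeping the mapping free of I/O means the Lambda handler only has to fetch the OIDC document, call this function, and serialize the result, and the mapping itself can be unit-tested without AWS.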
F3 — AgentCore Runtime IDs are auto-suffixed; URL not stable across recreations
The runtime ARN is arn:aws:bedrock-agentcore:us-east-1:878013574001:runtime/autri_spike_agentcore-knKc5zFWhk — the -knKc5zFWhk is an auto-generated suffix. Delete + recreate gives a different suffix → different URL. The path-encoded ARN in the invocation URL means standard CNAMEs can't proxy it.
Implication for EPIC-4: Custom domain (mcp.autri.ai) requires a path-rewriting proxy. CloudFront with origin path policy is the natural fit (ACM cert in us-east-1 we provisioned plugs in directly). Alternative: Cloudflare Worker.
Validated within a single runtime lifetime: the URL is stable across container image updates and env-var changes; only deleting and recreating the runtime produces a fresh ID.
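To make the path-encoding problem concrete, the sketch below builds the invocation URL from a runtime ARN. It assumes the encoded ARN in the path is plain percent-encoding of the ARN string, and the function name is hypothetical; the host, path shape, and qualifier follow the endpoint URL recorded for the spike runtime.

```typescript
// ASSUMPTION: "<encoded-arn>" means the ARN percent-encoded as a single path segment.
function buildInvocationUrl(
  runtimeArn: string,
  region: string = "us-east-1",
  qualifier: string = "DEFAULT",
): string {
  const encodedArn = encodeURIComponent(runtimeArn);
  return (
    `https://bedrock-agentcore.${region}.amazonaws.com` +
    `/runtimes/${encodedArn}/invocations?qualifier=${encodeURIComponent(qualifier)}`
  );
}
```

Because the auto-generated suffix is part of the ARN, a recreated runtime changes the encoded path segment, so whatever proxy fronts mcp.autri.ai would need the ARN as configuration rather than a hard-coded path.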
F4 — Single AgentCore URL ≠ existing path-based connector scheme
Autri's current MCP server uses /c/:connectorId/mcp (path-based per-connector routing — see autri/mcp-servers/doc-search/src/server.ts). AgentCore Runtime exposes ONE URL per runtime (no path param support).
Implication for EPIC-3: Need a new routing pattern. Options ranked:
| Option | Notes |
|---|---|
| Connector ID as JWT custom claim | Cleanest. Mint-token flow adds connector_id claim; server reads from token. Preserves per-request scope binding. |
| Connector ID as custom HTTP header (X-Autri-Connector-Id) | Works but separates auth from connector identity. Allowlist required. |
| One AgentCore runtime per connector | Operationally heavy at scale. |
| Encode connector ID in OAuth scope (mcp.autri.ai/c-{uuid}/invoke) | Weird, doesn't scale, breaks scope semantics. |
Recommendation: Custom JWT claim. Aligns with Cognito's app-client / pre-token-generation Lambda trigger pattern.
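On the server side, the recommended pattern amounts to reading the connector ID from verified token claims instead of from the URL path. The following is a rough sketch under assumptions: the claim name connector_id matches the follow-up task list, while the function name and error wording are hypothetical, and the claims are assumed to come from a token whose signature has already been verified.

```typescript
// Claims from a token that has ALREADY passed signature and issuer checks.
type VerifiedClaims = Record<string, unknown>;

// Resolve the connector for this request from the token, replacing the
// /c/:connectorId/mcp path parameter used by the current server.
function resolveConnectorId(
  claims: VerifiedClaims,
  claimName: string = "connector_id",
): string {
  const value = claims[claimName];
  if (typeof value !== "string" || value.length === 0) {
    // Reject rather than fall back to a default connector.
    throw new Error(`Token is missing required claim "${claimName}"`);
  }
  return value;
}
```

Failing closed when the claim is absent keeps the per-request binding between token and connector that the path-based scheme provided.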
F5 — MicroVM swaps are transparent for stateless workloads
5 PRE-swap + 5 DURING-swap + 5 POST-swap curl calls all returned HTTP 200. AgentCore drains old microVMs while new ones spin up; the client sees zero observable disruption. This validates D34's "stateless + JSON responses" choice — no client-side reconnect logic needed for our pattern.
Implication: EPIC-3 can confidently stay stateless + JSON for v1. Re-evaluate stateful mode only if a real use case surfaces (multi-step tool sessions, streaming progress notifications mid-tool).
F6 — AgentCore can serve traffic during UPDATING state
Contrary to expectation, AgentCore continued serving requests via existing microVMs while the runtime's status was UPDATING. Blue/green rollover is built-in; no maintenance window needed for env var or image updates.
F7 — Cognito user pool isn't tracked by CloudTrail for /oauth2/token calls
CloudTrail captures management-plane API calls (CreateUserPool, etc.) but NOT user-facing OAuth endpoint calls (/oauth2/authorize, /oauth2/token). Debugging OAuth issues requires either direct response inspection or enabling Cognito Advanced Security (paid feature).
Implication: For EPIC-3 OAuth debugging, the metadata proxy (F2) gives us a chokepoint where we can add structured logging.
F8 — Container-side per-call logging gap
Our spike server didn't log per-request events (only startup messages). This made debugging the F1 Authorization-strip issue harder than necessary.
Implication for EPIC-3: Add a request log line (method, sub, tool name, result status, latency) to the MCP server. Already exists in autri's existing server.ts via audit log writes, but add stdout logging for CloudWatch visibility too.
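One possible shape for that log line is a single JSON object per request, which CloudWatch Logs Insights can filter on. This is a sketch only: the field names mirror the list above (method, sub, tool name, result status, latency), while the function name, the ts field, and the JSON format itself are assumptions.

```typescript
interface RequestLogFields {
  method: string;          // JSON-RPC method, e.g. "tools/call"
  sub: string;             // subject claim of the caller
  tool: string | null;     // tool name, or null for non-tool methods
  status: "ok" | "error";  // result status
  latencyMs: number;       // wall-clock handling time
}

// Serialize one request as a single line suitable for stdout.
// The timestamp is injectable so the output is deterministic in tests.
function formatRequestLog(fields: RequestLogFields, now: Date = new Date()): string {
  return JSON.stringify({ ts: now.toISOString(), ...fields });
}
```

Writing the returned string with a single stdout write per request keeps each event on one line, so a missing sub or an error status (as in the F1 case) shows up directly in the log group.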
EPIC-3 follow-up tasks (compounded from spike)
These are concrete EPIC-3 tasks the spike surfaced. EPIC-3's local-wedge-gate goal was already met last session; this list is the AgentCore-readiness pass that bridges EPIC-3's local server to EPIC-4's production deploy.
- Swap HS256DevAuth → CognitoJwksAuth in autri/mcp-servers/doc-search/src/auth.ts. Spike's CognitoJwksAuth class drops in via the existing AuthVerifier interface. Add buildAuthVerifier() switching on AGENTCORE_AUTH=cognito|hs256-dev env var. (~30 min)
- Add Dockerfile to autri/mcp-servers/doc-search/. Lift spike's multi-stage arm64 pattern; adjust for pnpm workspace deps (@autri/retrieval, @autri/db). (~30 min)
- Implement connector-ID-in-JWT pattern (F4). Update dev:make-token to include connector_id claim; update server to read from token instead of path. Add Cognito Pre-Token-Generation Lambda for production. (~1-2 hrs design + implementation)
- OAuth metadata proxy for Cognito (F2). Lambda + API Gateway that exposes RFC 8414 metadata fronting Cognito. Tested against mcp-remote and claude.ai Custom Connector. (~1-2 days)
- Add stdout request logging to MCP server (F8). One line per request: method, sub, tool, status, latency. (~30 min)
- PKCE enforcement on production app client. Current autri-mcp-client allows code flow without PKCE; lock down for production. (~10 min)
Total: ~2-3 days of focused work to make the existing MCP server AgentCore-deployable.
Open question: how to slice this work — amendment to EPIC-3, new EPIC-3.5, or folded into EPIC-4 since it's all production-deploy prep? Lean: fold into EPIC-4 since EPIC-3 already met its locally-defined goal (wedge gate passed last session) and the new work is fundamentally production-deploy.
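For the buildAuthVerifier() task in the list above, a minimal sketch of the env-var switch might look like this. The AuthVerifier shape shown here is a guess (the real interface lives in the autri repo and may differ), and the verifier factories are passed in so the sketch does not have to assume how CognitoJwksAuth or HS256DevAuth are constructed.

```typescript
// HYPOTHETICAL shape; the real AuthVerifier interface may differ.
interface AuthVerifier {
  verify(token: string): Promise<Record<string, unknown>>;
}

type AuthMode = "cognito" | "hs256-dev";

// Validate the AGENTCORE_AUTH value instead of silently defaulting.
function parseAuthMode(raw: string | undefined): AuthMode {
  if (raw === "cognito" || raw === "hs256-dev") return raw;
  throw new Error(`AGENTCORE_AUTH must be "cognito" or "hs256-dev", got: ${String(raw)}`);
}

// Pick the verifier for the configured mode. Factories are lazy so the
// unused verifier is never constructed.
function buildAuthVerifier(
  env: Record<string, string | undefined>,
  factories: Record<AuthMode, () => AuthVerifier>,
): AuthVerifier {
  return factories[parseAuthMode(env["AGENTCORE_AUTH"])]();
}
```

Throwing on an unset or unknown value is a design choice in this sketch: it avoids a production container quietly starting with the dev verifier if the env var is forgotten.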
EPIC-4 follow-up tasks (new from spike)
- CloudFront in front of AgentCore Runtime for mcp.autri.ai custom domain (F3). Origin path policy rewrites / → /runtimes/{encoded-arn}/invocations?qualifier=DEFAULT. ACM cert in us-east-1 already provisioned.
- CDK module for AgentCore Runtime with our config pattern: requestHeaderConfiguration.requestHeaderAllowlist=["Authorization"], customJWTAuthorizer.allowedClients + allowedScopes, env vars for auth mode + Cognito issuer.
- Tear down spike artifacts before EPIC-4 starts: AgentCore Runtime, ECR repo, IAM role, M2M test client (autri-spike-m2m-test). Keep: user pool, hosted UI domain, primary app client, ACM cert.
- Migrate dev secrets to Parameter Store SecureString (Anthropic API key, JWT signing keys for non-Cognito paths). Cognito-stored Google OAuth secret stays in Cognito.
- Production Cognito callback URLs: replace the spike's localhost entries with production-only callbacks; keep https://claude.ai/api/mcp/auth_callback etc.
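The auth-related settings for the CDK module in the list above could be assembled as a plain object before being handed to whatever construct ends up creating the runtime. This is a sketch under assumptions: the nesting (authorizerConfiguration wrapping customJWTAuthorizer, and the discoveryUrl field) reflects our reading of the runtime configuration and should be checked against the current API reference; the builder name and option names are hypothetical.

```typescript
// ASSUMED config shape; verify field names against the AgentCore API reference.
interface RuntimeAuthConfig {
  authorizerConfiguration: {
    customJWTAuthorizer: {
      discoveryUrl: string;
      allowedClients: string[];
      allowedScopes: string[];
    };
  };
  requestHeaderConfiguration: { requestHeaderAllowlist: string[] };
}

function buildRuntimeAuthConfig(opts: {
  region: string;
  userPoolId: string;
  clientIds: string[];
  scopes: string[];
}): RuntimeAuthConfig {
  // Cognito exposes OIDC discovery under the user pool's issuer URL (see F2).
  const issuer = `https://cognito-idp.${opts.region}.amazonaws.com/${opts.userPoolId}`;
  return {
    authorizerConfiguration: {
      customJWTAuthorizer: {
        discoveryUrl: `${issuer}/.well-known/openid-configuration`,
        allowedClients: [...opts.clientIds],
        allowedScopes: [...opts.scopes],
      },
    },
    // Forward the Authorization header so container-side validation works (F1).
    requestHeaderConfiguration: { requestHeaderAllowlist: ["Authorization"] },
  };
}
```

Deriving the discovery URL from the region and pool ID keeps the Cognito issuer in one place, so the same values can also feed the container's env vars.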
What's deployed in autri-prod (878013574001)
| Resource | Identifier |
|---|---|
| Cognito User Pool | us-east-1_7YgaDlZlB |
| Cognito Hosted UI Domain | autri-auth.auth.us-east-1.amazoncognito.com |
| OAuth Resource Server | mcp.autri.ai |
| OAuth Custom Scope | mcp.autri.ai/mcp.invoke |
| Cognito App Client (user-facing, public) | autri-mcp-client / 7o6ieurh03iccad2qncmuqt6qk |
| Cognito App Client (M2M test, confidential) | autri-spike-m2m-test / 6u1mitro1km7h6qjt4l0t23fpd |
| Google IdP | Google (federated, mapped attributes: email/sub/given_name/family_name/picture) |
| ACM Cert | mcp.autri.ai + *.mcp.autri.ai (us-east-1) |
| ECR Repo | 878013574001.dkr.ecr.us-east-1.amazonaws.com/autri-spike-agentcore |
| AgentCore Runtime | autri_spike_agentcore-knKc5zFWhk (v5, READY, MCP protocol) |
| AgentCore Runtime ARN | arn:aws:bedrock-agentcore:us-east-1:878013574001:runtime/autri_spike_agentcore-knKc5zFWhk |
| AgentCore Runtime Endpoint URL | https://bedrock-agentcore.us-east-1.amazonaws.com/runtimes/<encoded-arn>/invocations?qualifier=DEFAULT |
| IAM Execution Role | AgentCoreRuntime-AutriSpike |
| CloudWatch Dashboard | autri-spike-agentcore (us-east-1) |
| CloudWatch Log Group | /aws/bedrock-agentcore/runtimes/autri_spike_agentcore-knKc5zFWhk-DEFAULT |
What's deployed in management account (248094863498)
| Resource | Identifier |
|---|---|
| AWS Organization | o-m23dfb4u9r |
| Root OU | r-lqwd |
| autri-prod sub-account | 878013574001 (email: dan+autri-prod@hannahlabs.ai) |
| AWS Budget | autri-monthly-cost ($200/mo + thresholds $50/$100/$200/forecast $220) |
| Cost Anomaly Detection | autri-anomaly-daily ($10 threshold) |
| Centralized Root Access | Enabled (member account root creds removable) |
Spike code
| Item | Path |
|---|---|
| Hello-world MCP server (Node/TS) | ~/Documents/Code/autri-spike-agentcore/ |
| Container image | 878013574001.dkr.ecr.us-east-1.amazonaws.com/autri-spike-agentcore:v0.0.1 |
| Auth pattern (transfers to EPIC-3) | ~/Documents/Code/autri-spike-agentcore/src/auth.ts |
| Dockerfile (transfers to EPIC-3) | ~/Documents/Code/autri-spike-agentcore/Dockerfile |
Files to clean up before EPIC-4
- All resources in the "autri-prod deployed" table marked spike-only (keep user pool, hosted UI domain, primary app client, ACM cert; remove the rest)
- ~/Documents/Code/autri-spike-agentcore/ directory (after extracting auth.ts + Dockerfile into autri repo)
IaC strategy for EPIC-4
Spike artifacts are console-clicked. EPIC-4 rebuilds in AWS CDK (TypeScript). Mapping:
| Spike resource | EPIC-4 CDK construct |
|---|---|
| Cognito User Pool + Hosted UI | aws-cdk-lib/aws-cognito.UserPool + UserPoolDomain |
| Google IdP federation | UserPoolIdentityProviderGoogle |
| OAuth Resource Server | UserPool.addResourceServer |
| App Client | UserPool.addClient |
| ACM cert | aws-cdk-lib/aws-certificatemanager.Certificate + DNS validation |
| ECR Repo | aws-cdk-lib/aws-ecr.Repository |
| AgentCore Runtime IAM role | aws-cdk-lib/aws-iam.Role with bedrock-agentcore trust policy |
| AgentCore Runtime | L1 CfnAgentRuntime construct (or custom resource if no high-level construct yet) |
| CloudFront + custom domain | aws-cdk-lib/aws-cloudfront.Distribution with origin path rewrite |
| CloudWatch Dashboard | aws-cdk-lib/aws-cloudwatch.Dashboard |
| Budgets + Anomaly Detection | aws-cdk-lib/aws-budgets.CfnBudget + aws-cdk-lib/aws-ce.CfnAnomalyMonitor |
OAuth metadata proxy (F2) is a new EPIC-4 module: API Gateway HTTP API + Lambda function with the discovery transformation logic.
Carrying forward to next.md
- ✅ Stack B LOCKED — EPIC-4 starts with confidence
- ⏳ EPIC-3 AgentCore-readiness pass (~2-3 days) before EPIC-4 deploy work
- ⏳ Decide: fold into EPIC-4 vs new EPIC-3.5 (lean: fold into EPIC-4)
- ⏳ OAuth metadata proxy design — Foundry-refine before implementing
- ⏳ Connector-ID-in-JWT pattern — design + Cognito Pre-Token-Generation Lambda
- 🗑️ Spike teardown plan (whitelist of resources to delete after EPIC-4 deploy is up)