EPIC-1 Spike Findings: AgentCore Runtime POC

Written 2026-05-21 end-of-spike. Decision: Stack B LOCKED. AgentCore Runtime validated end-to-end against our actual production stack (Node/TS + @modelcontextprotocol/sdk@1.29 + Cognito JWKS).

Spike artifacts (resource references)

Operational findings (real surprises worth capturing)

Decision

Stack B is locked. Proceed with EPIC-4 production deploy targeting AgentCore Runtime + Amplify + Fargate Tasks + RDS + Cognito.

Stack A (all-Fargate fallback) is no longer needed. AgentCore Runtime works for our use case with documented operational findings (below).

What was validated (the wedge gate, empirically)

Component	Status
AWS Organization + autri-prod sub-account with centralized root access	✅
IAM admin user + MFA (1Password TOTP) in both management and autri-prod	✅
ACM cert for `mcp.autri.ai` (Cloudflare DNS validation)	✅
AWS Budgets ($50/$100/$200/forecast $220) at management account	✅
Cost Anomaly Detection ($10 daily threshold)	✅
Cognito user pool + Google federated IdP	✅
Cognito hosted UI (`autri-auth.auth.us-east-1.amazoncognito.com`)	✅
OAuth Resource Server (`mcp.autri.ai` with `mcp.invoke` scope)	✅
App client `autri-mcp-client` (public, code flow + PKCE)	✅
`@modelcontextprotocol/sdk@1.29` Streamable HTTP transport on Node 22 arm64	✅
Defense-in-depth CognitoJwksAuth pattern (server validates JWT AgentCore already validated)	✅
AgentCore Runtime deployment (MCP serverProtocol)	✅
Cognito JWT → AgentCore customJWTAuthorizer validation	✅
Tool call (`hello`) end-to-end with real Cognito JWT	✅
CloudWatch logging + metrics dashboard	✅
MicroVM swap test — zero downtime across rollover	✅

End-to-end proof:

$ curl -X POST <agentcore-url> -H "Authorization: Bearer <cognito-jwt>" -d '{tools/call hello}'
→ "Hello, EPIC-1 spike end-to-end! Autri spike MCP server is alive on AgentCore Runtime.
   Authenticated sub: 6u1mitro1km7h6qjt4l0t23fpd. Scope: mcp.autri.ai/mcp.invoke."

F1 — AgentCore strips Authorization header by default

Out of the box, AgentCore Runtime's customJWTAuthorizer consumes the Authorization header and does NOT forward it to the container. Our container's defense-in-depth Cognito JWKS validation fails with 401 "Missing Authorization header" until you explicitly add Authorization to requestHeaderConfiguration.requestHeaderAllowlist.

Implication for EPIC-3: Defense-in-depth pattern requires explicit header-forwarding config. Trade-off: per-request JWT validation cost in the container (~1ms via cached JWKS) for finer-grained authz logic (scope checks, custom claim handling).

Recommendation: Keep defense-in-depth on for production. AgentCore's gateway-level validation is coarse (issuer + audience + scope + clients); container-level adds scope/claim flexibility for connector-id-in-JWT pattern (see F4).
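To make the container-side half of this concrete, here is a minimal sketch of the header check and scope check that run once Authorization is on the allowlist. The helper names (`extractBearerToken`, `hasScope`) and the "Malformed" message are hypothetical and are not taken from the spike's CognitoJwksAuth class. Signature verification against the JWKS is deliberately left out.

```typescript
// Hypothetical helpers; the spike's real CognitoJwksAuth class is not shown in this document.
type IncomingHeaders = Record<string, string | undefined>;

type BearerResult =
  | { ok: true; token: string }
  | { ok: false; status: 401; message: string };

// Pull the bearer token out of the forwarded Authorization header.
function extractBearerToken(headers: IncomingHeaders): BearerResult {
  // Node lower-cases incoming header names; accept both spellings to be safe.
  const raw = headers["authorization"] ?? headers["Authorization"];
  if (!raw) {
    // This is the failure seen in F1 before the allowlist was configured.
    return { ok: false, status: 401, message: "Missing Authorization header" };
  }
  const match = /^Bearer\s+(\S+)$/i.exec(raw.trim());
  if (!match) {
    return { ok: false, status: 401, message: "Malformed Authorization header" };
  }
  return { ok: true, token: match[1] };
}

// Cognito access tokens carry scopes as one space-separated string in `scope`.
function hasScope(claims: { scope?: string }, required: string): boolean {
  return (claims.scope ?? "").split(" ").includes(required);
}
```

In a real handler the token would first be verified against the cached JWKS, and only the verified claims would be passed to `hasScope(claims, "mcp.autri.ai/mcp.invoke")`.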

F2 — Cognito doesn't expose RFC 8414 OAuth metadata

https://cognito-idp.us-east-1.amazonaws.com/{user-pool-id}/.well-known/oauth-authorization-server returns HTTP 400 (BadRequest). Cognito only exposes /.well-known/openid-configuration (OIDC discovery).

The MCP spec OAuth flow (2025-06+ and current Claude.ai Custom Connector implementation) requires RFC 8414 metadata. Both mcp-remote and Claude.ai's backend Custom Connector failed at the OAuth discovery step:

mcp-remote: HTTP 404: Invalid OAuth error response: ... Invalid api path
Claude.ai Custom Connector: Authorization with the MCP server failed. ofid_87082e1930838e50

Implication for EPIC-3: Need an OAuth metadata proxy or alternative pattern. Options ranked:

Option	Effort	Trade-off
Lambda + API Gateway proxy that transforms OIDC → RFC 8414 metadata	~1-2 days	Stays with Cognito; minimal infra
Migrate to AWS AgentCore Identity (newer service)	unknown / spike-needed	Purpose-built but new
Replace Cognito hosted UI with custom Express+jose OAuth server	~3-5 days	Most flexible, most ops surface
Drop Cognito for Auth0/Okta	~1-2 days	Vendor switch; cost adds up

Recommendation: RFC 8414 proxy via Lambda. Cheapest, preserves existing Cognito investment, fits CDK rebuild for EPIC-4.
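The core of the recommended proxy is a pure transformation from Cognito's OIDC discovery document to RFC 8414 metadata, sketched below. The function name `toRfc8414` is hypothetical. The `grant_types_supported` and `code_challenge_methods_supported` values are assumptions based on the app client's code flow + PKCE setup, because Cognito's discovery document does not list them. Whether `issuer` should be rewritten to the proxy's own origin is an open design point and is not handled here.

```typescript
// Subset of an OIDC discovery document (/.well-known/openid-configuration).
interface OidcDiscovery {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  jwks_uri: string;
  response_types_supported?: string[];
  scopes_supported?: string[];
  token_endpoint_auth_methods_supported?: string[];
  revocation_endpoint?: string;
}

// Subset of RFC 8414 authorization server metadata.
interface AuthServerMetadata {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  jwks_uri: string;
  response_types_supported: string[];
  grant_types_supported: string[];
  code_challenge_methods_supported: string[];
  scopes_supported: string[];
  token_endpoint_auth_methods_supported?: string[];
  revocation_endpoint?: string;
}

// Map the OIDC document onto RFC 8414 fields, adding custom resource-server scopes.
function toRfc8414(oidc: OidcDiscovery, customScopes: string[] = []): AuthServerMetadata {
  const scopes = Array.from(new Set([...(oidc.scopes_supported ?? []), ...customScopes]));
  return {
    issuer: oidc.issuer,
    authorization_endpoint: oidc.authorization_endpoint,
    token_endpoint: oidc.token_endpoint,
    jwks_uri: oidc.jwks_uri,
    response_types_supported: oidc.response_types_supported ?? ["code"],
    // Assumption: authorization code flow with refresh tokens.
    grant_types_supported: ["authorization_code", "refresh_token"],
    // Assumption: PKCE with S256, matching the public app client.
    code_challenge_methods_supported: ["S256"],
    scopes_supported: scopes,
    token_endpoint_auth_methods_supported: oidc.token_endpoint_auth_methods_supported,
    revocation_endpoint: oidc.revocation_endpoint,
  };
}
```

The Lambda would fetch the OIDC document once, cache it, and serve `JSON.stringify(toRfc8414(doc, ["mcp.autri.ai/mcp.invoke"]))` at `/.well-known/oauth-authorization-server`.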

F3 — AgentCore Runtime IDs are auto-suffixed; URL not stable across recreations

The runtime ARN is arn:aws:bedrock-agentcore:us-east-1:878013574001:runtime/autri_spike_agentcore-knKc5zFWhk — the -knKc5zFWhk is an auto-generated suffix. Delete + recreate gives a different suffix → different URL. The path-encoded ARN in the invocation URL means standard CNAMEs can't proxy it.

Implication for EPIC-4: Custom domain (mcp.autri.ai) requires a path-rewriting proxy. CloudFront with origin path policy is the natural fit (ACM cert in us-east-1 we provisioned plugs in directly). Alternative: Cloudflare Worker.

Validated within a single runtime lifetime: the URL is stable across container image updates and env-var changes; only deleting and recreating the runtime produces a fresh ID.
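The sketch below shows how the invocation URL is derived from the ARN, which is why a new suffix means a new URL. It assumes the `<encoded-arn>` segment is plain percent-encoding (as produced by `encodeURIComponent`). The function names are hypothetical.

```typescript
// Path + query that a path-rewriting proxy would need to send to the origin.
function invocationPath(runtimeArn: string, qualifier: string = "DEFAULT"): string {
  // Assumption: the ARN is percent-encoded into a single path segment.
  const encodedArn = encodeURIComponent(runtimeArn);
  return `/runtimes/${encodedArn}/invocations?qualifier=${encodeURIComponent(qualifier)}`;
}

// Full AgentCore Runtime endpoint URL for a given region and ARN.
function invocationUrl(runtimeArn: string, region: string = "us-east-1"): string {
  return `https://bedrock-agentcore.${region}.amazonaws.com${invocationPath(runtimeArn)}`;
}
```

Because the ARN sits in the path rather than the hostname, a CNAME from `mcp.autri.ai` alone cannot reach it. Something in front has to map `/` to the value of `invocationPath(...)` for the current runtime.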

F4 — Single AgentCore URL ≠ existing path-based connector scheme

Autri's current MCP server uses /c/:connectorId/mcp (path-based per-connector routing — see autri/mcp-servers/doc-search/src/server.ts). AgentCore Runtime exposes ONE URL per runtime (no path param support).

Implication for EPIC-3: Need a new routing pattern. Options ranked:

Option	Notes
Connector ID as JWT custom claim	Cleanest. Mint-token flow adds `connector_id` claim; server reads from token. Preserves per-request scope binding.
Connector ID as custom HTTP header (`X-Autri-Connector-Id`)	Works but separates auth from connector identity. Allowlist required.
One AgentCore runtime per connector	Operationally heavy at scale.
Encode connector ID in OAuth scope (`mcp.autri.ai/c-{uuid}/invoke`)	Weird, doesn't scale, breaks scope semantics.

Recommendation: Custom JWT claim. Aligns with Cognito's app-client / pre-token-generation Lambda trigger pattern.
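A rough sketch of what the server-side change might look like: the legacy path parse next to the claim-based lookup. The function names, the 403 status, and the error message are hypothetical. The only details taken from this document are the `connector_id` claim name and the `/c/:connectorId/mcp` path shape.

```typescript
// Claims as they might appear on a verified access token (subset).
interface TokenClaims {
  sub: string;
  scope?: string;
  connector_id?: unknown; // custom claim added by the mint-token flow
}

type ConnectorResult =
  | { ok: true; connectorId: string }
  | { ok: false; status: 403; message: string };

// Legacy scheme: connector ID comes from the request path /c/:connectorId/mcp.
function connectorIdFromPath(path: string): string | null {
  const match = /^\/c\/([^/]+)\/mcp$/.exec(path);
  return match ? match[1] : null;
}

// AgentCore scheme: one URL per runtime, so the connector ID comes from the token.
function connectorIdFromClaims(claims: TokenClaims): ConnectorResult {
  const value = claims.connector_id;
  if (typeof value !== "string" || value.length === 0) {
    return { ok: false, status: 403, message: "Token has no connector_id claim" };
  }
  return { ok: true, connectorId: value };
}
```

Reading the ID from already-verified claims keeps connector identity bound to the same signature check as the scope, which is the property the header-based option gives up.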

F5 — MicroVM swaps are transparent for stateless workloads

5 PRE-swap + 5 DURING-swap + 5 POST-swap curl calls all returned HTTP 200. AgentCore drains old microVMs while new ones spin up; the client sees zero observable disruption. This validates D34's "stateless + JSON responses" choice — no client-side reconnect logic needed for our pattern.

Implication: EPIC-3 can confidently stay stateless + JSON for v1. Re-evaluate stateful mode only if a real use case surfaces (multi-step tool sessions, streaming progress notifications mid-tool).

F6 — AgentCore can serve traffic during UPDATING state

Contrary to expectation, AgentCore continued serving requests via existing microVMs while the runtime's status was UPDATING. Blue/green rollover is built in; no maintenance window is needed for env var or image updates.

F7 — Cognito user pool isn't tracked by CloudTrail for `/oauth2/token` calls

CloudTrail captures management-plane API calls (CreateUserPool, etc.) but NOT user-facing OAuth endpoint calls (/oauth2/authorize, /oauth2/token). Debugging OAuth issues requires either direct response inspection or enabling Cognito Advanced Security (paid feature).

Implication: For EPIC-3 OAuth debugging, the metadata proxy (F2) gives us a chokepoint where we can add structured logging.

F8 — Container-side per-call logging gap

Our spike server didn't log per-request events (only startup messages). This made debugging the F1 Authorization-strip issue harder than necessary.

Implication for EPIC-3: Add a request log line (method, sub, tool name, result status, latency) to the MCP server. Autri's existing server.ts already records this via audit log writes; add stdout logging as well so the same data is visible in CloudWatch.
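One possible shape for that log line is a single JSON object per request, which CloudWatch Logs Insights can query by field. The field names and the `formatRequestLog` helper below are illustrative assumptions, not the format used by the existing audit log.

```typescript
// Fields named in the F8 implication; the exact names are an assumption.
interface RequestLogFields {
  method: string;            // JSON-RPC method, e.g. "tools/call"
  sub: string;               // authenticated subject from the JWT
  tool: string | null;       // tool name, or null for non-tool methods
  status: "ok" | "error";    // result status
  latencyMs: number;         // wall-clock handling time
}

// Serialize one request as a single JSON line with a timestamp.
function formatRequestLog(fields: RequestLogFields, now: Date = new Date()): string {
  return JSON.stringify({ ts: now.toISOString(), ...fields });
}
```

The server would write the returned string to stdout once per request, after the handler resolves or rejects, so failures such as the F1 401 show up with their method and status.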

EPIC-3 follow-up tasks (compounded from spike)

These are concrete EPIC-3 tasks the spike surfaced. EPIC-3's local-wedge-gate goal was already met last session; this list is the AgentCore-readiness pass that bridges EPIC-3's local server to EPIC-4's production deploy.

Swap HS256DevAuth → CognitoJwksAuth in autri/mcp-servers/doc-search/src/auth.ts. Spike's CognitoJwksAuth class drops in via the existing AuthVerifier interface. Add buildAuthVerifier() switching on AGENTCORE_AUTH=cognito|hs256-dev env var. (~30 min)
Add Dockerfile to autri/mcp-servers/doc-search/. Lift spike's multi-stage arm64 pattern; adjust for pnpm workspace deps (@autri/retrieval, @autri/db). (~30 min)
Implement connector-ID-in-JWT pattern (F4). Update dev:make-token to include connector_id claim; update server to read from token instead of path. Add Cognito Pre-Token-Generation Lambda for production. (~1-2 hrs design + implementation)
OAuth metadata proxy for Cognito (F2). Lambda + API Gateway that exposes RFC 8414 metadata fronting Cognito. Tested against mcp-remote and claude.ai Custom Connector. (~1-2 days)
Add stdout request logging to MCP server (F8). One line per request: method, sub, tool, status, latency. (~30 min)
PKCE enforcement on production app client. Current autri-mcp-client allows code flow without PKCE; lock down for production. (~10 min)

Total: ~2-3 days of focused work to make the existing MCP server AgentCore-deployable.

Open question: how to slice this work — amendment to EPIC-3, new EPIC-3.5, or folded into EPIC-4 since it's all production-deploy prep? Lean: fold into EPIC-4 since EPIC-3 already met its locally-defined goal (wedge gate passed last session) and the new work is fundamentally production-deploy.
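For the first task in the list above, the buildAuthVerifier() switch could be sketched roughly as follows. The `AuthVerifier` shape here is a stand-in, since the real interface in autri/mcp-servers/doc-search/src/auth.ts is not reproduced in this document. The factories are injected so the sketch does not depend on the actual CognitoJwksAuth or HS256DevAuth constructors.

```typescript
// Stand-in for the real AuthVerifier interface (assumed shape).
interface AuthVerifier {
  readonly kind: string;
  verify(token: string): Promise<{ sub: string }>;
}

type AuthMode = "cognito" | "hs256-dev";

// Validate the AGENTCORE_AUTH value; fail closed on anything unexpected.
function parseAuthMode(value: string | undefined): AuthMode {
  if (value === "cognito" || value === "hs256-dev") {
    return value;
  }
  throw new Error(`AGENTCORE_AUTH must be "cognito" or "hs256-dev", got ${String(value)}`);
}

// Pick the verifier for the configured mode.
function buildAuthVerifier(
  env: Record<string, string | undefined>,
  factories: Record<AuthMode, () => AuthVerifier>
): AuthVerifier {
  const mode = parseAuthMode(env["AGENTCORE_AUTH"]);
  return factories[mode]();
}
```

Throwing on a missing or misspelled value, rather than defaulting to the dev verifier, avoids a production container silently starting with HS256 dev auth.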

EPIC-4 follow-up tasks (new from spike)

CloudFront in front of AgentCore Runtime for mcp.autri.ai custom domain (F3). Origin path policy rewrites / → /runtimes/{encoded-arn}/invocations?qualifier=DEFAULT. ACM cert in us-east-1 already provisioned.
CDK module for AgentCore Runtime with our config pattern: requestHeaderConfiguration.requestHeaderAllowlist=["Authorization"], customJWTAuthorizer.allowedClients + allowedScopes, env vars for auth mode + Cognito issuer.
Tear down spike artifacts before EPIC-4 starts: AgentCore Runtime, ECR repo, IAM role, M2M test client (autri-spike-m2m-test). Keep: user pool, hosted UI domain, primary app client, ACM cert.
Migrate dev secrets to Parameter Store SecureString (Anthropic API key, JWT signing keys for non-Cognito paths). Cognito-stored Google OAuth secret stays in Cognito.
Production Cognito callback URLs: replace the spike's localhost entries with production-only callbacks; keep https://claude.ai/api/mcp/auth_callback etc.
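As a starting point for the CDK module task above, the runtime settings can be collected in one plain object before being mapped onto whatever construct ends up being used. The sketch below is not a CDK construct. The `authorizerConfiguration` wrapper and `discoveryUrl` field are assumptions about the AgentCore API shape, and `COGNITO_ISSUER` is a hypothetical env var name; `requestHeaderAllowlist`, `allowedClients`, `allowedScopes`, and `AGENTCORE_AUTH` come from this document.

```typescript
interface RuntimeConfigInput {
  region: string;
  userPoolId: string;
  allowedClients: string[];
  allowedScopes: string[];
  authMode: "cognito" | "hs256-dev";
}

// Build the auth-related runtime settings from a handful of inputs.
function buildRuntimeConfig(input: RuntimeConfigInput) {
  // Cognito issuer URL format: https://cognito-idp.<region>.amazonaws.com/<pool-id>
  const issuer = `https://cognito-idp.${input.region}.amazonaws.com/${input.userPoolId}`;
  return {
    // F1: without this, the container never sees the Authorization header.
    requestHeaderConfiguration: { requestHeaderAllowlist: ["Authorization"] },
    // Assumed wrapper and field names for the gateway-level JWT check.
    authorizerConfiguration: {
      customJWTAuthorizer: {
        discoveryUrl: `${issuer}/.well-known/openid-configuration`,
        allowedClients: input.allowedClients,
        allowedScopes: input.allowedScopes,
      },
    },
    // Env vars consumed by the container's own verifier (defense-in-depth).
    environmentVariables: {
      AGENTCORE_AUTH: input.authMode,
      COGNITO_ISSUER: issuer, // hypothetical variable name
    },
  };
}
```

Deriving both the gateway's discovery URL and the container's issuer from the same pool ID keeps the two validation layers from drifting apart.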

What's deployed in autri-prod (878013574001)

Resource	Identifier
Cognito User Pool	`us-east-1_7YgaDlZlB`
Cognito Hosted UI Domain	`autri-auth.auth.us-east-1.amazoncognito.com`
OAuth Resource Server	`mcp.autri.ai`
OAuth Custom Scope	`mcp.autri.ai/mcp.invoke`
Cognito App Client (user-facing, public)	`autri-mcp-client` / `7o6ieurh03iccad2qncmuqt6qk`
Cognito App Client (M2M test, confidential)	`autri-spike-m2m-test` / `6u1mitro1km7h6qjt4l0t23fpd`
Google IdP	`Google` (federated, mapped attributes: email/sub/given_name/family_name/picture)
ACM Cert	`mcp.autri.ai` + `*.mcp.autri.ai` (us-east-1)
ECR Repo	`878013574001.dkr.ecr.us-east-1.amazonaws.com/autri-spike-agentcore`
AgentCore Runtime	`autri_spike_agentcore-knKc5zFWhk` (v5, READY, MCP protocol)
AgentCore Runtime ARN	`arn:aws:bedrock-agentcore:us-east-1:878013574001:runtime/autri_spike_agentcore-knKc5zFWhk`
AgentCore Runtime Endpoint URL	`https://bedrock-agentcore.us-east-1.amazonaws.com/runtimes/<encoded-arn>/invocations?qualifier=DEFAULT`
IAM Execution Role	`AgentCoreRuntime-AutriSpike`
CloudWatch Dashboard	`autri-spike-agentcore` (us-east-1)
CloudWatch Log Group	`/aws/bedrock-agentcore/runtimes/autri_spike_agentcore-knKc5zFWhk-DEFAULT`

What's deployed in management account (248094863498)

Resource	Identifier
AWS Organization	`o-m23dfb4u9r`
Root OU	`r-lqwd`
autri-prod sub-account	`878013574001` (email: `dan+autri-prod@hannahlabs.ai`)
AWS Budget	`autri-monthly-cost` ($200/mo + thresholds $50/$100/$200/forecast $220)
Cost Anomaly Detection	`autri-anomaly-daily` ($10 threshold)
Centralized Root Access	Enabled (member account root creds removable)

Spike code

Item	Path
Hello-world MCP server (Node/TS)	`~/Documents/Code/autri-spike-agentcore/`
Container image	`878013574001.dkr.ecr.us-east-1.amazonaws.com/autri-spike-agentcore:v0.0.1`
Auth pattern (transfers to EPIC-3)	`~/Documents/Code/autri-spike-agentcore/src/auth.ts`
Dockerfile (transfers to EPIC-3)	`~/Documents/Code/autri-spike-agentcore/Dockerfile`

Files to clean up before EPIC-4

All spike-only resources listed in the "What's deployed in autri-prod" table (keep user pool, hosted UI domain, primary app client, ACM cert; remove the rest)
~/Documents/Code/autri-spike-agentcore/ directory (after extracting auth.ts + Dockerfile into autri repo)

IaC strategy for EPIC-4

Spike artifacts are console-clicked. EPIC-4 rebuilds in AWS CDK (TypeScript). Mapping:

Spike resource	EPIC-4 CDK construct
Cognito User Pool + Hosted UI	`aws-cdk-lib/aws-cognito.UserPool` + `UserPoolDomain`
Google IdP federation	`UserPoolIdentityProviderGoogle`
OAuth Resource Server	`UserPool.addResourceServer`
App Client	`UserPool.addClient`
ACM cert	`aws-cdk-lib/aws-certificatemanager.Certificate` + DNS validation
ECR Repo	`aws-cdk-lib/aws-ecr.Repository`
AgentCore Runtime IAM role	`aws-cdk-lib/aws-iam.Role` with bedrock-agentcore trust policy
AgentCore Runtime	L1 `CfnAgentRuntime` construct (or custom resource if no high-level construct yet)
CloudFront + custom domain	`aws-cdk-lib/aws-cloudfront.Distribution` with origin path rewrite
CloudWatch Dashboard	`aws-cdk-lib/aws-cloudwatch.Dashboard`
Budgets + Anomaly Detection	`aws-cdk-lib/aws-budgets.CfnBudget` + `aws-cdk-lib/aws-ce.CfnAnomalyMonitor`

OAuth metadata proxy (F2) is a new EPIC-4 module: API Gateway HTTP API + Lambda function with the discovery transformation logic.

Carrying forward to next.md

✅ Stack B LOCKED — EPIC-4 starts with confidence
⏳ EPIC-3 AgentCore-readiness pass (~2-3 days) before EPIC-4 deploy work
⏳ Decide: fold into EPIC-4 vs new EPIC-3.5 (lean: fold into EPIC-4)
⏳ OAuth metadata proxy design — Foundry-refine before implementing
⏳ Connector-ID-in-JWT pattern — design + Cognito Pre-Token-Generation Lambda
🗑️ Spike teardown plan (whitelist of resources to delete after EPIC-4 deploy is up)
