Sub-system design doc for autri's ingestion + file-upload pipeline. Scopes EPIC-4.5, the carve-out between EPIC-4 (AWS deploy) and EPIC-5 (beta onboarding). Driven by the 2026-05-28 cloud-native audit which found 13 patterns that work locally but break on AWS Lambda (6MB request cap, read-only filesystem, no CLI binaries, fire-and-forget process death). Captured in cross-project memory as feedback_lambda_cloud_gotchas.md.
Drafted 2026-05-27 end-of-session, EPIC-4 P0 finish complete. Intentionally takes positions to give the next session's /hl:red-team pass real targets to attack. Working thesis: Browser-to-S3 presigned upload + SQS queue + Fargate worker. Three alternatives explicitly rejected; load-bearing decisions flagged in § Open Questions for Red-Team.
Risks & Constraints
| Risk / Constraint | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Compute-model cost surprises beta unit-economics | Low | Low | Both live options (Fargate ~$3–8/mo, all-Lambda ~$12/mo) are noise vs the ~$450/mo LLM bill. Budgets alarm regardless. |
| Worker cold start adds lag to first-doc-of-day | Medium | Low | Fargate ~30–60s; all-Lambda ~2–10s. Acceptable at beta either way; source the real Fargate number from the mcp.autri.ai container (D39) before assuming. |
| Worker crashes mid-extraction lose progress | High | Low | Resume from MAX(pages.id); idempotent via UNIQUE(document_id, content_hash) (RT-12). |
| Cancel propagation across worker boundary is racy | Medium | Medium | Worker polls documents.status between units; SELECT-FOR-UPDATE rejects re-enqueued duplicates. Mid-unit kill is v1.1. |
| Lost S3 event leaves doc stuck at pending | Low | Medium | Stuck-pending janitor (RT-18, v1) re-enqueues rows >30 min old. |
| Presigned-URL abuse (arbitrary-bytes dump) | Medium | High | Presign enforces Content-Length-Range [0,100MB] + content-type pin (RT-14). |
| C1 — Anthropic rate limits hit harder without Max-plan cushion | Medium | Medium | Profile CHANGES CLI→SDK (per-key tier, not session — RT-13). Confirm tier in Console (BT-2). Global ceiling is free under all-Lambda (reserved concurrency); v1.1 distributed bucket under Fargate. Circuit breaker (D31) already exists. |
| C2 — Worker IAM creep ("just give it admin") | Medium | High | Scoped to: read uploads bucket, write cache bucket, RDS connect, Secrets Manager read on the Anthropic key only, SQS receive/delete on one queue. No Cognito, no other S3, no Lambda invoke. Dedicated IAM review at implementation (RT-5). |
| C3 — Shared VPC pins worker blast radius to web Lambdas | Low | Low | Same subnets + NAT as web Lambdas (no new $32/mo NAT). Isolation would cost ~$32/mo per separate NAT — not worth it at beta. Named so the coupling is a choice, not an accident (RT-19). |
| C4 — Connector-secret dev-email hardcoding (audit 3.1) becomes irrelevant only when CLI is retired in prod | Low | Low | Cleanup folded into this epic — mcp-servers/doc-search/src/cli/make-token.ts rm'd when the worker stops needing it. |
Overview
Current Status
Drafting → red-teamed (2026-05-27). Surface defined by the audit; data-plane architecture settled; compute runtime is the open blue-team decision. Next: /hl:blue-team to pick the compute model + scope-cut to MVP, then output the requirements doc that anchors implementation.
| Capability | Status | Notes |
|---|---|---|
| Browser-to-S3 presigned upload | ⏳ designed | Bucket exists (NetworkAndData/UploadsBucket); presigner route + CORS + content-length/type constraints (RT-14) pending |
| S3-backed page/cache reads | ⏳ designed | CloudFront /api/cache/* behavior → S3 origin (Lambda out of the read path) |
| Compute worker | ❓ open (BT-1) | Fargate single-worker vs all-Lambda fan-out vs hybrid — see § Why Browser-to-S3 + SQS |
| SQS job queue | ⏳ designed | Standard + documentId idempotency (Q3); DLQ alarmed |
| Anthropic SDK extractor (replaces CLI subprocess) | ⏳ designed | Direct lift of CLI prompts; D12 prod-path. Rate-limit profile changes (BT-2) |
| Idempotency on retry | ⏳ designed | UNIQUE (document_id, content_hash) on chunks (RT-12), Phase-0 migration |
| Stuck-pending janitor | ⏳ designed (v1) | Cron Lambda re-enqueues stuck rows (RT-18) |
| Cancel/retry semantics | ⏳ designed | Per-unit, same shape as today |
| Connector-creation CLI cleanup | ⏳ designed | dev-email + dev-only helpers retired |
The Story
EPIC-4 shipped end of session 2026-05-28: auth lockdown, defense-in-depth at FU origin, multi-tenancy enforcement across all read paths. Then createKb failed in prod. The diagnostic audit pulled the thread and found 13 patterns that work locally but break in the Lambda runtime — and that the problem isn't createKb specifically, it's that the whole ingestion + file-upload arc is fundamentally cloud-incompatible.
The session also produced a category insight (feedback_lambda_cloud_gotchas.md): four structural Lambda gotchas — 6MB request cap, read-only filesystem, no CLI binaries available at runtime, fire-and-forget process death when the response returns — that any web stack built locally needs to audit BEFORE deploy. The audit gave us the surface for autri; this doc designs the architecture to address it.
EPIC-4.5 is the boundary. EPIC-5 ("onboard beta users, collect cost data") presupposes a working ingestion path; mixing the architectural rewrite into EPIC-5 would make it unscope-able.
What Is This Sub-system?
Owned by this sub-system:
- Browser-side upload UX (file picker, progress, presigned-URL fetch + S3 PUT)
- Presigned-URL minter + documents row creator (a route on the Main Lambda)
- Uploads bucket lifecycle (presigned PUT → notification → SQS message → worker pickup)
- Ingestion worker (Fargate task OR per-unit Lambda fan-out — compute model open, BT-1) — render, parse, structure, units, extract, embed, finalize
- Cache bucket — page images, extraction-grouping caches, /api/cache origin
- SQS job queue (+ DLQ), stuck-pending janitor
- Anthropic SDK-based extractor (replaces spawn("claude") for the prod path)
Not owned, but interfaced with:
- RDS Postgres (workers write pages/chunks/documents.status; web Lambdas read)
- Web Lambdas (Main + Chat) — enqueue ingestion jobs + serve presigned URLs but don't ingest
- mcp.autri.ai AgentCore container — read-only consumer of the chunks the worker produces
- Cognito (workers don't touch auth; jobs carry user_id + organization_id from the enqueuing request)
Explicit non-goals for v1:
- Real-time ingestion progress streaming to the browser (poll-based snapshot already works; SSE/WebSockets is v1.1)
- Parallel ingestion of multiple docs in one job (one doc per SQS message; within-doc concurrency runs inside the worker)
- Retry UI ("retry this failed doc") — existing per-doc cancel + re-add covers beta
- Multipart upload >100MB (RT-15 — single-PUT only for v1)
- Multi-region (single-region us-east-1, same as everything else)
Architecture
The Big Idea
Decouple the compute model from the web Lambdas. Web is a request/response surface bounded by CloudFront's 60s timeout and Lambda's 6MB / 15min / RO-FS envelope. Ingestion is minute-to-hour-scale, gigabyte-scale, filesystem-heavy. Forcing both into the same runtime is what produced the 13 audit findings. Pull ingestion out, run it in a runtime suited to long-lived stateful work, and use S3 as the durable handoff.
The browser uploads files DIRECTLY to S3 via presigned URL — the file never traverses any Lambda. The Main Lambda only mints the presign (with content-length + type constraints) and pre-creates the documents row. The worker consumes the queue, reads from S3, writes results to RDS + cache S3, and updates documents.status so the UI's existing poll-based snapshot shows progress with no UI changes.
What "the worker" is remains open (BT-1). Red-team established that the pipeline already decomposes into per-section units (D27), each finishing well under any function timeout. That dissolves the original reason to insist on a long-running Fargate process — a per-unit Lambda fan-out is equally viable and brings a better cold-start profile plus a free global rate-limit ceiling (reserved concurrency). The data-plane shape above is invariant across all three compute options; only the box that says "[compute worker]" changes. See § Why Browser-to-S3 + SQS for the comparison.
Architecture Diagram
┌────────────────────────┐
│ Browser │
└───────────┬────────────┘
│
(1) POST /api/kb/[id]/upload-url { filename, fileSize }
▼
┌───────────────────────┐
│ CloudFront (Main FU) │
└───────────┬───────────┘
▼
┌──────────────────────────────────────┐
│ Main Lambda (VPC) │
│ - D13 access check │
│ - INSERT documents (status=pending) │
│ - mint S3 PUT presign (≤100MB, │
│ content-type pinned) │
└───────────────────┬──────────────────┘
│ returns { uploadUrl, objectKey, documentId }
(2) PUT <presigned-url> + file bytes
▼
┌──────────────────────────┐
│ S3: autri-uploads/ │
│ org/<orgId>/raw/... │
└───────────┬──────────────┘
│
(3) S3 ObjectCreated event → SQS
▼
┌──────────────────────────┐
│ SQS: autri-ingest-jobs │
│ (+ DLQ, alarmed) │
└───────────┬──────────────┘
│ (4) trigger — wiring is BT-1-dependent:
│ • all-Lambda: native event-source mapping
│ • Fargate/hybrid: EventBridge Pipes → ecs:RunTask
▼
┌─────────────────────────────────────────┐
│ [ COMPUTE WORKER — open, BT-1 ] │
│ render → parse → structure → units → │
│ extract (Anthropic SDK) → embed → final │
│ resumes from MAX(pages.id) on retry; │
│ chunks deduped via UNIQUE(doc,content) │
│ ─ Fargate single task, OR │
│ ─ per-unit Lambda fan-out + counter │
└──┬────────────────────────────────┬──────┘
│ (5) page renders + caches │ (6) chunks + pages
▼ ▼ + status updates
┌──────────────────┐ ┌──────────────────┐
│ S3: autri-cache/ │ │ RDS Postgres │
│ org/<orgId>/ │ │ +pgvector │
└────────┬─────────┘ └──────────────────┘
│
(7) CloudFront /api/cache/* origin = S3 (Lambda NOT in path)
▼
┌──────────────────┐
│ Browser │ (page-image render in inspector)
└──────────────────┘
+ stuck-pending janitor (cron Lambda): re-enqueues documents stuck at
status=pending >30min — covers lost S3 events (RT-18)
System Boundary
- Uploads bucket is write-once (presigned PUT for upload), read-many (worker reads, then deletes raw file after successful ingestion). Lifecycle rule: delete after 30 days even if ingestion never completed.
- Cache bucket is the durable artifact store. Page PNGs + parse JSONs + extractor groupings. CloudFront origin for /api/cache/*. Cache-busting via doc-slug + content-hash in object key (existing cache/<slug>/page-NN-text.json shape preserved, just object-stored).
- SQS job queue is the only handoff between web and worker. Web never directly invokes the worker; worker never reads from anything web-shaped.
Component-by-component
S3 — uploads bucket (existing — NetworkAndData/UploadsBucket)
- Versioning: off (raw uploads immutable post-PUT; re-upload is a new object).
- Lifecycle: 30-day expiry on raw uploads (worker consumes within minutes; bucket isn't long-term storage).
- CORS: PUT from https://app.autri.ai only. No public read.
- Event notification → SQS on ObjectCreated:*.
- Object key shape: org/<orgId>/raw/<kbSlug>/<docSlug>.<ext> — namespaced so worker IAM can be org-scoped if we ever do per-org workers.
S3 — cache bucket (new — NetworkAndData/CacheBucket)
- Versioning: off.
- Lifecycle: 90-day expiry on caches whose source doc was deleted (needs a DB-join or S3 inventory query — v1.1 janitor).
- CORS: read from https://app.autri.ai. No public read.
- CloudFront behavior /api/cache/* routes here with cache headers.
- Object key shape: org/<orgId>/<kbSlug>/<docSlug>/page-NN-{text,paragraphs,subchunks,image}.json (mirrors existing local-FS layout — minimal worker code change).
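The cache key shape above can be produced by one small helper shared by the worker (writes) and the inspector (reads). A minimal sketch: cacheObjectKey, cacheUrlPath, and CacheKind are hypothetical names, and the two-digit zero padding for NN is an assumption inferred from the page-01-image.json example used elsewhere in this doc.

```typescript
// The four artifact kinds named in the key shape.
type CacheKind = "text" | "paragraphs" | "subchunks" | "image";

// Builds org/<orgId>/<kbSlug>/<docSlug>/page-NN-<kind>.json.
// Assumption: NN is zero-padded to at least two digits.
function cacheObjectKey(
  orgId: string,
  kbSlug: string,
  docSlug: string,
  page: number,
  kind: CacheKind,
): string {
  if (!Number.isInteger(page) || page < 1) {
    throw new Error(`page must be a positive integer, got ${page}`);
  }
  const nn = page < 10 ? `0${page}` : String(page);
  return `org/${orgId}/${kbSlug}/${docSlug}/page-${nn}-${kind}.json`;
}

// The CloudFront path is the same key behind the /api/cache/ prefix.
function cacheUrlPath(key: string): string {
  return `/api/cache/${key}`;
}
```

Keeping the key builder in one place means the worker and the web app cannot drift apart on the layout that CloudFront serves.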
Main Lambda — presign minter + row creator
- Route: POST /api/kb/[kbId]/upload-url with { filename, fileSize } → returns { uploadUrl, objectKey, documentId }.
- D13-enforced: caller's organization_id must own kbId. Object key is computed server-side from (orgId, kbSlug, docSlug) — client cannot influence the path.
- RT-11 — creates the documents row (status='pending') BEFORE returning, so the row exists for the whole lifecycle and the UI poller always finds it. documentId is returned to the client; the worker resolves the row by object-key lookup on the S3 event.
- RT-14 — presign carries hard constraints: Content-Length-Range [0, 104857600] (100MB) + a content-type pin to the declared upload type (pdf/docx allowlist). Closes the "arbitrary-bytes dump" hole at the presign, not just at downstream worker validation. Files >100MB are rejected here (RT-15 — single-PUT only for v1).
- IAM: s3:PutObject on uploads bucket under org/${user.organizationId}/raw/* ONLY (path-scoped condition).
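The server-side checks this route performs before minting the presign can be sketched as a pure function. This is an illustration, not the actual route: planUpload, slugify, and the UploadRequest/UploadPlan types are hypothetical, and the slug rule is an assumption. Only the 100MB cap, the pdf/docx allowlist, and the org/<orgId>/raw/<kbSlug>/<docSlug>.<ext> key shape come from the bullets above. One thing to verify at implementation: S3 documents content-length-range as a condition of browser POST policies, so a presigned PUT may need to pin an exact signed Content-Length instead (the client already declares fileSize), or the route may need to mint a presigned POST.

```typescript
// Hypothetical request/plan types; field names follow the route description.
interface UploadRequest {
  filename: string;
  fileSize: number; // bytes, as declared by the client
}

interface UploadPlan {
  objectKey: string;
  contentType: string;
  maxBytes: number;
}

const MAX_UPLOAD_BYTES = 104857600; // 100MB single-PUT cap (RT-14 / RT-15)

// pdf/docx allowlist mapped to their registered MIME types.
const CONTENT_TYPES = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
]);

// Assumed slug rule: lowercase, runs of non-alphanumerics collapsed to "-".
function slugify(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Validates the declared upload and derives the key entirely server-side,
// so the client never supplies any part of the S3 path.
function planUpload(orgId: string, kbSlug: string, req: UploadRequest): UploadPlan {
  const dot = req.filename.lastIndexOf(".");
  if (dot <= 0) throw new Error("filename needs a name and an extension");
  const ext = req.filename.slice(dot + 1).toLowerCase();
  const contentType = CONTENT_TYPES.get(ext);
  if (contentType === undefined) throw new Error(`unsupported type: ${ext}`);
  if (!Number.isInteger(req.fileSize) || req.fileSize < 0 || req.fileSize > MAX_UPLOAD_BYTES) {
    throw new Error("fileSize must be within [0, 100MB]");
  }
  const docSlug = slugify(req.filename.slice(0, dot));
  if (docSlug === "") throw new Error("filename has no usable characters");
  return {
    objectKey: `org/${orgId}/raw/${kbSlug}/${docSlug}.${ext}`,
    contentType,
    maxBytes: MAX_UPLOAD_BYTES,
  };
}
```

The returned contentType and maxBytes are what the presign step would then bind into the signed request; the allowlist is a Map rather than a plain object so that extensions such as "constructor" cannot match inherited properties.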
SQS — autri-ingest-jobs queue
- Q3 — Standard queue + documentId idempotency in the message body. Worker SELECT-FOR-UPDATE on documents.status='pending' before processing dedupes same-doc duplicates without FIFO's throughput cap.
- Visibility timeout: 60 min (matches expected longest doc).
- Retry: 3 receives before DLQ. DLQ: autri-ingest-jobs-dlq, alarmed. Replay (RT-17): documented AWS CLI one-liner in the deploy runbook (receive-message from DLQ → send-message to main queue); tool-build deferred to v1.1.
- Message shape: { documentId, organizationId, kbId, sourceObjectKey, sourceExt, extractorModel } — small (~300 bytes).
- RT-10 — the SQS→worker trigger is BT-1-dependent and intentionally unwired here. All-Lambda = native event-source mapping; Fargate/hybrid = EventBridge Pipes (SQS→ECS target) or SQS-triggered Lambda → ecs:RunTask.
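Whichever runtime consumes the queue, the first thing it does with a message body is check that it matches the job shape. A minimal sketch of that check follows. The field names come from the message shape above; treating every field as a non-empty string is an assumption of this sketch, and IngestJob and parseIngestJob are hypothetical names.

```typescript
// Field names come from the documented message shape; the string types are assumed.
interface IngestJob {
  documentId: string;
  organizationId: string;
  kbId: string;
  sourceObjectKey: string;
  sourceExt: string;
  extractorModel: string;
}

const JOB_FIELDS = [
  "documentId",
  "organizationId",
  "kbId",
  "sourceObjectKey",
  "sourceExt",
  "extractorModel",
] as const;

// Parses an SQS message body. Returns null for anything malformed so the
// caller can decide whether to drop it or let it fall through to the DLQ.
function parseIngestJob(body: string): IngestJob | null {
  let raw: unknown;
  try {
    raw = JSON.parse(body);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const rec = raw as Record<string, unknown>;
  const out: Partial<Record<(typeof JOB_FIELDS)[number], string>> = {};
  for (const field of JOB_FIELDS) {
    const value = rec[field];
    if (typeof value !== "string" || value.length === 0) return null;
    out[field] = value;
  }
  return out as IngestJob;
}
```

Copying only the known fields means extra properties on a message are ignored rather than carried into the pipeline.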
[COMPUTE WORKER] — open decision (BT-1)
The component that consumes the queue and runs render → parse → structure → units → extract → embed → finalize. Three live shapes (full comparison in § Why Browser-to-S3 + SQS):
- Fargate single task (on-demand run-task): in-process pipeline, D27's --concurrency N semaphore for within-doc parallelism. Simplest fan-in (a loop). 30–60s cold start; no global Anthropic ceiling.
- All-Lambda per-unit fan-out: parse/structure/units as sequential per-doc Lambdas; extract fanned out one-Lambda-per-unit with reserved concurrency as a free global rate-limit ceiling; a distributed completion counter (UPDATE … RETURNING remaining_units, last-one-done → finalize) handles fan-in. ~2–10s cold start; ~5× per-doc cost (trivial at beta).
- Hybrid: Lambda for the light stages, Fargate for extract. Best-of-both on paper; two runtimes' worth of operational surface.
Common to all: reads uploads S3, writes cache S3 + RDS, resumes from MAX(pages.id) on retry (idempotent via RT-12's unique constraint), polls documents.status for cancel between units.
- IAM (RT-5): read uploads bucket org/*/raw/*, write cache bucket org/*, RDS connect, Secrets Manager read on the anthropic-api-key secret ONLY, SQS receive+delete on autri-ingest-jobs only. No Cognito, no Lambda invoke, no other S3. Blast radius of an RCE'd worker: cross-org S3 read within uploads+cache (org-key namespacing is advisory, not IAM-enforced — a hardening item), plus RDS at the app role's level. Worth a dedicated IAM review at implementation.
- VPC (RT-19): same private-with-egress subnets as web Lambdas — shares their NAT (no new $32/mo NAT) but pins the worker's network blast radius to theirs. A separate isolating VPC adds ~$32/mo per NAT; not worth it at beta. Named so the coupling is explicit.
- Resource sizing (if Fargate): 2 vCPU, 4 GB RAM — render is the memory peak (page images held while parsing). Revisit with telemetry.
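The fan-in counter named for the all-Lambda shape can be modelled in a few lines. This is an in-memory sketch of the logic only: a Map stands in for the documents table, and decrementRemaining mirrors an UPDATE … RETURNING remaining_units statement whose exact SQL is an assumption here. In Postgres the row lock is what makes the decrement atomic across concurrent Lambdas; the single-threaded Map plays that role below. DocProgress, decrementRemaining, and onUnitDone are hypothetical names.

```typescript
// In-memory stand-in for one documents row; remainingUnits mirrors the
// remaining_units column named in the text.
interface DocProgress {
  remainingUnits: number;
}

// Models something like:
//   UPDATE documents SET remaining_units = remaining_units - 1
//   WHERE id = $1 AND remaining_units > 0
//   RETURNING remaining_units
// Returns null when no row was updated.
function decrementRemaining(table: Map<string, DocProgress>, documentId: string): number | null {
  const row = table.get(documentId);
  if (row === undefined || row.remainingUnits <= 0) return null;
  row.remainingUnits -= 1;
  return row.remainingUnits;
}

// Called by each extract invocation after its chunks are written.
// Exactly one caller observes 0 and therefore runs finalize.
function onUnitDone(
  table: Map<string, DocProgress>,
  documentId: string,
  finalize: (documentId: string) => void,
): boolean {
  const remaining = decrementRemaining(table, documentId);
  if (remaining !== 0) return false;
  finalize(documentId);
  return true;
}
```

One gap this sketch leaves open: SQS delivers at least once, so a unit that is processed twice would decrement twice unless the decrement is tied to the unit not having been recorded before. That guard is part of the coordination complexity the all-Lambda option carries.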
Anthropic SDK extractor (replaces CLI subprocess)
- Per D12 prod-path split. The CLI extractor exists for Max-plan billing in dev; prod uses the Anthropic API (NOT Bedrock yet — that's D16, scheduled for week 3 post-cutover).
- Lift the existing ingestion/extractor/cli-client.ts interface; implement the Anthropic SDK side. Same prompts, same tool-use loop, same JSON output. Replace spawn("claude") + stdout JSON envelopes with client.messages.create({...}) + iterate tool_use blocks via stop_reason. D31 circuit breaker ports cleanly (rate-limit detection moves from stdout parsing to SDK error codes).
- RT-13 — rate-limit profile CHANGES. The CLI rides Max-plan SESSION limits; the SDK rides per-API-key TIER limits (e.g. Tier 1 ≈ 50 req/min Sonnet). Action: confirm autri's API tier in the Anthropic Console — that number bounds safe concurrency. Beta risk accepted; the global-ceiling mechanism is BT-1-dependent.
Stuck-pending janitor (RT-18, v1)
- Cron Lambda (~50 LoC, EventBridge schedule every ~10 min): SELECT id FROM documents WHERE status='pending' AND created_at < NOW() - INTERVAL '30 min' → re-enqueue to SQS. Covers the lost-S3-event silent-failure mode and reaps orphan rows from presigns that never uploaded (no S3 object → worker marks failed).
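The janitor's selection rule is small enough to state as a pure function, which also makes the 30-minute boundary easy to test. A sketch under assumptions: DocumentRow, findStuckPending, and janitorTick are hypothetical names, the rows would really come from the SELECT above, and enqueue stands in for the SQS send.

```typescript
// Minimal view of a documents row for this check.
interface DocumentRow {
  id: string;
  status: string;
  createdAt: Date;
}

const STUCK_AFTER_MS = 30 * 60 * 1000; // the 30-minute threshold from the query

// Pure equivalent of the SELECT: ids of rows still pending past the threshold.
function findStuckPending(rows: DocumentRow[], now: Date): string[] {
  const cutoff = now.getTime() - STUCK_AFTER_MS;
  const stuck: string[] = [];
  for (const row of rows) {
    if (row.status === "pending" && row.createdAt.getTime() < cutoff) {
      stuck.push(row.id);
    }
  }
  return stuck;
}

// One janitor tick: re-enqueue every stuck document through an injected sender.
function janitorTick(
  rows: DocumentRow[],
  now: Date,
  enqueue: (documentId: string) => void,
): number {
  const stuck = findStuckPending(rows, now);
  for (const id of stuck) enqueue(id);
  return stuck.length;
}
```

A row that stays pending is re-enqueued on every tick, so the worker-side claim on documents.status is what keeps those repeats harmless.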
Request Lifecycle
Upload + ingest path:
- User selects file in KbCreateWizard or AddDocumentsDialog.
- Client POSTs /api/kb/[kbId]/upload-url with { filename, fileSize }.
- Main Lambda validates D13 access, creates the documents row (status='pending', RT-11), computes objectKey, mints a presign with 60-min expiry + Content-Length-Range [0,100MB] + content-type pin (RT-14), returns { uploadUrl, objectKey, documentId }. Files >100MB are rejected here (RT-15 — single-PUT only for v1).
- Client uploads the file directly to S3 via the presigned URL.
- S3 fires ObjectCreated → SQS message lands in autri-ingest-jobs.
- Trigger fires the compute worker (mechanism is BT-1-dependent — see Component-by-component).
- Worker SELECT-FOR-UPDATEs the document row, runs the pipeline, writes results, deletes the message, deletes the raw upload.
- Browser's existing pipeline-status poller (getPipelineSnapshot) shows progress — no UI changes.
- If an S3 event is lost and the row sits at pending >30 min, the janitor (RT-18) re-enqueues it.
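The browser half of this path (request the presign, then PUT the bytes) can be sketched as one function. The HTTP call is injected so the sketch runs outside a browser; in the app it would be a thin wrapper over fetch. uploadDocument, HttpCall, HttpResponse, and UploadFile are hypothetical names; the route and the response fields are the ones listed above.

```typescript
// Response shape of POST /api/kb/[kbId]/upload-url, per the steps above.
interface PresignResponse {
  uploadUrl: string;
  objectKey: string;
  documentId: string;
}

// Narrow stand-ins for fetch and File.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type HttpCall = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: unknown },
) => Promise<HttpResponse>;
interface UploadFile {
  name: string;
  size: number;
  type: string;
  body: unknown;
}

async function uploadDocument(http: HttpCall, kbId: string, file: UploadFile): Promise<string> {
  // Ask the Main Lambda for a presigned URL; this also creates the pending row.
  const presignRes = await http(`/api/kb/${kbId}/upload-url`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ filename: file.name, fileSize: file.size }),
  });
  if (!presignRes.ok) throw new Error(`presign failed: ${presignRes.status}`);
  const data = (await presignRes.json()) as Partial<PresignResponse>;
  if (typeof data.uploadUrl !== "string" || typeof data.documentId !== "string") {
    throw new Error("presign response is missing uploadUrl or documentId");
  }
  // Send the bytes straight to S3. The content type must match the pinned one.
  const putRes = await http(data.uploadUrl, {
    method: "PUT",
    headers: { "content-type": file.type },
    body: file.body,
  });
  if (!putRes.ok) throw new Error(`upload failed: ${putRes.status}`);
  // The caller polls pipeline status with this id.
  return data.documentId;
}
```

A failed PUT surfaces as a thrown error and leaves the pending row behind, which is the orphan case the janitor is described as reaping.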
Cache read path:
- Inspector renders, fetches /api/cache/org/<orgId>/<kbSlug>/<docSlug>/page-01-image.json.
- CloudFront /api/cache/* behavior routes to the cache S3 bucket origin (NOT a Lambda).
- S3 serves the cached render directly. CloudFront caches with Cache-Control: max-age=86400.
- Lambda is NOT in the read path — one of the biggest cost + latency wins of the rewrite.
Key Interfaces
| Producer | Consumer | Interface | Shape |
|---|---|---|---|
| Browser | Main Lambda | POST /api/kb/[kbId]/upload-url | { filename, fileSize } → { uploadUrl, objectKey, documentId } |
| Browser | S3 uploads | Presigned PUT (≤100MB, content-type pinned) | binary file body |
| S3 | SQS | ObjectCreated event notification | S3 event JSON |
| SQS | Compute worker | BT-1-dependent: native event-source mapping (Lambda) OR EventBridge Pipes → ecs:RunTask (Fargate/hybrid) | job message |
| Worker | RDS | INSERT/UPDATE documents, pages, chunks | Drizzle schema (+ UNIQUE(document_id, content_hash), RT-12) |
| Worker | S3 cache | PutObject | page PNGs + parse + groupings |
| CloudFront /api/cache/* | S3 cache | GET origin | binary / JSON |
Build & Deploy
Build artifacts
- app/ (Next.js) — adds the presign + row-creation endpoint; drops local-FS write paths (env-conditional per RT-20). Otherwise unchanged.
- ingestion-worker/ (new package) — bundles @autri/retrieval + ingestion code + Anthropic SDK extractor. Packaged as a Fargate container image OR one-or-more Lambda functions (container/zip) per BT-1. Dockerfile mirrors mcp-servers/doc-search/Dockerfile either way.
- autri-infra — adds the ingestion constructs: SQS queue + DLQ, cache bucket, the compute target (TaskDefinition or Lambda functions per BT-1), the SQS→worker trigger, the stuck-pending janitor, IAM roles, and the chunks unique-constraint migration.
CDK provisioning vs deploy script split
- CDK provisions: SQS queues, cache bucket, the compute target + trigger (BT-1-shaped), the janitor Lambda, IAM, and the Phase-0 chunks migration. (The uploads bucket already exists per NetworkAndData; the cache bucket is new.)
- scripts/deploy-worker.sh (new) builds + pushes the worker artifact, bumps its revision (CDK context var per D40 pattern). Shape depends on BT-1 (image push for Fargate; function update for Lambda).
- Web stack's scripts/deploy-web.sh continues to deploy Main + Chat without touching the worker.
Deploy phasing (first-deploy bootstrap)
- Phase 0: CDK deploys new constructs (SQS queue + DLQ, IAM, cache bucket, compute-trigger wiring per BT-1) + bumps Main Lambda IAM to mint presigns + create rows. Migration: ALTER TABLE chunks ADD CONSTRAINT chunks_doc_content_uniq UNIQUE (document_id, content_hash) (RT-12). Deploy the stuck-pending janitor (RT-18).
- Phase 1: Build + push the worker artifact (Fargate image or Lambda container/zip per BT-1); register it. ECS/Lambda reads it on next invocation.
- Phase 2: Migrate /api/cache CloudFront behavior to S3 origin (one CloudFront invalidation; cache bucket is empty so cold loads regenerate from worker output).
- Phase 3: Cut Main Lambda's stageFiles + fire-and-forget runIngestionPipeline; replace with enqueueIngestion(documentId) (insert row + write SQS message), behind INGESTION_RUNTIME=worker|inline (Q6).
Each phase is independently deployable + reversible. The bootstrap chicken-and-egg (TaskDefinition/Lambda needs an image URI that doesn't exist until built) follows the existing Main Lambda placeholder pattern in lib/web/lambdas.ts:208-222 (RT-6).
Rollback Strategy
- TaskDefinition rollback: point ECS at previous TaskDefinition revision (single AWS call).
- Code-path rollback: Phase 3's "enqueue vs runInline" is feature-flagged via env var INGESTION_RUNTIME=worker|inline on Main Lambda. Flipping back to inline disables the queue path (until we cut the inline code permanently in a follow-up).
- Data rollback: none required — the only schema change is the additive RT-12 unique constraint; cache S3 is regeneratable from raw uploads or by re-ingesting.
Cost Shape
Idle (zero traffic)
- SQS queue: ~$0 (charged per request; idle = zero requests).
- Cache S3: $0.023/GB-month. 10 GB beta cache = $0.23/mo.
- Uploads S3: $0.023/GB-month, but lifecycle expires raw after 30 days. Steady state ~5 GB = $0.12/mo.
- ECR worker image (Fargate/hybrid): $0.10/GB-month per repo. ~1 GB compressed = $0.10/mo. (All-Lambda container images: same order.)
- Compute: $0 idle under both live options (on-demand run-task or event-driven Lambda).
- EventBridge: $1.00 per million events; idle = ~$0.
- CloudWatch Logs: $0.50/GB ingested; idle = ~$0. (Retention set to 30 days — see Cross-Cutting > Observability.)
Total idle add-on: ~$1/mo on top of existing W3 idle floor.
Beta load (10 docs/day across 10 users)
Compute (the BT-1 decision drives this):
| Compute model | Per-doc | Monthly (~300 docs) |
|---|---|---|
| Fargate on-demand (2 vCPU, 4 GB, ~5 min/doc) | ~$0.008–0.01 | ~$3–8/mo |
| All-Lambda per-unit fan-out | ~$0.04 | ~$12/mo |
Lambda costs ~5× more per doc because it bills GB-seconds for LLM-I/O wait across every concurrent unit invocation; Fargate's flat task rate doesn't multiply with internal concurrency. The ~$4–9/mo delta is noise next to the LLM bill below.
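The Fargate per-doc figure in the table can be re-derived from the task sizing. This sketch assumes us-east-1 Linux/x86 on-demand Fargate list prices of $0.04048 per vCPU-hour and $0.004445 per GB-hour; those two rates are assumptions of the sketch, not numbers from this doc, and should be checked against the current price list. fargateCostPerDoc is a hypothetical helper.

```typescript
// Assumed us-east-1 on-demand Fargate rates (Linux/x86); verify before relying on them.
const FARGATE_VCPU_HOUR_USD = 0.04048;
const FARGATE_GB_HOUR_USD = 0.004445;

// Cost of one task of the given size running for `minutes`.
function fargateCostPerDoc(vcpu: number, memoryGb: number, minutes: number): number {
  const hourlyUsd = vcpu * FARGATE_VCPU_HOUR_USD + memoryGb * FARGATE_GB_HOUR_USD;
  return hourlyUsd * (minutes / 60);
}

// Sizing from the table: 2 vCPU, 4 GB, ~5 min per doc, ~300 docs per month.
const perDocUsd = fargateCostPerDoc(2, 4, 5);
const perMonthUsd = perDocUsd * 300;
```

Under these rates the per-doc cost lands at roughly $0.008, matching the low end of the table. The raw monthly product is about $2.50, so the ~$3–8/mo range presumably budgets for task-launch time and retries on top of pure processing minutes.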
Rest of the data plane (compute-independent):
- S3 storage: ~20 GB cache + uploads ≈ $0.50/mo.
- S3 PUT/GET: ~$1/mo.
- CloudFront egress for /api/cache/*: ~3 GB/mo (10 docs × ~50 pages × ~200 KB renders × 30 days) × $0.085/GB ≈ ~$0.25/mo.
- NAT data egress (Anthropic + OpenAI): ~10 GB/mo ≈ $0.45/mo (uses the existing shared NAT; D16's Bedrock-via-VPC-endpoint would skip NAT later).
- SQS + EventBridge + CloudWatch: ~$1/mo.
Total data-plane add-on (excluding LLM): ~$8–15/mo on top of the W3 beta floor, depending on compute model.
LLM extraction (the dominant axis, architecture-independent): ~$0.005/chunk × ~300 chunks/doc × 300 docs/mo ≈ ~$450/mo. This is ~30–45× the infrastructure cost — every architecture decision in this doc is rounding error against it (per D18).
Hypothetical 1k MAU
- Compute scales linearly with doc throughput; the Fargate-vs-Lambda per-doc gap widens to ~$400–900/mo at ~30k docs/mo (the ~5× premium becomes material) — but both are still dwarfed by the ~$45k/mo Anthropic spend, and the hot path would be re-optimized before then. This is where BT-1's cost axis finally matters; at beta it doesn't.
- The bottleneck becomes Anthropic API quota long before ECS task / Lambda concurrency limits.
- S3 storage grows with cache footprint; if it crosses ~1 TB, consider lifecycle transitions to S3 Standard-IA or Glacier Instant Retrieval on caches older than 90 days.
- SQS scales without intervention; FIFO would cap throughput at 300 msg/s per API action (without batching), but we'd be Standard.
Unit Economics & Cost Model
The only material cost is the LLM extract call. Every other stage — render, parse, structure, units, load, embed, link, finalize — is CPU/IO or a cheap embeddings call (~$0.005/doc for OpenAI text-embedding-3-small); combined they're fractions of a cent per doc. Infra (compute pattern, S3, SQS, CloudFront) adds ~$8–15/mo regardless. So unit economics ≈ the cost of the per-section Haiku call, full stop — and the compute-pattern decision (BT-1) is cost-neutral on this axis.
Validation stage (resolved 2026-05-28): planned to be cut or refactored to a non-LLM heuristic gate, so it does NOT add a second model pass. (If that reverses and validation becomes a real per-page/per-section agent re-check, extract cost roughly doubles — flagged so the decision stays visible.)
Measured cost (Dan's runs, via the CLI agentic loop):
| Source type | Per section (unit) | Per chunk | Why |
|---|---|---|---|
| PDF / vision (Genesis: 45 units → 371 chunks, $3.36) | ~$0.075 | ~$0.009 | page images attached to the call |
| Prose / text (novel: 26 chapters → 312 chunks, $0.60) | ~$0.023 | ~$0.002 | text-only atoms, fewer LLM turns |
The natural billing atom is the section (unit) — one LLM call per section, producing ~8–12 chunks. The 4–5× PDF-vs-prose gap is the page images.
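The per-section and per-chunk figures in the table follow directly from the two measured runs. A short worked derivation, using only the totals quoted above; MeasuredRun and unitEconomics are hypothetical names.

```typescript
// One measured ingestion run: totals as quoted in the table.
interface MeasuredRun {
  units: number;
  chunks: number;
  totalUsd: number;
}

const genesis: MeasuredRun = { units: 45, chunks: 371, totalUsd: 3.36 }; // PDF / vision
const novel: MeasuredRun = { units: 26, chunks: 312, totalUsd: 0.6 }; // prose / text

// Derives the three per-run ratios the table and the paragraph rely on.
function unitEconomics(run: MeasuredRun): {
  perUnitUsd: number;
  perChunkUsd: number;
  chunksPerUnit: number;
} {
  return {
    perUnitUsd: run.totalUsd / run.units,
    perChunkUsd: run.totalUsd / run.chunks,
    chunksPerUnit: run.chunks / run.units,
  };
}

const pdf = unitEconomics(genesis);
const prose = unitEconomics(novel);

// PDF-vs-prose gap on each axis.
const perChunkGap = pdf.perChunkUsd / prose.perChunkUsd;
const perUnitGap = pdf.perUnitUsd / prose.perUnitUsd;
```

The per-chunk gap works out to about 4.7×, which is the 4–5× figure quoted. The per-unit gap is smaller, about 3.2×, because the prose run packs more chunks into each section (12 against roughly 8).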
Cost levers (cheapest first):
- CLI → SDK direct (already the EPIC-4.5 plan, Q4): kills the agentic Read-tool round-trips. Dan's D27 estimate: −30–50% → PDF ~$0.005, prose ~$0.001/chunk. Free; happening anyway.
- Prompt caching (D16 Y1 must-ship): the grouping prompt is static across every section of every doc; cached input bills ~10%. Net ~10–20% off, helps prose more (no images in input), and grows with volume as the cache stays warm across docs.
- Batch API (−50%) — situational: ingestion is async, but the user watches a progress bar, so 24h batch latency only fits a future "bulk import, come back later" mode — not the interactive path.
- Model choice / self-host — premature: vision is the blocker for self-hosting (PDF needs a capable vision model; a Mac mini can't serve one at throughput), and extraction quality is the product. Graduated path: API+Haiku → caching → Bedrock (D16) → maybe a cheaper/fine-tuned text model for prose grouping at high volume (vision keeps the capable model). Economics only flip at sustained high volume, on cloud GPU — not at beta.
The profitability shape — two facts that reframe the D18 worry:
- Extraction is ONE-TIME; revenue is RECURRING. A user ingests once, then queries for months. At SDK+prose rates a typical manuscript (~300 chunks) is ~$0.45 one-time vs $10×N months. Even a cap-saturating author (30k chunks) is ~$45 one-time — profitable past ~5 months retention. The only loss case is saturate-the-cap-then-churn-in-month-1.
- Chunks-stored ≠ chunks-extracted. A stored chunk is a cheap row + a ~6KB vector; 30k chunks ≈ 180 MB — near-free to store, cheap to query. The cost is the one-time extraction event, not the resting count. D18's single "chunks" axis conflates a value/storage axis with a cost axis.
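Both facts reduce to simple arithmetic, sketched below. The per-chunk rate of $0.0015 is not stated directly; it is the rate implied by the ~$0.45 per ~300 chunks figure, and it is labelled as derived in the code. The $10 monthly revenue and the ~6 KB vector size are the figures used above. All function names are hypothetical.

```typescript
// Rate implied by "~300 chunks is ~$0.45" (derived, not quoted directly).
const IMPLIED_PER_CHUNK_USD = 0.45 / 300;

// One-time extraction cost for a corpus of `chunks` chunks.
function extractionCostUsd(chunks: number, perChunkUsd: number): number {
  return chunks * perChunkUsd;
}

// Months of subscription revenue needed to cover a one-time cost.
function breakEvenMonths(oneTimeUsd: number, monthlyRevenueUsd: number): number {
  return oneTimeUsd / monthlyRevenueUsd;
}

// Resting storage for the vectors alone, in decimal megabytes.
function vectorStorageMb(chunks: number, bytesPerVector: number): number {
  return (chunks * bytesPerVector) / 1e6;
}

// Cap-saturating author: 30k chunks at the implied rate, $10/mo revenue.
const saturatingCostUsd = extractionCostUsd(30000, IMPLIED_PER_CHUNK_USD);
const saturatingBreakEven = breakEvenMonths(saturatingCostUsd, 10);
const saturatingStorageMb = vectorStorageMb(30000, 6000);
```

The break-even comes out at about 4.5 months, which is where "profitable past ~5 months" comes from, and the storage figure is the 180 MB quoted above.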
Implication for pricing (refines D18; validate in EPIC-5): the cost driver is new extraction volume per month, not total chunks stored. This reconciles the competitive tension Dan raised — storage is cheap to give away, so we can offer roomy KB caps to stay competitive while metering the thing that actually costs money: ingestion throughput. Candidate model: generous stored-chunk caps (value axis) + a monthly new-ingestion budget or metered overage (cost axis), rather than one conflated chunk cap.
Open worry (Dan, 2026-05-28): users may upload aggressively; stingy storage caps could push them to a competitor. The mitigant above (generous storage, metered ingestion) is the working hypothesis — but real upload behavior is unknown until beta. Tracked for EPIC-5 cost telemetry; do not lock pricing before then.
Failure Modes
Worker container fails to pull image (Fargate/hybrid). Trigger has retry-on-failure; after 3 fails the SQS message goes to DLQ + CloudWatch alarm. Recovery: fix ECR repo policy or image tag; replay DLQ (runbook one-liner, RT-17).
Anthropic API rate-limit during extraction. Circuit breaker (D31) trips, worker exits cleanly. SQS visibility timeout expires, message redelivers up to 3× before DLQ. Workers across docs are independent. Note (RT-16): there is no GLOBAL concurrency ceiling under Fargate (per-process semaphore only) — under all-Lambda, reserved concurrency provides one for free. A distributed token bucket is the v1.1 answer if Fargate wins BT-1 and volume climbs.
Worker crashes mid-extraction (OOM, container restart, task abort). Message returns to queue after visibility timeout. On retry, worker reads documents.status + pages already written; resumes from MAX(pages.id). Idempotent because chunks carry a UNIQUE (document_id, content_hash) constraint (RT-12) — re-processed units upsert rather than duplicate.
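The "re-processed units upsert rather than duplicate" behaviour can be modelled with a keyed map. This is an in-memory sketch, with the Map key standing in for the UNIQUE (document_id, content_hash) constraint. Using a do-nothing-on-conflict insert is an assumption about how the worker would write; ChunkTable and insertIgnoringDuplicates are hypothetical names.

```typescript
// Minimal chunk row for this model.
interface Chunk {
  documentId: string;
  contentHash: string;
  text: string;
}

// In-memory stand-in for the chunks table.
class ChunkTable {
  private rows = new Map<string, Chunk>();

  // Mirrors an INSERT with a do-nothing conflict clause on
  // (document_id, content_hash). Returns how many rows were actually new.
  insertIgnoringDuplicates(chunks: Chunk[]): number {
    let inserted = 0;
    for (const chunk of chunks) {
      const key = JSON.stringify([chunk.documentId, chunk.contentHash]);
      if (!this.rows.has(key)) {
        this.rows.set(key, chunk);
        inserted += 1;
      }
    }
    return inserted;
  }

  count(): number {
    return this.rows.size;
  }
}
```

Worth noting for the implementation: the constraint on its own turns a replayed insert into a duplicate-key error, so the worker's insert statement needs a conflict clause for the retry to be a quiet no-op.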
Browser-S3 upload fails partway. v1 is single-PUT (≤100MB); a failed PUT just means the client retries the whole upload. No partial state in the bucket (multipart deferred to v1.1, RT-15).
SQS message landed but never picked up (capacity exhausted). Visibility timeout + retry; cluster/concurrency scales out. CloudWatch alarm on queue depth >100 for >5 min.
Worker writes to RDS during Multi-AZ failover. Drizzle connection pool re-establishes; worker retries the transaction. SQS visibility timeout + retry covers anything that fails outright.
S3 event → SQS notification dropped. S3 guarantees at-least-once, so this effectively doesn't happen — but if it does, the documents.status='pending' row is reaped by the stuck-pending janitor (RT-18, v1), which re-enqueues rows >30 min old. This is the safety net that makes the lost-event case non-fatal.
Cache bucket misconfigured (CORS reject) — pages don't load in inspector. One CORS-fix + CloudFront invalidation is the recovery. Origin failover not in scope for v1.
Single-region (us-east-1) outage. Whole-beta outage. Explicitly accepted for a ≤10-user beta; multi-region is out of scope.
Why Browser-to-S3 + SQS, With Compute Runtime Left Open
The DATA plane is settled; the COMPUTE plane is the open blue-team decision.
Settled (the data plane): browser uploads directly to S3 via presigned PUT → S3 ObjectCreated event → SQS → [compute worker] → RDS + cache S3 → CloudFront serves /api/cache/* from S3. This bypasses the Lambda 6MB request cap and read-only filesystem in one stroke, regardless of which compute model consumes the queue. No alternative to this shape survived red-team — it's the standard, well-understood pattern.
Open (the compute plane): what consumes the SQS message and runs render → parse → structure → units → extract → embed → finalize. Red-team (2026-05-27) demoted the original "Fargate worker" thesis from chosen to one of three live options, because the pipeline already decomposes into per-section units (D27) that each finish well under any function timeout — which dissolves the 15-min-cap argument that was the main reason to prefer a long-running worker.
Live compute options (blue-team decides — BT-1)
| Option | Idle cost | Cold start | Per-doc cost (beta) | Anthropic rate-limit control | Added complexity | Verdict |
|---|---|---|---|---|---|---|
| Fargate single worker (on-demand run-task) | ~$1/mo | 30–60s task launch | ~$0.008–0.01 | Per-process semaphore (D27); no global ceiling — cross-task collision is a v1.1 distributed-bucket problem | Lowest — in-process loop, one container, fan-in is trivial | Live — simplest, but worst cold-start + no free global rate limit |
| All-Lambda (per-unit extract fan-out) | ~$0/mo | ~2–10s (container Lambda) | ~$0.04 | Reserved concurrency = free global ceiling (cap extract-Lambda at N; SQS holds the rest) | Highest — fan-in needs a distributed completion counter (UPDATE … RETURNING remaining_units, last-unit-done → finalize) | Live — best cold-start + free rate limit; ~5× per-doc cost (trivial at beta); most coordination complexity |
| Hybrid (Lambda for parse/structure/units/embed/finalize, Fargate for extract) | ~$1/mo | mixed | between the two | Fargate semaphore for extract; same v1.1 gap | Two runtimes, two deploy pipelines, two cold-start profiles | Live — best-of-both on paper, doubles operational surface |
Why Lambda costs ~5× more per doc: extract is LLM-I/O-bound — the function bills GB-seconds while waiting on Haiku, multiplied across every concurrent unit invocation. Fargate's flat task rate doesn't multiply with internal concurrency. The premium is ~$4–9/mo at beta but climbs to ~$400–900/mo at 1k MAU — still dwarfed by the ~$45k/mo Anthropic spend at that volume, and the hot path would be re-optimized before then.
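The premium figures above follow from the monthly estimates in the Risks table. A small worked sketch of that arithmetic, assuming (this is an assumption, not a measured curve) that moving from beta to 1k MAU multiplies every line item by a flat 100×:

```typescript
// Monthly figures quoted in this doc (Risks table and the paragraph above).
const fargateMonthlyUsd = { low: 3, high: 8 }; // Fargate ~$3-8/mo at beta
const lambdaMonthlyUsd = 12; // all-Lambda ~$12/mo at beta
const llmMonthlyUsdBeta = 450; // ~$450/mo LLM bill at beta

// Assumption: beta -> 1k MAU is a flat 100x multiplier on every line item.
const SCALE_TO_1K_MAU = 100;

interface UsdRange {
  low: number;
  high: number;
}

// Premium of all-Lambda over Fargate: smallest against the high Fargate
// estimate, largest against the low one.
function lambdaPremium(fargate: UsdRange, lambda: number): UsdRange {
  return { low: lambda - fargate.high, high: lambda - fargate.low };
}

function scaleRange(range: UsdRange, factor: number): UsdRange {
  return { low: range.low * factor, high: range.high * factor };
}

const premiumBeta = lambdaPremium(fargateMonthlyUsd, lambdaMonthlyUsd); // $4-9/mo
const premiumAt1k = scaleRange(premiumBeta, SCALE_TO_1K_MAU); // $400-900/mo
const llmAt1k = llmMonthlyUsdBeta * SCALE_TO_1K_MAU; // $45,000/mo
const worstCaseShareOfLlm = premiumAt1k.high / llmAt1k; // 0.02
```

Under that linear assumption the worst-case compute premium stays at about 2% of LLM spend at both scales, which is the sense in which it is "dwarfed".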
Coupled decision (RT-10): the SQS→worker trigger depends on this choice. All-Lambda uses a native SQS event-source mapping (no EventBridge, no ecs:RunTask). Fargate/hybrid use EventBridge Pipes (SQS→ECS target) or an SQS-triggered Lambda calling ecs:RunTask. Don't wire this until the compute model is picked.
Rejected outright (did not survive red-team)
| Option | Why rejected |
|---|---|
| AgentCore Tools | AgentCore is for agent-driven invocations; ingestion is a fixed pipeline, not agent-shaped. Forcing it over-couples to the MCP runtime. The compound-benefit argument (project_autri_mcp_wedge.md) doesn't apply — ingestion writes the chunks the MCP reads, but isn't itself an MCP concern. |
| Step Functions orchestration | Adds a state-machine DSL + per-transition cost (~$3/mo beta, superlinear). The DB already provides stage observability via documents.status. Caveat surfaced by red-team: if all-Lambda fan-out wins BT-1, Step Functions Map is the natural fan-in orchestrator and its cost deserves a fresh look at that point — but as a standalone v1 orchestration layer it's rejected. |
| Single monolithic Lambda + extended timeout | The 15-min hard cap doesn't cover a whole-doc extract (~30 min). NB: this is the monolithic framing — the per-unit-fan-out Lambda option above is different and is NOT rejected. |
| Worker inside the existing MCP container | Couples ingestion failure mode to MCP availability; ingestion concurrency blows the MCP latency budget (<200ms p99). The MCP container is a query surface, not a compute surface. |
Related Epics
| Epic | Relationship |
|---|---|
| EPIC-4 (AWS deploy) | Predecessor. W3 web stack + AgentCore Runtime + Cognito all in place. EPIC-4.5 fits into the existing CDK app structure. |
| EPIC-5 (beta onboarding + cost data) | Successor. Cost telemetry (per-org Anthropic spend, Fargate-hour usage) gets surfaced in EPIC-5's cost dashboard. |
| D16 (Bedrock for prod LLM) | Future swap-in. Worker's cli-client.ts-shaped extractor abstraction is the seam — Anthropic SDK now, Bedrock later, single-file change. |
| D44 (Connector backend API-route) | Independent. D44's connector creation still uses Main Lambda; not touched by this sub-system. |
Cross-Cutting Concerns
| Concern | How this sub-system handles it |
|---|---|
| D13 multi-tenancy | Worker reads organizationId from the job context (set by Main Lambda at row-creation/enqueue); all DB writes scoped via knowledge_bases.organization_id. Cache S3 keys are org-namespaced. |
| D19 MCP-as-infrastructure | Worker writes the same chunks table the MCP server reads. Inspector hash-anchors still work post-ingestion. No MCP-side change. |
| D38 unified agent surface | No effect. Worker only writes; agent surfaces read. |
| Observability | Worker emits structured stdout JSON → CloudWatch Logs (same pattern as the MCP container's audit log). Log schema: { ts, severity, component, documentId, organizationId, stage, durationMs, message }. Per-doc / per-unit / per-LLM-call spans. Retention: 30 days. Alarms: SQS depth >100 for >5 min; DLQ count >0; task/invocation failures >0/hr; worker duration >1 hr (likely stuck). v1.1: trace IDs propagated from web request through SQS. |
| Cost control | LLM spend is the dominant axis (D18); infra is rounding error. Worker tracks per-doc extraction_cost_cents (new column — plumbed in EPIC-5, not here). API-tier dependency tracked as BT-2. |
| D47 server-action ID drift | /api/kb/[kbId]/upload-url is an API route, not a server action — survives across deploys by construction. |
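The Observability row fixes a log schema but not an emitter. A minimal sketch of one, assuming ISO-8601 timestamps and a four-level severity set (neither is specified in this doc); the type and function names are hypothetical, and the stage names are taken from the pipeline list in the compute-plane paragraph:

```typescript
// Assumption: the doc does not enumerate severities; these four are illustrative.
type Severity = "debug" | "info" | "warn" | "error";

// Pipeline stages as listed for the compute plane.
type IngestStage =
  | "render" | "parse" | "structure" | "units" | "extract" | "embed" | "finalize";

// Field names and order follow the schema in the Observability row.
interface IngestLogLine {
  ts: string;
  severity: Severity;
  component: string;
  documentId: string;
  organizationId: string;
  stage: IngestStage;
  durationMs: number;
  message: string;
}

// Build one JSON log line; the caller writes it to stdout for CloudWatch Logs.
function formatLogLine(
  fields: Omit<IngestLogLine, "ts">,
  now: Date = new Date(),
): string {
  const entry: IngestLogLine = { ts: now.toISOString(), ...fields };
  return JSON.stringify(entry);
}
```

Keeping one line per event with a fixed key set makes the per-doc and per-stage spans filterable on documentId and stage without a log parser.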
Decisions Log
| Date | Decision | Status | Rationale |
|---|---|---|---|
| 2026-05-27 | Data plane: Browser-to-S3 presigned + SQS + [compute] → RDS + cache S3 | Settled (survived red-team) | Decouples ingestion from web Lambdas; bypasses 6MB cap + RO-FS regardless of compute model. Standard pattern. |
| 2026-05-27 → resolved 2026-05-28 | Compute runtime → all-Lambda per-unit fan-out | RESOLVED (BT-1) | Blue-team chose all-Lambda over Fargate/hybrid: serverless-native once (no later migration) + free global rate ceiling via reserved concurrency; accepted fan-in complexity. Contract: requirements/cloud-native-ingestion. |
| 2026-05-27 | RT-11 — Main Lambda creates the documents row at presign time (status='pending'); S3 ObjectCreated event triggers the worker, which resolves the row by object key | Settled | Row exists for the full lifecycle so the UI poller always finds it. Orphan-pending rows (presign without upload) reaped by the stuck-pending janitor. |
| 2026-05-27 | RT-12 — Add UNIQUE (document_id, content_hash) to chunks | Settled (Phase-0 migration) | Makes "resume from MAX(pages.id) on retry" idempotent — partial chunks from a crashed extract dedupe on re-run rather than duplicate. Also underpins the fan-in marker design. |
| 2026-05-27 | RT-14 — Presigned URL enforces Content-Length-Range [0,100MB] + content-type allowlist (pdf/docx) | Settled (beta-blocker) | Without it, any signed-in user dumps arbitrary bytes into the uploads bucket. Constraint lives in the presign, not just downstream. |
| 2026-05-27 | RT-15 — Cap uploads at 100MB single-PUT for v1; defer multipart to v1.1 | Settled | STEM Racing's largest file ~9MB. Drops a day of multipart-presign complexity (Initiate + per-part + Complete). |
| 2026-05-27 | RT-18 — Stuck-pending janitor ships in v1 | Settled | ~50 LoC cron Lambda re-enqueues status='pending' AND created_at < NOW()-30min. Also detects stuck-ingesting docs (the beta "stuck worker" alarm). |
| 2026-05-27 | RT-20 — Env-conditional at each I/O boundary; prod writes S3, dev keeps local FS | Settled | if (env.CACHE_BUCKET) putS3() else writeLocal() at each of the ~5-6 write sites. No new abstraction layer; corrects the earlier "single-file switch" overstatement. |
| 2026-05-27 | Q3 — Standard SQS + documentId idempotency (not FIFO) | Holds | Higher throughput, lower cost; worker SELECT-FOR-UPDATE handles same-doc duplicates. |
| 2026-05-27 | Q4 — Anthropic SDK (not Bedrock) for extraction in v1 | Holds, with RT-13 caveat | Per D12 split. Rate-limit profile CHANGES (per-key tier, not Max session) — see BT-2. D16 Bedrock swap week 3. |
| 2026-05-27 | Q5 — Single shared cache S3 bucket, org-namespaced keys | Holds | Simplest IAM; 100-bucket-per-account default limit kills per-org buckets. |
| 2026-05-27 | Q6 — Feature-flag inline→worker cutover (INGESTION_RUNTIME) | Holds | Phase-3 rollback = flip to inline. |
| 2026-05-28 | RT-10 — SQS→worker trigger = native SQS event-source mapping | Resolved (follows BT-1) | All-Lambda uses native SQS event-source mappings on all three queues; no EventBridge / ecs:RunTask. |
| 2026-05-28 | Prompt caching deferred to v1.1; observability = logs + 2 alarms | Settled (blue-team) | SDK-direct already gives −30–50%; measure before optimizing further. Beta observability scoped to any-DLQ >0 + stuck-ingesting alarm. |
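RT-14's constraints can be written down as S3 POST-policy conditions. One caveat worth checking before wiring: in S3, content-length-range is a POST-policy condition, so the sketch below assumes a presigned POST form rather than the presigned PUT named in the data-plane description. The function name, bucket, and key are hypothetical, and reading "100MB" as 100 MiB is an assumption:

```typescript
const MAX_UPLOAD_BYTES = 100 * 1024 * 1024; // 100MB cap (RT-15); MiB reading is an assumption

// Allowlist from RT-14: pdf and docx only.
const ALLOWED_CONTENT_TYPES = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", // .docx
] as const;

// The subset of S3 POST-policy condition shapes this sketch emits.
type PolicyCondition =
  | ["content-length-range", number, number]
  | ["eq", string, string]
  | { bucket: string };

// Build the conditions for one upload; rejects types outside the allowlist
// before anything is signed.
function buildUploadConditions(
  bucket: string,
  key: string,
  contentType: string,
): PolicyCondition[] {
  if ((ALLOWED_CONTENT_TYPES as readonly string[]).indexOf(contentType) === -1) {
    throw new Error(`content type not allowed: ${contentType}`);
  }
  return [
    { bucket },
    ["eq", "$key", key], // pin the exact object key
    ["eq", "$Content-Type", contentType], // pin the declared content type
    ["content-length-range", 0, MAX_UPLOAD_BYTES], // [0, 100MB] per RT-14
  ];
}
```

Signing is left to the SDK. The point of the sketch is that the size and type limits live in the signed policy, so S3 itself rejects a non-conforming upload instead of the worker discovering it later.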
Open Questions for Blue-Team
Blue-team ran 2026-05-28. BT-1 is resolved; the implementation contract is projects/autri/requirements/cloud-native-ingestion.
BT-1 — RESOLVED → all-Lambda per-unit fan-out. Chosen over Fargate single-worker and hybrid: serverless-native once (no Fargate→Lambda migration later) + reserved concurrency gives a free global Anthropic rate-limit ceiling. Accepted cost: fan-in coordination — idempotent per-unit markers (document_units UNIQUE) + a single-winner finalize CAS guard, specced in the requirements doc. Prompt caching deferred to v1.1; observability scoped to structured logs + 2 alarms (any-DLQ >0; doc stuck ingesting >40 min).
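The accepted fan-in cost (idempotent per-unit markers plus a single-winner finalize guard) can be illustrated with an in-memory model. This is a sketch of the logic only: the Set stands in for the document_units UNIQUE constraint and the boolean for the finalize CAS, both of which would be database operations in the real contract; all names here are hypothetical:

```typescript
// In-memory stand-in for one document's fan-in state.
interface DocFanInState {
  totalUnits: number;
  doneUnitIds: Set<string>; // models the UNIQUE per-unit marker
  finalized: boolean; // models the single-winner finalize CAS
}

interface UnitCompletion {
  remaining: number;
  shouldFinalize: boolean;
}

// Record that a unit finished. Safe to call more than once per unit, which
// matters because standard SQS can deliver the same message twice.
function completeUnit(state: DocFanInState, unitId: string): UnitCompletion {
  state.doneUnitIds.add(unitId); // duplicate delivery is a no-op
  const remaining = state.totalUnits - state.doneUnitIds.size;

  // Only the first caller to observe remaining === 0 wins the finalize step.
  let shouldFinalize = false;
  if (remaining === 0 && !state.finalized) {
    state.finalized = true;
    shouldFinalize = true;
  }
  return { remaining, shouldFinalize };
}
```

In this model a redelivered unit message never decrements the count twice, and a redelivery of the last unit never triggers a second finalize. Against a real database both properties depend on the marker insert and the finalize flip being atomic, which this single-threaded sketch does not demonstrate.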
BT-2 — Action item (open). Confirm autri's Anthropic API tier in the Console; it sets the Extract Lambda's reserved concurrency (the rate ceiling). Blocks Phase-1 sizing only, not Phase 0.
v1.1 deferrals (unchanged): multipart >100MB (RT-15), distributed token bucket (RT-16 — moot under all-Lambda unless tier ceilings change), SSE progress, retry UI, mid-LLM-call cancel, cache-bucket janitor, prompt caching, render fan-out per-page-range.
Carried-forward awareness items: per-doc extraction_cost_cents (EPIC-5); org-key S3 namespacing is advisory, not IAM-enforced (RT-5 hardening).
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| Local-FS-vs-S3 handled by env-conditional at each of ~5-6 write sites (RT-20) | Low | if (env.CACHE_BUCKET) putS3() else writeLocal() in parse / render / parse-docx / extractor / stage-files + /api/cache. Corrects the earlier "single-file switch" overstatement — it's a per-site boundary check, not one switch. No new abstraction layer (deliberate). |
| Worker image cold-start (Fargate) dominated by pnpm deploy --prod bundle (~1 GB) | Medium | Layer caching helps repeat deploys. Source the real number from the mcp.autri.ai container's measured cold start (D39) rather than the "30–60s" estimate. Pre-compile to dist/ if telemetry shows it matters. Moot under the BT-1 resolution (all-Lambda). |
| Multipart upload for files >100MB deferred to v1.1 (RT-15) | Low | Single-PUT covers all known beta files (~9MB max). Add Initiate/per-part/Complete presign API only when a real user hits the cap. |
| Distributed Anthropic token bucket deferred to v1.1 (RT-16) | Low | Only needed if BT-1 is revisited in favor of Fargate AND concurrent-doc volume climbs. Free under all-Lambda via reserved concurrency. |
| Per-doc cost tracking needs a documents.extraction_cost_cents column | Low | EPIC-5 cost dashboard depends on it; plumbed there, not here. |
| Cache-bucket janitor for deleted-doc caches (90-day) | Low | Needs a DB-join or S3 inventory query. v1.1. |
| Org-key namespacing on cache/uploads is advisory, not IAM-enforced per-org | Medium | A single worker role can read any org's keys. Fine for a single-worker beta; per-org IAM scoping is a hardening item if the RCE blast radius (RT-5) becomes a concern. |
| No retry UI for failed docs | Low | Dan's existing "delete + re-upload" flow covers beta. v1.1 adds a single "retry" button that re-enqueues the SQS message. |
| Mid-LLM-call cancel (vs mid-unit cancel) | Low | D27 already deferred this; the worker boundary doesn't change the answer. Per-unit cancel granularity is fine for beta. |
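The per-site boundary check from RT-20 (first row above) might look like the following at a single write site. A sketch under stated assumptions: the writers are injected so the example runs without the AWS SDK or Node's fs module, and the key layout, the .cache directory, and all function names are hypothetical:

```typescript
interface CacheEnv {
  CACHE_BUCKET?: string;
}

// Injected I/O so the sketch has no SDK or filesystem dependency.
interface CacheWriters {
  putS3(bucket: string, key: string, body: Uint8Array): Promise<void>;
  writeLocal(path: string, body: Uint8Array): Promise<void>;
}

const LOCAL_CACHE_DIR = ".cache"; // hypothetical dev-only directory

// Org-namespaced key, in the spirit of the D13 row; exact layout is an assumption.
function orgCacheKey(organizationId: string, documentId: string, fileName: string): string {
  return `${organizationId}/${documentId}/${fileName}`;
}

// The RT-20 check: prod (bucket configured) writes S3, dev writes local FS.
async function writeCacheArtifact(
  env: CacheEnv,
  writers: CacheWriters,
  key: string,
  body: Uint8Array,
): Promise<"s3" | "local"> {
  if (env.CACHE_BUCKET) {
    await writers.putS3(env.CACHE_BUCKET, key, body);
    return "s3";
  }
  await writers.writeLocal(`${LOCAL_CACHE_DIR}/${key}`, body);
  return "local";
}
```

Each of the write sites listed in that row would carry its own copy of this check, which is what "per-site boundary check, not one switch" means in practice.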
Red-teamed 2026-05-27; blue-teamed 2026-05-28. Data plane settled; compute runtime (BT-1) resolved to all-Lambda per-unit fan-out. Open action item: BT-2 (confirm the Anthropic API tier).