projects/autri/archive/sub-systems/cloud-native-ingestion

Sub-system design doc for autri's ingestion + file-upload pipeline. Scopes EPIC-4.5, the carve-out between EPIC-4 (AWS deploy) and EPIC-5 (beta onboarding). Driven by the 2026-05-28 cloud-native audit which found 13 patterns that work locally but break on AWS Lambda (6MB request cap, read-only filesystem, no CLI binaries, fire-and-forget process death). Captured in cross-project memory as feedback_lambda_cloud_gotchas.md.

Drafted 2026-05-27 end-of-session, EPIC-4 P0 finish complete. Intentionally takes positions to give the next session's /hl:red-team pass real targets to attack. Working thesis: Browser-to-S3 presigned upload + SQS queue + Fargate worker. Three alternatives explicitly rejected; load-bearing decisions flagged in § Open Questions for Red-Team.

Risks & Constraints

Risk / Constraint	Likelihood	Impact	Mitigation
Compute-model cost surprises beta unit-economics	Low	Low	Both live options (Fargate ~$3–8/mo, all-Lambda ~$12/mo) are noise vs the ~$450/mo LLM bill. Budgets alarm regardless.
Worker cold start adds lag to first-doc-of-day	Medium	Low	Fargate ~30–60s; all-Lambda ~2–10s. Acceptable at beta either way; source the real Fargate number from the `mcp.autri.ai` container (D39) before assuming.
Worker crashes mid-extraction lose progress	High	Low	Resume from `MAX(pages.id)`; idempotent via `UNIQUE(document_id, content_hash)` (RT-12).
Cancel propagation across worker boundary is racy	Medium	Medium	Worker polls `documents.status` between units; SELECT-FOR-UPDATE rejects re-enqueued duplicates. Mid-unit kill is v1.1.
Lost S3 event leaves doc stuck at `pending`	Low	Medium	Stuck-pending janitor (RT-18, v1) re-enqueues rows >30 min old.
Presigned-URL abuse (arbitrary-bytes dump)	Medium	High	Presign enforces `Content-Length-Range [0,100MB]` + content-type pin (RT-14).
C1 — Anthropic rate limits hit harder without Max-plan cushion	Medium	Medium	Profile CHANGES CLI→SDK (per-key tier, not session — RT-13). Confirm tier in Console (BT-2). Global ceiling is free under all-Lambda (reserved concurrency); v1.1 distributed bucket under Fargate. Circuit breaker (D31) already exists.
C2 — Worker IAM creep ("just give it admin")	Medium	High	Scoped to: read uploads bucket, write cache bucket, RDS connect, Secrets Manager read on the Anthropic key only, SQS receive/delete on one queue. No Cognito, no other S3, no Lambda invoke. Dedicated IAM review at implementation (RT-5).
C3 — Shared VPC pins worker blast radius to web Lambdas	Low	Low	Same subnets + NAT as web Lambdas (no new $32/mo NAT). Isolation would cost ~$32/mo per separate NAT — not worth it at beta. Named so the coupling is a choice, not an accident (RT-19).
C4 — Connector-secret `dev-email` hardcoding (audit 3.1) becomes irrelevant only when CLI is retired in prod	Low	Low	Cleanup folded into this epic — `mcp-servers/doc-search/src/cli/make-token.ts` rm'd when the worker stops needing it.

Overview

Current Status

Drafting → red-teamed (2026-05-27). Surface defined by the audit; data-plane architecture settled; compute runtime is the open blue-team decision. Next: /hl:blue-team to pick the compute model + scope-cut to MVP, then output the requirements doc that anchors implementation.

Capability	Status	Notes
Browser-to-S3 presigned upload	⏳ designed	Bucket exists (`NetworkAndData/UploadsBucket`); presigner route + CORS + content-length/type constraints (RT-14) pending
S3-backed page/cache reads	⏳ designed	CloudFront `/api/cache/*` behavior → S3 origin (Lambda out of the read path)
Compute worker	❓ open (BT-1)	Fargate single-worker vs all-Lambda fan-out vs hybrid — see § Why Browser-to-S3 + SQS
SQS job queue	⏳ designed	Standard + `documentId` idempotency (Q3); DLQ alarmed
Anthropic SDK extractor (replaces CLI subprocess)	⏳ designed	Direct lift of CLI prompts; D12 prod-path. Rate-limit profile changes (BT-2)
Idempotency on retry	⏳ designed	`UNIQUE (document_id, content_hash)` on chunks (RT-12), Phase-0 migration
Stuck-pending janitor	⏳ designed (v1)	Cron Lambda re-enqueues stuck rows (RT-18)
Cancel/retry semantics	⏳ designed	Per-unit, same shape as today
Connector-creation CLI cleanup	⏳ designed	dev-email + dev-only helpers retired

The Story

EPIC-4 shipped end of session 2026-05-28: auth lockdown, defense-in-depth at FU origin, multi-tenancy enforcement across all read paths. Then createKb failed in prod. The diagnostic audit pulled the thread and found 13 patterns that work locally but break in the Lambda runtime — and that the problem isn't createKb specifically, it's that the whole ingestion + file-upload arc is fundamentally cloud-incompatible.

The session also produced a category insight (feedback_lambda_cloud_gotchas.md): four structural Lambda gotchas — 6MB request cap, read-only filesystem, no CLI binaries available at runtime, fire-and-forget process death when the response returns — that any web stack built locally needs to audit BEFORE deploy. The audit gave us the surface for autri; this doc designs the architecture to address it.

EPIC-4.5 is the boundary. EPIC-5 ("onboard beta users, collect cost data") presupposes a working ingestion path; mixing the architectural rewrite into EPIC-5 would make it unscope-able.

What Is This Sub-system?

Owned by this sub-system:

Browser-side upload UX (file picker, progress, presigned-URL fetch + S3 PUT)
Presigned-URL minter + documents row creator (a route on the Main Lambda)
Uploads bucket lifecycle (presigned PUT → notification → SQS message → worker pickup)
Ingestion worker (Fargate task OR per-unit Lambda fan-out — compute model open, BT-1) — render, parse, structure, units, extract, embed, finalize
Cache bucket — page images, extraction-grouping caches, /api/cache origin
SQS job queue (+ DLQ), stuck-pending janitor
Anthropic SDK-based extractor (replaces spawn("claude") for the prod path)

Not owned, but interfaced with:

RDS Postgres (workers write pages / chunks / documents.status; web Lambdas read)
Web Lambdas (Main + Chat) — enqueue ingestion jobs + serve presigned URLs but don't ingest
mcp.autri.ai AgentCore container — read-only consumer of the chunks the worker produces
Cognito (workers don't touch auth; jobs carry user_id + organization_id from the enqueuing request)

Explicit non-goals for v1:

Real-time ingestion progress streaming to the browser (poll-based snapshot already works; SSE/WebSockets is v1.1)
Parallel ingestion of multiple docs in one job (one doc per SQS message; within-doc concurrency runs inside the worker)
Retry UI ("retry this failed doc") — existing per-doc cancel + re-add covers beta
Multipart upload >100MB (RT-15 — single-PUT only for v1)
Multi-region (single-region us-east-1, same as everything else)

Architecture

The Big Idea

Decouple the compute model from the web Lambdas. Web is a request/response surface bounded by CloudFront's 60s timeout and Lambda's 6MB / 15min / RO-FS envelope. Ingestion is minute-to-hour-scale, gigabyte-scale, filesystem-heavy. Forcing both into the same runtime is what produced the 13 audit findings. Pull ingestion out, run it in a runtime suited to long-lived stateful work, and use S3 as the durable handoff.

The browser uploads files DIRECTLY to S3 via presigned URL — the file never traverses any Lambda. The Main Lambda only mints the presign (with content-length + type constraints) and pre-creates the documents row. The worker consumes the queue, reads from S3, writes results to RDS + cache S3, and updates documents.status so the UI's existing poll-based snapshot shows progress with no UI changes.

What "the worker" is remains open (BT-1). Red-team established that the pipeline already decomposes into per-section units (D27), each finishing well under any function timeout. That dissolves the original reason to insist on a long-running Fargate process — a per-unit Lambda fan-out is equally viable and brings a better cold-start profile plus a free global rate-limit ceiling (reserved concurrency). The data-plane shape above is invariant across all three compute options; only the box that says "[compute worker]" changes. See § Why Browser-to-S3 + SQS for the comparison.

Architecture Diagram

                            ┌────────────────────────┐
                            │     Browser            │
                            └───────────┬────────────┘
                                        │
                  (1) POST /api/kb/[id]/upload-url  { filename, fileSize }
                                        ▼
                            ┌───────────────────────┐
                            │  CloudFront (Main FU) │
                            └───────────┬───────────┘
                                        ▼
                  ┌──────────────────────────────────────┐
                  │  Main Lambda (VPC)                    │
                  │  - D13 access check                   │
                  │  - INSERT documents (status=pending)  │
                  │  - mint S3 PUT presign (≤100MB,       │
                  │      content-type pinned)             │
                  └───────────────────┬──────────────────┘
                                        │ returns { uploadUrl, objectKey, documentId }
                  (2) PUT <presigned-url> + file bytes
                                        ▼
                            ┌──────────────────────────┐
                            │  S3: autri-uploads/      │
                            │   org/<orgId>/raw/...    │
                            └───────────┬──────────────┘
                                        │
                  (3) S3 ObjectCreated event → SQS
                                        ▼
                            ┌──────────────────────────┐
                            │  SQS: autri-ingest-jobs  │
                            │   (+ DLQ, alarmed)       │
                            └───────────┬──────────────┘
                                        │  (4) trigger — wiring is BT-1-dependent:
                                        │    • all-Lambda: native event-source mapping
                                        │    • Fargate/hybrid: EventBridge Pipes → ecs:RunTask
                                        ▼
                  ┌─────────────────────────────────────────┐
                  │  [ COMPUTE WORKER — open, BT-1 ]         │
                  │  render → parse → structure → units →    │
                  │  extract (Anthropic SDK) → embed → final │
                  │  resumes from MAX(pages.id) on retry;    │
                  │  chunks deduped via UNIQUE(doc,content)  │
                  │   ─ Fargate single task, OR              │
                  │   ─ per-unit Lambda fan-out + counter    │
                  └──┬────────────────────────────────┬──────┘
                     │ (5) page renders + caches       │ (6) chunks + pages
                     ▼                                  ▼  + status updates
        ┌──────────────────┐                ┌──────────────────┐
        │ S3: autri-cache/ │                │  RDS Postgres    │
        │  org/<orgId>/    │                │  +pgvector       │
        └────────┬─────────┘                └──────────────────┘
                 │
        (7) CloudFront /api/cache/* origin = S3 (Lambda NOT in path)
                 ▼
        ┌──────────────────┐
        │     Browser      │ (page-image render in inspector)
        └──────────────────┘

    + stuck-pending janitor (cron Lambda): re-enqueues documents stuck at
      status=pending >30min — covers lost S3 events (RT-18)

System Boundary

Uploads bucket is write-once (presigned PUT for upload), read-many (worker reads, then deletes raw file after successful ingestion). Lifecycle rule: delete after 30 days even if ingestion never completed.
Cache bucket is the durable artifact store. Page PNGs + parse JSONs + extractor groupings. CloudFront origin for /api/cache/*. Cache-busting via doc-slug + content-hash in object key (existing cache/<slug>/page-NN-text.json shape preserved, just object-stored).
SQS job queue is the only handoff between web and worker. Web never directly invokes the worker; worker never reads from anything web-shaped.

Component-by-component

S3 — uploads bucket (existing — NetworkAndData/UploadsBucket)

Versioning: off (raw uploads immutable post-PUT; re-upload is a new object).
Lifecycle: 30-day expiry on raw uploads (worker consumes within minutes; bucket isn't long-term storage).
CORS: PUT from https://app.autri.ai only. No public read.
Event notification → SQS on ObjectCreated:*.
Object key shape: org/<orgId>/raw/<kbSlug>/<docSlug>.<ext> — namespaced so worker IAM can be org-scoped if we ever do per-org workers.

S3 — cache bucket (new — NetworkAndData/CacheBucket)

Versioning: off.
Lifecycle: 90-day expiry on caches whose source doc was deleted (needs a DB-join or S3 inventory query — v1.1 janitor).
CORS: read from https://app.autri.ai. No public read.
CloudFront behavior /api/cache/* routes here with cache headers.
Object key shape: org/<orgId>/<kbSlug>/<docSlug>/page-NN-{text,paragraphs,subchunks,image}.json (mirrors existing local-FS layout — minimal worker code change).
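
A minimal sketch of the cache key shape above, for illustration only. The helper names (`cacheObjectKey`, `cacheUrlPath`) are hypothetical, and two-digit zero padding for `NN` is an assumption read off the `page-01-…` example used elsewhere in this doc.

```typescript
// Hypothetical helper: builds a cache-bucket object key in the shape
// org/<orgId>/<kbSlug>/<docSlug>/page-NN-<kind>.json described above.
type CacheKind = "text" | "paragraphs" | "subchunks" | "image";

function cacheObjectKey(
  orgId: string,
  kbSlug: string,
  docSlug: string,
  page: number,
  kind: CacheKind,
): string {
  if (!Number.isInteger(page) || page < 1) {
    throw new Error(`page must be a positive integer, got ${page}`);
  }
  // Assumption: NN is zero-padded to at least two digits (page-01, page-12).
  const nn = String(page).padStart(2, "0");
  return `org/${orgId}/${kbSlug}/${docSlug}/page-${nn}-${kind}.json`;
}

// The browser would fetch the same key through the /api/cache/* behavior.
function cacheUrlPath(key: string): string {
  return `/api/cache/${key}`;
}
```

Keeping the key builder in one shared function would let the worker (writer) and the inspector (reader) agree on the layout without either hard-coding it.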

Main Lambda — presign minter + row creator

Route: POST /api/kb/[kbId]/upload-url with { filename, fileSize } → returns { uploadUrl, objectKey, documentId }.
D13-enforced: caller's organization_id must own kbId. Object key is computed server-side from (orgId, kbSlug, docSlug) — client cannot influence the path.
RT-11 — creates the documents row (status='pending') BEFORE returning, so the row exists for the whole lifecycle and the UI poller always finds it. documentId is returned to the client; the worker resolves the row by object-key lookup on the S3 event.
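RT-14: presign carries hard constraints: a 100MB size cap (Content-Length-Range [0, 104857600]) + a content-type pin to the declared upload type (pdf/docx allowlist). Closes the "arbitrary-bytes dump" hole at the presign, not just at downstream worker validation. Files >100MB are rejected here (RT-15, single-PUT only for v1). Caveat to settle at implementation: S3 honors content-length-range only as a presigned-POST policy condition; a presigned PUT can pin only an exact signed Content-Length, so the cap is enforced either by switching to a presigned POST or by validating fileSize ≤ 100MB server-side and signing that exact length.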
IAM: s3:PutObject on uploads bucket under org/${user.organizationId}/raw/* ONLY (path-scoped condition).
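
The server-side half of the route above (validate the request, compute the key) could look roughly like this. It is a sketch, not the route itself: `planUpload` and `slugify` are hypothetical names, the slug rule and the exact MIME strings are assumptions, and the presign call plus the documents INSERT are deliberately left out.

```typescript
const MAX_UPLOAD_BYTES = 100 * 1024 * 1024; // 104857600, the RT-14 cap

// Allowlist from RT-14 (pdf/docx); the MIME strings are an assumption.
const ALLOWED_TYPES = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
]);

interface UploadRequest { filename: string; fileSize: number; }
interface UploadPlan { objectKey: string; contentType: string; contentLength: number; }

// Hypothetical slug rule: lowercase, runs of non-alphanumerics become "-".
function slugify(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// orgId and kbSlug come from the authenticated session + D13 check,
// never from the request body, so the client cannot influence the path.
function planUpload(orgId: string, kbSlug: string, req: UploadRequest): UploadPlan {
  const dot = req.filename.lastIndexOf(".");
  const ext = dot >= 0 ? req.filename.slice(dot + 1).toLowerCase() : "";
  const contentType = ALLOWED_TYPES.get(ext);
  if (contentType === undefined) throw new Error(`unsupported file type: .${ext}`);
  if (!Number.isInteger(req.fileSize) || req.fileSize < 0 || req.fileSize > MAX_UPLOAD_BYTES) {
    throw new Error("fileSize must be an integer between 0 and 100MB");
  }
  const docSlug = slugify(req.filename.slice(0, dot));
  if (docSlug === "") throw new Error("filename has no usable stem");
  return {
    objectKey: `org/${orgId}/raw/${kbSlug}/${docSlug}.${ext}`,
    contentType,
    contentLength: req.fileSize,
  };
}
```

For example, `planUpload("org_1", "handbook", { filename: "Q3 Report.PDF", fileSize: 2048 })` yields the key `org/org_1/raw/handbook/q3-report.pdf`; the returned content type and length are what the presign would then be pinned to.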

SQS — autri-ingest-jobs queue

Q3 — Standard queue + documentId idempotency in the message body. Worker SELECT-FOR-UPDATE on documents.status='pending' before processing dedupes same-doc duplicates without FIFO's throughput cap.
Visibility timeout: 60 min (matches expected longest doc).
Retry: 3 receives before DLQ. DLQ: autri-ingest-jobs-dlq, alarmed. Replay (RT-17): documented AWS CLI one-liner in the deploy runbook (receive-message from DLQ → send-message to main queue); tool-build deferred to v1.1.
Message shape: { documentId, organizationId, kbId, sourceObjectKey, sourceExt, extractorModel } — small (~300 bytes).
RT-10 — the SQS→worker trigger is BT-1-dependent and intentionally unwired here. All-Lambda = native event-source mapping; Fargate/hybrid = EventBridge Pipes (SQS→ECS target) or SQS-triggered Lambda → ecs:RunTask.
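
A possible worker-side guard for the message shape above, sketched under the assumption that all six fields travel as non-empty strings (the real ID types are not stated here). `parseIngestJob` is a hypothetical name.

```typescript
interface IngestJob {
  documentId: string;
  organizationId: string;
  kbId: string;
  sourceObjectKey: string;
  sourceExt: string;
  extractorModel: string;
}

// Parses and validates one SQS message body. Anything malformed throws,
// so the message is not deleted and ends up in the DLQ after 3 receives.
function parseIngestJob(body: string): IngestJob {
  const raw: unknown = JSON.parse(body); // throws SyntaxError on malformed JSON
  if (typeof raw !== "object" || raw === null) {
    throw new Error("job body is not an object");
  }
  const rec = raw as Record<string, unknown>;
  const read = (field: keyof IngestJob): string => {
    const value = rec[field];
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`missing or non-string field: ${field}`);
    }
    return value;
  };
  // Copy only the known fields so stray keys never reach the pipeline.
  return {
    documentId: read("documentId"),
    organizationId: read("organizationId"),
    kbId: read("kbId"),
    sourceObjectKey: read("sourceObjectKey"),
    sourceExt: read("sourceExt"),
    extractorModel: read("extractorModel"),
  };
}
```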

[COMPUTE WORKER] (open decision, BT-1)

The component that consumes the queue and runs render → parse → structure → units → extract → embed → finalize. Three live shapes (full comparison in § Why Browser-to-S3 + SQS):

Fargate single task (on-demand run-task): in-process pipeline, D27's --concurrency N semaphore for within-doc parallelism. Simplest fan-in (a loop). 30–60s cold start; no global Anthropic ceiling.
All-Lambda per-unit fan-out: parse/structure/units as sequential per-doc Lambdas; extract fanned out one-Lambda-per-unit with reserved concurrency as a free global rate-limit ceiling; a distributed completion counter (UPDATE … RETURNING remaining_units, last-one-done → finalize) handles fan-in. ~2–10s cold start; ~5× per-doc cost (trivial at beta).
Hybrid: Lambda for the light stages, Fargate for extract. Best-of-both on paper; two runtimes' worth of operational surface.

Common to all: reads uploads S3, writes cache S3 + RDS, resumes from MAX(pages.id) on retry (idempotent via RT-12's unique constraint), polls documents.status for cancel between units.

IAM (RT-5): read uploads bucket org/*/raw/*, write cache bucket org/*, RDS connect, Secrets Manager read on the anthropic-api-key secret ONLY, SQS receive+delete on autri-ingest-jobs only. No Cognito, no Lambda invoke, no other S3. Blast radius of an RCE'd worker: cross-org S3 read within uploads+cache (org-key namespacing is advisory, not IAM-enforced — a hardening item), plus RDS at the app role's level. Worth a dedicated IAM review at implementation.
VPC (RT-19): same private-with-egress subnets as web Lambdas — shares their NAT (no new $32/mo NAT) but pins the worker's network blast radius to theirs. A separate isolating VPC adds ~$32/mo per NAT; not worth it at beta. Named so the coupling is explicit.
Resource sizing (if Fargate): 2 vCPU, 4 GB RAM — render is the memory peak (page images held while parsing). Revisit with telemetry.
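
The fan-in counter named in the all-Lambda option can be illustrated with an in-memory stand-in. This is a sketch only: `CompletionCounter` and `onUnitFinished` are hypothetical names, and in the real design the decrement would be one atomic statement along the lines of `UPDATE … RETURNING remaining_units`, not a Map.

```typescript
// In-memory stand-in for the per-document remaining_units column.
class CompletionCounter {
  private remaining = new Map<string, number>();

  // Called once per document, after the units stage knows the unit count.
  start(documentId: string, unitCount: number): void {
    this.remaining.set(documentId, unitCount);
  }

  // Decrements and returns the count left; 0 means "last one done".
  unitDone(documentId: string): number {
    const left = this.remaining.get(documentId);
    if (left === undefined || left <= 0) {
      throw new Error(`no open units for ${documentId}`);
    }
    this.remaining.set(documentId, left - 1);
    return left - 1;
  }
}

// Each extract invocation calls this when its unit finishes; only the
// invocation that observes 0 runs finalize.
function onUnitFinished(
  counter: CompletionCounter,
  documentId: string,
  finalize: (id: string) => void,
): void {
  if (counter.unitDone(documentId) === 0) finalize(documentId);
}
```

One thing the sketch does not handle: with at-least-once delivery a unit can finish twice, which would double-decrement. A real version would likely need a per-unit "already counted" guard alongside the counter.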

Anthropic SDK extractor (replaces CLI subprocess)

Per D12 prod-path split. The CLI extractor exists for Max-plan billing in dev; prod uses the Anthropic API (NOT Bedrock yet — that's D16, scheduled for week 3 post-cutover).
Lift the existing ingestion/extractor/cli-client.ts interface; implement the Anthropic SDK side. Same prompts, same tool-use loop, same JSON output. Replace spawn("claude") + stdout JSON envelopes with client.messages.create({...}) + iterate tool_use blocks via stop_reason. D31 circuit breaker ports cleanly (rate-limit detection moves from stdout parsing to SDK error codes).
RT-13 — rate-limit profile CHANGES. The CLI rides Max-plan SESSION limits; the SDK rides per-API-key TIER limits (e.g. Tier 1 ≈ 50 req/min Sonnet). Action: confirm autri's API tier in the Anthropic Console — that number bounds safe concurrency. Beta risk accepted; the global-ceiling mechanism is BT-1-dependent.
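
To make "that number bounds safe concurrency" concrete, here is a back-of-envelope sketch using Little's law (in-flight calls ≈ request rate × call latency). The function name, the 80% headroom default, and the call latencies in the usage note are assumptions; only the 50 req/min figure comes from the text, and token-per-minute limits are not modeled.

```typescript
// Rough ceiling on concurrent extract calls so the steady-state request
// rate stays under the API key's requests-per-minute limit.
function safeConcurrency(
  limitPerMin: number,
  avgCallSeconds: number,
  headroom = 0.8, // assumption: keep 20% of the limit in reserve
): number {
  if (limitPerMin <= 0 || avgCallSeconds <= 0) {
    throw new Error("limitPerMin and avgCallSeconds must be positive");
  }
  const inFlight = (limitPerMin * headroom * avgCallSeconds) / 60;
  // Never return 0: a single in-flight call is always allowed.
  return Math.max(1, Math.floor(inFlight));
}
```

With a 50 req/min tier and an assumed ~10s average extract call, `safeConcurrency(50, 10)` gives 6; that would be the reserved-concurrency cap under all-Lambda or the `--concurrency N` value under Fargate, to be replaced by measured latency once the SDK extractor runs.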

Stuck-pending janitor (RT-18, v1)

Cron Lambda (~50 LoC, EventBridge schedule every ~10 min): SELECT id FROM documents WHERE status='pending' AND created_at < NOW() - INTERVAL '30 min' → re-enqueue to SQS. Covers the lost-S3-event silent-failure mode and reaps orphan rows from presigns that never uploaded (no S3 object → worker marks failed).
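
The janitor's selection rule, restated as a pure function so the 30-minute cutoff is easy to test. A sketch only: `DocRow` and `findStuckPending` are hypothetical names, and the real janitor would run the SQL above rather than filter rows in memory.

```typescript
interface DocRow {
  id: string;
  status: string;
  createdAt: Date;
}

const STUCK_AFTER_MS = 30 * 60 * 1000; // 30 minutes

// Mirrors: WHERE status = 'pending' AND created_at < NOW() - INTERVAL '30 min'
function findStuckPending(rows: DocRow[], now: Date): string[] {
  const cutoff = now.getTime() - STUCK_AFTER_MS;
  return rows
    .filter((row) => row.status === "pending" && row.createdAt.getTime() < cutoff)
    .map((row) => row.id);
}
```

Each returned id would then be re-enqueued to autri-ingest-jobs. Because the rule keys on created_at, a row that stays pending is picked up again on every run, which is why the worker-side dedupe matters.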

Request Lifecycle

Upload + ingest path:

User selects file in KbCreateWizard or AddDocumentsDialog.
Client POSTs /api/kb/[kbId]/upload-url with { filename, fileSize }.
Main Lambda validates D13 access, creates the documents row (status='pending', RT-11), computes objectKey, mints a presign with 60-min expiry + Content-Length-Range [0,100MB] + content-type pin (RT-14), returns { uploadUrl, objectKey, documentId }. Files >100MB are rejected here (RT-15 — single-PUT only for v1).
Client uploads the file directly to S3 via the presigned URL.
S3 fires ObjectCreated → SQS message lands in autri-ingest-jobs.
Trigger fires the compute worker (mechanism is BT-1-dependent — see Component-by-component).
Worker SELECT-FOR-UPDATEs the document row, runs the pipeline, writes results, deletes the message, deletes the raw upload.
Browser's existing pipeline-status poller (getPipelineSnapshot) shows progress — no UI changes.
If an S3 event is lost and the row sits at pending >30 min, the janitor (RT-18) re-enqueues it.
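
The claim step in the lifecycle above (SELECT-FOR-UPDATE on a pending row) is what makes duplicate deliveries and janitor re-enqueues harmless. An in-memory sketch of that rule follows; `claimDocument` is a hypothetical name and every status other than `pending` is an assumed label.

```typescript
// Status names other than "pending" are assumptions for illustration.
type DocStatus = "pending" | "processing" | "done" | "failed" | "cancelled";

// Stand-in for a transaction that locks the row with SELECT ... FOR UPDATE,
// checks status = 'pending', and flips it before committing.
// Returns true only for the first caller; duplicates get false and skip.
function claimDocument(statuses: Map<string, DocStatus>, documentId: string): boolean {
  if (statuses.get(documentId) !== "pending") return false;
  statuses.set(documentId, "processing");
  return true;
}
```

How a crashed run's row gets back to a claimable state (so the retry can resume) is not specified in this doc, so the sketch leaves it out.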

Cache read path:

Inspector renders, fetches /api/cache/org/<orgId>/<kbSlug>/<docSlug>/page-01-image.json.
CloudFront /api/cache/* behavior routes to the cache S3 bucket origin (NOT a Lambda).
S3 serves the cached render directly. CloudFront caches with Cache-Control: max-age=86400.
Lambda is NOT in the read path — one of the biggest cost + latency wins of the rewrite.

Key Interfaces

Producer	Consumer	Interface	Shape
Browser	Main Lambda	`POST /api/kb/[kbId]/upload-url`	`{ filename, fileSize } → { uploadUrl, objectKey, documentId }`
Browser	S3 uploads	Presigned PUT (≤100MB, content-type pinned)	binary file body
S3	SQS	`ObjectCreated` event notification	S3 event JSON
SQS	Compute worker	BT-1-dependent: native event-source mapping (Lambda) OR EventBridge Pipes → `ecs:RunTask` (Fargate/hybrid)	job message
Worker	RDS	`INSERT/UPDATE` documents, pages, chunks	Drizzle schema (+ `UNIQUE(document_id, content_hash)`, RT-12)
Worker	S3 cache	`PutObject`	page PNGs + parse + groupings
CloudFront `/api/cache/*`	S3 cache	GET origin	binary / JSON

Build & Deploy

Build artifacts

app/ (Next.js) — adds the presign + row-creation endpoint; drops local-FS write paths (env-conditional per RT-20). Otherwise unchanged.
ingestion-worker/ (new package) — bundles @autri/retrieval + ingestion code + Anthropic SDK extractor. Packaged as a Fargate container image OR one-or-more Lambda functions (container/zip) per BT-1. Dockerfile mirrors mcp-servers/doc-search/Dockerfile either way.
autri-infra — adds the ingestion constructs: SQS queue + DLQ, cache bucket, the compute target (TaskDefinition or Lambda functions per BT-1), the SQS→worker trigger, the stuck-pending janitor, IAM roles, and the chunks unique-constraint migration.

CDK provisioning vs deploy script split

CDK provisions: SQS queues, cache bucket, the compute target + trigger (BT-1-shaped), the janitor Lambda, IAM, and the Phase-0 chunks migration. (The uploads bucket already exists per NetworkAndData; the cache bucket is new.)
scripts/deploy-worker.sh (new) builds + pushes the worker artifact, bumps its revision (CDK context var per D40 pattern). Shape depends on BT-1 (image push for Fargate; function update for Lambda).
Web stack's scripts/deploy-web.sh continues to deploy Main + Chat without touching the worker.

Deploy phasing (first-deploy bootstrap)

Phase 0: CDK deploys new constructs (SQS queue + DLQ, IAM, cache bucket, compute-trigger wiring per BT-1) + bumps Main Lambda IAM to mint presigns + create rows. Migration: ALTER TABLE chunks ADD CONSTRAINT chunks_doc_content_uniq UNIQUE (document_id, content_hash) (RT-12). Deploy the stuck-pending janitor (RT-18).
Phase 1: Build + push the worker artifact (Fargate image or Lambda container/zip per BT-1); register it. ECS/Lambda reads it on next invocation.
Phase 2: Migrate /api/cache CloudFront behavior to S3 origin (one CloudFront invalidation; cache bucket is empty so cold loads regenerate from worker output).
Phase 3: Cut Main Lambda's stageFiles + fire-and-forget runIngestionPipeline; replace with enqueueIngestion(documentId) (insert row + write SQS message), behind INGESTION_RUNTIME=worker|inline (Q6).

Each phase is independently deployable + reversible. The bootstrap chicken-and-egg (TaskDefinition/Lambda needs an image URI that doesn't exist until built) follows the existing Main Lambda placeholder pattern in lib/web/lambdas.ts:208-222 (RT-6).

Rollback Strategy

TaskDefinition rollback: point ECS at previous TaskDefinition revision (single AWS call).
Code-path rollback: Phase 3's "enqueue vs runInline" is feature-flagged via env var INGESTION_RUNTIME=worker|inline on Main Lambda. Flip back to inline disables the queue path (until we cut the inline code permanently in a follow-up).
Data rollback: none required. The only schema change is the additive Phase-0 UNIQUE constraint on chunks (RT-12); cache S3 is regeneratable from raw uploads or by re-ingesting.
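
The code-path flag could be read along these lines. A sketch under assumptions: `resolveIngestionRuntime`, `startIngestion`, and `IngestionPaths` are hypothetical names, the two paths are shown as synchronous callbacks for brevity, and treating an unset or unknown value as a configuration error (rather than silently picking a path) is a choice made here, not something the doc specifies.

```typescript
type IngestionRuntime = "worker" | "inline";

function resolveIngestionRuntime(env: Record<string, string | undefined>): IngestionRuntime {
  const value = env["INGESTION_RUNTIME"];
  if (value === "worker" || value === "inline") return value;
  // Assumption: fail loudly instead of guessing a default.
  throw new Error(`INGESTION_RUNTIME must be "worker" or "inline", got ${String(value)}`);
}

interface IngestionPaths {
  enqueueIngestion: (documentId: string) => void; // insert row + write SQS message
  runInline: (documentId: string) => void; // pre-cutover in-process pipeline
}

// Single switch point: flipping the env var is the whole rollback.
function startIngestion(
  documentId: string,
  env: Record<string, string | undefined>,
  paths: IngestionPaths,
): IngestionRuntime {
  const runtime = resolveIngestionRuntime(env);
  if (runtime === "worker") {
    paths.enqueueIngestion(documentId);
  } else {
    paths.runInline(documentId);
  }
  return runtime;
}
```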

Cost Shape

Idle (zero traffic)

SQS queue: ~$0 (charged per request; idle = zero requests).
Cache S3: $0.023/GB-month. 10 GB beta cache = $0.23/mo.
Uploads S3: $0.023/GB-month, but lifecycle expires raw after 30 days. Steady state ~5 GB = $0.12/mo.
ECR worker image (Fargate/hybrid): $0.10/GB-month per repo. ~1 GB compressed = $0.10/mo. (All-Lambda container images: same order.)
Compute: $0 idle under both live options (on-demand run-task or event-driven Lambda).
EventBridge: $1.00 per million events; idle = ~$0.
CloudWatch Logs: $0.50/GB ingested; idle = ~$0. (Retention set to 30 days — see Cross-Cutting > Observability.)

Total idle add-on: ~$1/mo on top of existing W3 idle floor.

Beta load (10 docs/day across 10 users)

Compute (the BT-1 decision drives this):

Compute model	Per-doc	Monthly (~300 docs)
Fargate on-demand (2 vCPU, 4 GB, ~5 min/doc)	~$0.008–0.01	~$3–8/mo
All-Lambda per-unit fan-out	~$0.04	~$12/mo

Lambda costs ~5× more per doc because it bills GB-seconds for LLM-I/O wait across every concurrent unit invocation; Fargate's flat task rate doesn't multiply with internal concurrency. The ~$4–9/mo delta is noise next to the LLM bill below.

Rest of the data plane (compute-independent):

S3 storage: ~20 GB cache + uploads ≈ $0.50/mo.
S3 PUT/GET: ~$1/mo.
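CloudFront egress for /api/cache/*: ~30 GB/mo × $0.085/GB ≈ ~$2.50/mo. (10 docs × ~50 pages × ~200 KB renders × 30 days is only ~3 GB of unique renders, so the 30 GB figure implies each render is fetched ~10 times; confirm with beta telemetry.)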
NAT data egress (Anthropic + OpenAI): ~10 GB/mo ≈ $0.45/mo (uses the existing shared NAT; D16's Bedrock-via-VPC-endpoint would skip NAT later).
SQS + EventBridge + CloudWatch: ~$1/mo.

Total data-plane add-on (excluding LLM): ~$8–15/mo on top of the W3 beta floor, depending on compute model.

LLM extraction (the dominant axis, architecture-independent): ~$0.005/chunk × ~300 chunks/doc × 300 docs/mo ≈ ~$450/mo. This is ~30–55× the infrastructure cost ($450 against ~$8–15/mo); every architecture decision in this doc is rounding error against it (per D18).
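
The two headline numbers above can be re-derived in a few lines. This is a sketch for sanity-checking, not a pricing tool: the Fargate unit rates are assumed us-east-1 Linux/x86 on-demand prices and should be checked against current AWS pricing; the function names are hypothetical.

```typescript
// Assumed us-east-1 Fargate on-demand rates (Linux/x86); verify before use.
const FARGATE_VCPU_HOUR_USD = 0.04048;
const FARGATE_GB_HOUR_USD = 0.004445;

// Cost of one on-demand task run for a single document.
function fargateCostPerDoc(vcpu: number, memoryGb: number, minutes: number): number {
  const hourlyRate = vcpu * FARGATE_VCPU_HOUR_USD + memoryGb * FARGATE_GB_HOUR_USD;
  return hourlyRate * (minutes / 60);
}

// LLM extraction spend: per-chunk cost × chunks per doc × docs per month.
function monthlyLlmCost(perChunkUsd: number, chunksPerDoc: number, docsPerMonth: number): number {
  return perChunkUsd * chunksPerDoc * docsPerMonth;
}

const perDoc = fargateCostPerDoc(2, 4, 5); // the sizing from this doc
const llm = monthlyLlmCost(0.005, 300, 300); // the beta-load figures above
console.log(perDoc.toFixed(4), llm.toFixed(0));
```

Under those assumed rates the Fargate per-doc figure lands at roughly $0.008, in line with the table, and the LLM line reproduces the ~$450/mo figure.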

Hypothetical 1k MAU

Compute scales linearly with doc throughput; the Fargate-vs-Lambda per-doc gap widens to ~$400–900/mo at ~30k docs/mo (the ~5× premium becomes material) — but both are still dwarfed by the ~$45k/mo Anthropic spend, and the hot path would be re-optimized before then. This is where BT-1's cost axis finally matters; at beta it doesn't.
The bottleneck becomes Anthropic API quota long before ECS task / Lambda concurrency limits.
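S3 storage grows with cache footprint; if it crosses ~1 TB, consider lifecycle transitions to S3 Standard-IA or Glacier Instant Retrieval on caches older than 90 days.
SQS scales without intervention; FIFO would cap throughput at 300 API calls/s per action without batching (3,000 msg/s batched, more in high-throughput mode), but we'd be Standard.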

Unit Economics & Cost Model

The only material cost is the LLM extract call. Every other stage — render, parse, structure, units, load, embed, link, finalize — is CPU/IO or a cheap embeddings call (~$0.005/doc for OpenAI text-embedding-3-small); combined they're fractions of a cent per doc. Infra (compute pattern, S3, SQS, CloudFront) adds ~$8–15/mo regardless. So unit economics ≈ the cost of the per-section Haiku call, full stop — and the compute-pattern decision (BT-1) is cost-neutral on this axis.

Validation stage (resolved 2026-05-28): planned to be cut or refactored to a non-LLM heuristic gate, so it does NOT add a second model pass. (If that reverses and validation becomes a real per-page/per-section agent re-check, extract cost roughly doubles — flagged so the decision stays visible.)

Measured cost (Dan's runs, via the CLI agentic loop):

Source type	Per section (unit)	Per chunk	Why
PDF / vision (Genesis: 45 units → 371 chunks, $3.36)	~$0.075	~$0.009	page images attached to the call
Prose / text (novel: 26 chapters → 312 chunks, $0.60)	~$0.023	~$0.002	text-only atoms, fewer LLM turns

The natural billing atom is the section (unit) — one LLM call per section, producing ~8–12 chunks. The 4–5× PDF-vs-prose gap is the page images.

Cost levers (cheapest first):

CLI → SDK direct (already the EPIC-4.5 plan, Q4): kills the agentic Read-tool round-trips. Dan's D27 estimate: −30–50% → PDF ~$0.005, prose ~$0.001/chunk. Free; happening anyway.
Prompt caching (D16 Y1 must-ship): the grouping prompt is static across every section of every doc; cached input bills ~10%. Net ~10–20% off, helps prose more (no images in input), and grows with volume as the cache stays warm across docs.
Batch API (−50%) — situational: ingestion is async, but the user watches a progress bar, so 24h batch latency only fits a future "bulk import, come back later" mode — not the interactive path.
Model choice / self-host — premature: vision is the blocker for self-hosting (PDF needs a capable vision model; a Mac mini can't serve one at throughput), and extraction quality is the product. Graduated path: API+Haiku → caching → Bedrock (D16) → maybe a cheaper/fine-tuned text model for prose grouping at high volume (vision keeps the capable model). Economics only flip at sustained high volume, on cloud GPU — not at beta.

The profitability shape — two facts that reframe the D18 worry:

Extraction is ONE-TIME; revenue is RECURRING. A user ingests once, then queries for months. At SDK+prose rates a typical manuscript (~300 chunks) is ~$0.45 one-time vs $10×N months. Even a cap-saturating author (30k chunks) is ~$45 one-time — profitable past ~5 months retention. The only loss case is saturate-the-cap-then-churn-in-month-1.
Chunks-stored ≠ chunks-extracted. A stored chunk is a cheap row + a ~6KB vector; 30k chunks ≈ 180 MB — near-free to store, cheap to query. The cost is the one-time extraction event, not the resting count. D18's single "chunks" axis conflates a value/storage axis with a cost axis.
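
The retention claim in the first point is simple arithmetic, sketched below. The $0.0015/chunk rate is back-derived from the figures above ($0.45 for ~300 chunks), the $10/month price is the one used in the text, and `breakEvenMonths` is a hypothetical name. Ongoing query and storage costs are ignored, which matches the "near-free" framing but is still an assumption.

```typescript
// Months of subscription needed to cover a one-time extraction cost.
function breakEvenMonths(chunks: number, costPerChunkUsd: number, monthlyPriceUsd: number): number {
  if (monthlyPriceUsd <= 0) throw new Error("monthlyPriceUsd must be positive");
  const extractionCost = chunks * costPerChunkUsd;
  return Math.ceil(extractionCost / monthlyPriceUsd);
}

const typicalManuscript = breakEvenMonths(300, 0.0015, 10); // ~$0.45 one-time
const capSaturating = breakEvenMonths(30_000, 0.0015, 10); // ~$45 one-time
console.log(typicalManuscript, capSaturating);
```

The typical manuscript pays back inside the first month; the cap-saturating author needs about five, which is the only case where month-1 churn produces a loss.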

Implication for pricing (refines D18; validate in EPIC-5): the cost driver is new extraction volume per month, not total chunks stored. This reconciles the competitive tension Dan raised — storage is cheap to give away, so we can offer roomy KB caps to stay competitive while metering the thing that actually costs money: ingestion throughput. Candidate model: generous stored-chunk caps (value axis) + a monthly new-ingestion budget or metered overage (cost axis), rather than one conflated chunk cap.

Open worry (Dan, 2026-05-28): users may upload aggressively; stingy storage caps could push them to a competitor. The mitigant above (generous storage, metered ingestion) is the working hypothesis — but real upload behavior is unknown until beta. Tracked for EPIC-5 cost telemetry; do not lock pricing before then.

Failure Modes

Worker container fails to pull image (Fargate/hybrid). Trigger has retry-on-failure; after 3 fails the SQS message goes to DLQ + CloudWatch alarm. Recovery: fix ECR repo policy or image tag; replay DLQ (runbook one-liner, RT-17).

Anthropic API rate-limit during extraction. Circuit breaker (D31) trips, worker exits cleanly. SQS visibility timeout expires, message redelivers up to 3× before DLQ. Workers across docs are independent. Note (RT-16): there is no GLOBAL concurrency ceiling under Fargate (per-process semaphore only) — under all-Lambda, reserved concurrency provides one for free. A distributed token bucket is the v1.1 answer if Fargate wins BT-1 and volume climbs.

Worker crashes mid-extraction (OOM, container restart, task abort). Message returns to queue after visibility timeout. On retry, worker reads documents.status + pages already written; resumes from MAX(pages.id). Idempotent because chunks carry a UNIQUE (document_id, content_hash) constraint (RT-12) — re-processed units upsert rather than duplicate.
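
Why the unique constraint makes replays safe can be shown with an in-memory analog. A sketch only: `ChunkStore` is a hypothetical name, and "skip on conflict" (in the spirit of `INSERT … ON CONFLICT DO NOTHING`) is one possible reading of "upsert rather than duplicate"; the real write goes through the Drizzle schema.

```typescript
interface Chunk {
  documentId: string;
  contentHash: string;
  text: string;
}

// In-memory analog of UNIQUE (document_id, content_hash) on chunks.
class ChunkStore {
  private byKey = new Map<string, Chunk>();

  // Returns true if the chunk was new, false if the pair already existed.
  insertIfAbsent(chunk: Chunk): boolean {
    // NUL separator so ("a", "bc") and ("ab", "c") never collide.
    const key = `${chunk.documentId}\u0000${chunk.contentHash}`;
    if (this.byKey.has(key)) return false;
    this.byKey.set(key, chunk);
    return true;
  }

  count(): number {
    return this.byKey.size;
  }
}

// Replaying a unit after a crash writes nothing new.
function writeUnit(store: ChunkStore, chunks: Chunk[]): number {
  return chunks.filter((chunk) => store.insertIfAbsent(chunk)).length;
}
```

The property that matters is that `writeUnit` is safe to call any number of times for the same unit, so a redelivered message can overlap with already-written work.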

Browser-S3 upload fails partway. v1 is single-PUT (≤100MB); a failed PUT just means the client retries the whole upload. No partial state in the bucket (multipart deferred to v1.1, RT-15).

SQS message landed but never picked up (capacity exhausted). Visibility timeout + retry; cluster/concurrency scales out. CloudWatch alarm on queue depth >100 for >5 min.

Worker writes to RDS during Multi-AZ failover. Drizzle connection pool re-establishes; worker retries the transaction. SQS visibility timeout + retry covers anything that fails outright.

S3 event → SQS notification dropped. S3 guarantees at-least-once, so this effectively doesn't happen — but if it does, the documents.status='pending' row is reaped by the stuck-pending janitor (RT-18, v1), which re-enqueues rows >30 min old. This is the safety net that makes the lost-event case non-fatal.

Cache bucket misconfigured (CORS reject) — pages don't load in inspector. One CORS-fix + CloudFront invalidation is the recovery. Origin failover not in scope for v1.

Single-region (us-east-1) outage. Whole-beta outage. Explicitly accepted for a ≤10-user beta; multi-region is out of scope.

Why Browser-to-S3 + SQS, With Compute Runtime Left Open

The DATA plane is settled; the COMPUTE plane is the open blue-team decision.

Settled (the data plane): browser uploads directly to S3 via presigned PUT → S3 ObjectCreated event → SQS → [compute worker] → RDS + cache S3 → CloudFront serves /api/cache/* from S3. This bypasses the Lambda 6MB request cap and read-only filesystem in one stroke, regardless of which compute model consumes the queue. No alternative to this shape survived red-team — it's the standard, well-understood pattern.

Open (the compute plane): what consumes the SQS message and runs render → parse → structure → units → extract → embed → finalize. Red-team (2026-05-27) demoted the original "Fargate worker" thesis from chosen to one of three live options, because the pipeline already decomposes into per-section units (D27) that each finish well under any function timeout — which dissolves the 15-min-cap argument that was the main reason to prefer a long-running worker.

Live compute options (blue-team decides — BT-1)

Option	Idle cost	Cold start	Per-doc cost (beta)	Anthropic rate-limit control	Added complexity	Verdict
Fargate single worker (on-demand `run-task`)	~$1/mo	30–60s task launch	~$0.008–0.01 (~$3–8/mo)	Per-process semaphore (D27); no global ceiling — cross-task collision is a v1.1 distributed-bucket problem	Lowest — in-process loop, one container, fan-in is trivial	Live — simplest, but worst cold-start + no free global rate limit
All-Lambda (per-unit extract fan-out)	~$0/mo	~2–10s (container Lambda)	~$0.04 (~$12/mo)	Reserved concurrency = free global ceiling (cap extract-Lambda at N; SQS holds the rest)	Highest — fan-in needs a distributed completion counter (`UPDATE … RETURNING remaining_units`, last-unit-done → finalize)	Live — best cold-start + free rate limit; ~5× per-doc cost (trivial at beta); most coordination complexity
Hybrid (Lambda for parse/structure/units/embed/finalize, Fargate for extract)	~$1/mo	mixed	between the two	Fargate semaphore for extract; same v1.1 gap	Two runtimes, two deploy pipelines, two cold-start profiles	Live — best-of-both on paper, doubles operational surface

Why Lambda costs ~5× more per doc: extract is LLM-I/O-bound — the function bills GB-seconds while waiting on Haiku, multiplied across every concurrent unit invocation. Fargate's flat task rate doesn't multiply with internal concurrency. The premium is ~$4–9/mo at beta but climbs to ~$400–900/mo at 1k MAU — still dwarfed by the ~$45k/mo Anthropic spend at that volume, and the hot path would be re-optimized before then.
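The figures in this paragraph follow from the numbers already quoted in this doc, and the arithmetic can be checked directly. The sketch below assumes linear scaling from the ≤10-user beta to 1k MAU (a factor of 100), which is what the quoted totals imply; the constant names are illustrative.

```typescript
// Figures quoted elsewhere in this doc (monthly USD at beta load).
const BETA_USERS = 10;
const TARGET_USERS = 1000; // "1k MAU"
const lambdaMonthlyUsd = 12;
const fargateMonthlyUsd = { low: 3, high: 8 };
const llmMonthlyUsd = 450;
const perDocUsd = { lambda: 0.04, fargateLow: 0.008, fargateHigh: 0.01 };

// Assumption: cost scales linearly with user count.
const scale = TARGET_USERS / BETA_USERS;

// Lambda premium over Fargate at beta: smallest gap vs. largest gap.
const premiumBeta = {
  low: lambdaMonthlyUsd - fargateMonthlyUsd.high,
  high: lambdaMonthlyUsd - fargateMonthlyUsd.low,
};

// The same premium and the LLM bill, scaled to 1k MAU.
const premiumAtScale = { low: premiumBeta.low * scale, high: premiumBeta.high * scale };
const llmAtScale = llmMonthlyUsd * scale;

// Per-doc cost ratio behind the "~5x" claim.
const perDocRatio = {
  low: perDocUsd.lambda / perDocUsd.fargateHigh,
  high: perDocUsd.lambda / perDocUsd.fargateLow,
};

// Worst-case premium as a fraction of LLM spend at scale.
const premiumShareOfLlm = premiumAtScale.high / llmAtScale;

console.log({ premiumBeta, premiumAtScale, llmAtScale, perDocRatio, premiumShareOfLlm });
```

Under these assumptions the premium is $4–9/mo at beta and $400–900/mo at 1k MAU, against $45,000/mo of LLM spend, so the worst case is about 2% of the LLM bill. The per-doc ratio works out to roughly 4–5×.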

Coupled decision (RT-10): the SQS→worker trigger depends on this choice. All-Lambda uses a native SQS event-source mapping (no EventBridge, no ecs:RunTask). Fargate/hybrid use EventBridge Pipes (SQS→ECS target) or an SQS-triggered Lambda calling ecs:RunTask. Don't wire this until the compute model is picked.

Rejected outright (did not survive red-team)

Option	Why rejected
AgentCore Tools	AgentCore is for agent-driven invocations; ingestion is a fixed pipeline, not agent-shaped. Forcing it over-couples to the MCP runtime. The compound-benefit argument (`project_autri_mcp_wedge.md`) doesn't apply — ingestion writes the chunks the MCP reads, but isn't itself an MCP concern.
Step Functions orchestration	Adds a state-machine DSL + per-transition cost (~$3/mo beta, superlinear). The DB already provides stage observability via `documents.status`. Caveat surfaced by red-team: if all-Lambda fan-out wins BT-1, Step Functions Map is the natural fan-in orchestrator and its cost deserves a fresh look at that point — but as a standalone v1 orchestration layer it's rejected.
Single monolithic Lambda + extended timeout	The 15-min hard cap doesn't cover a whole-doc extract (~30 min). NB: this is the monolithic framing — the per-unit-fan-out Lambda option above is different and is NOT rejected.
Worker inside the existing MCP container	Couples ingestion failure mode to MCP availability; ingestion concurrency blows the MCP latency budget (<200ms p99). The MCP container is a query surface, not a compute surface.

Related Epics

Epic	Relationship
EPIC-4 (AWS deploy)	Predecessor. W3 web stack + AgentCore Runtime + Cognito all in place. EPIC-4.5 fits into the existing CDK app structure.
EPIC-5 (beta onboarding + cost data)	Successor. Cost telemetry (per-org Anthropic spend, Fargate-hour usage) gets surfaced in EPIC-5's cost dashboard.
D16 (Bedrock for prod LLM)	Future swap-in. Worker's `cli-client.ts`-shaped extractor abstraction is the seam — Anthropic SDK now, Bedrock later, single-file change.
D44 (Connector backend API-route)	Independent. D44's connector creation still uses Main Lambda; not touched by this sub-system.

Cross-Cutting Concerns

Concern	How this sub-system handles it
D13 multi-tenancy	Worker reads `organizationId` from the job context (set by Main Lambda at row-creation/enqueue); all DB writes scoped via `knowledge_bases.organization_id`. Cache S3 keys are org-namespaced.
D19 MCP-as-infrastructure	Worker writes the same `chunks` table the MCP server reads. Inspector hash-anchors still work post-ingestion. No MCP-side change.
D38 unified agent surface	No effect. Worker only writes; agent surfaces read.
Observability	Worker emits structured stdout JSON → CloudWatch Logs (same pattern as the MCP container's audit log). Log schema: `{ ts, severity, component, documentId, organizationId, stage, durationMs, message }`. Per-doc / per-unit / per-LLM-call spans. Retention: 30 days. Alarms: SQS depth >100 for >5 min; DLQ count >0; task/invocation failures >0/hr; worker duration >1 hr (likely stuck). v1.1: trace IDs propagated from web request through SQS.
Cost control	LLM spend is the dominant axis (D18); infra is rounding error. Worker tracks per-doc `extraction_cost_cents` (new column — plumbed in EPIC-5, not here). API-tier dependency tracked as BT-2.
D47 server-action ID drift	`/api/kb/[kbId]/upload-url` is an API route, not a server action — survives across deploys by construction.
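The log schema in the Observability row can be sketched as a typed formatter. This is an assumption-laden illustration: `LogEntry` and `formatLogLine` are hypothetical names, the `severity` values are guessed (the doc only names the field), and the real worker would write the resulting line to stdout for CloudWatch Logs to collect.

```typescript
// Assumed severity levels; the doc specifies the field but not its values.
type Severity = "debug" | "info" | "warn" | "error";

// Field names taken from the log schema in the Observability row.
interface LogEntry {
  ts: string;
  severity: Severity;
  component: string;
  documentId: string;
  organizationId: string;
  stage: string;
  durationMs: number;
  message: string;
}

// Builds one JSON line. `now` is injectable so output is reproducible in tests.
function formatLogLine(fields: Omit<LogEntry, "ts">, now: Date = new Date()): string {
  const entry: LogEntry = { ts: now.toISOString(), ...fields };
  return JSON.stringify(entry);
}

// Usage: one line per span (per-doc, per-unit, or per-LLM-call).
console.log(
  formatLogLine({
    severity: "info",
    component: "extract",
    documentId: "doc-123",
    organizationId: "org-456",
    stage: "extract",
    durationMs: 842,
    message: "unit extracted",
  }),
);
```

Typing the entry means a missing `organizationId` or `documentId` fails at compile time, which is useful given that alarms and per-org cost attribution depend on those fields being present.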

Decisions Log

Date	Decision	Status	Rationale
2026-05-27	Data plane: Browser-to-S3 presigned + SQS + [compute] → RDS + cache S3	Settled (survived red-team)	Decouples ingestion from web Lambdas; bypasses 6MB cap + RO-FS regardless of compute model. Standard pattern.
2026-05-27 → resolved 2026-05-28	Compute runtime → all-Lambda per-unit fan-out	RESOLVED (BT-1)	Blue-team chose all-Lambda over Fargate/hybrid: serverless-native once (no later migration) + free global rate ceiling via reserved concurrency; accepted fan-in complexity. Contract: `requirements/cloud-native-ingestion`.
2026-05-27	RT-11 — Main Lambda creates the `documents` row at presign time (status='pending'); S3 `ObjectCreated` event triggers the worker, which resolves the row by object key	Settled	Row exists for the full lifecycle so the UI poller always finds it. Orphan-pending rows (presign without upload) reaped by the stuck-pending janitor.
2026-05-27	RT-12 — Add `UNIQUE (document_id, content_hash)` to chunks	Settled (Phase-0 migration)	Makes "resume from `MAX(pages.id)` on retry" idempotent — partial chunks from a crashed extract dedupe on re-run rather than duplicate. Also underpins the fan-in marker design.
2026-05-27	RT-14 — Presigned URL enforces `Content-Length-Range [0,100MB]` + content-type allowlist (pdf/docx)	Settled (beta-blocker)	Without it, any signed-in user dumps arbitrary bytes into the uploads bucket. Constraint lives in the presign, not just downstream.
2026-05-27	RT-15 — Cap uploads at 100MB single-PUT for v1; defer multipart to v1.1	Settled	STEM Racing's largest file ~9MB. Drops a day of multipart-presign complexity (Initiate + per-part + Complete).
2026-05-27	RT-18 — Stuck-pending janitor ships in v1	Settled	~50 LoC cron Lambda re-enqueues `status='pending' AND created_at < NOW()-30min`. Also detects stuck-`ingesting` docs (the beta "stuck worker" alarm).
2026-05-27	RT-20 — Env-conditional at each I/O boundary; prod writes S3, dev keeps local FS	Settled	`if (env.CACHE_BUCKET) putS3() else writeLocal()` at each of the ~5-6 write sites. No new abstraction layer; corrects the earlier "single-file switch" overstatement.
2026-05-27	Q3 — Standard SQS + `documentId` idempotency (not FIFO)	Holds	Higher throughput, lower cost; worker SELECT-FOR-UPDATE handles same-doc duplicates.
2026-05-27	Q4 — Anthropic SDK (not Bedrock) for extraction in v1	Holds, with RT-13 caveat	Per D12 split. Rate-limit profile CHANGES (per-key tier, not Max session) — see BT-2. D16 Bedrock swap week 3.
2026-05-27	Q5 — Single shared cache S3 bucket, org-namespaced keys	Holds	Simplest IAM; 100-bucket-per-account default limit kills per-org buckets.
2026-05-27	Q6 — Feature-flag inline→worker cutover (`INGESTION_RUNTIME`)	Holds	Phase-3 rollback = flip to inline.
2026-05-28	RT-10 — SQS→worker trigger = native SQS event-source mapping	Resolved (follows BT-1)	All-Lambda uses native SQS event-source mappings on all three queues; no EventBridge / `ecs:RunTask`.
2026-05-28	Prompt caching deferred to v1.1; observability = logs + 2 alarms	Settled (blue-team)	SDK-direct already gives −30–50%; measure before optimizing further. Beta observability scoped to any-DLQ >0 + stuck-`ingesting` alarm.

Open Questions for Blue-Team

Blue-team ran 2026-05-28. BT-1 is resolved; the implementation contract is projects/autri/requirements/cloud-native-ingestion.

BT-1 — RESOLVED → all-Lambda per-unit fan-out. Chosen over Fargate single-worker and hybrid: serverless-native once (no Fargate→Lambda migration later) + reserved concurrency gives a free global Anthropic rate-limit ceiling. Accepted cost: fan-in coordination — idempotent per-unit markers (document_units UNIQUE) + a single-winner finalize CAS guard, specced in the requirements doc. Prompt caching deferred to v1.1; observability scoped to structured logs + 2 alarms (any-DLQ >0; doc stuck ingesting >40 min).
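The fan-in coordination accepted under BT-1 (idempotent per-unit markers plus a single-winner finalize guard) can be sketched in memory. This is not the specced design, which lives in the requirements doc: `FanInTracker` and `markUnitDone` are hypothetical names, the `Set` stands in for the `document_units` UNIQUE markers, and the boolean flag stands in for a database compare-and-set. In production the atomicity would have to come from the database, not from single-threaded JavaScript.

```typescript
class FanInTracker {
  private totalUnits: number;
  private done = new Set<string>(); // stand-in for UNIQUE per-unit markers
  private finalized = false;        // stand-in for the finalize CAS column

  constructor(totalUnits: number) {
    this.totalUnits = totalUnits;
  }

  // Records a unit as complete. Returns true for exactly one call:
  // the one that observes all units done and wins the finalize guard.
  markUnitDone(unitId: string): boolean {
    this.done.add(unitId); // duplicate deliveries are no-ops, like a UNIQUE upsert
    if (this.done.size < this.totalUnits) return false;
    if (this.finalized) return false; // someone already won the guard
    this.finalized = true;
    return true;
  }

  remainingUnits(): number {
    return this.totalUnits - this.done.size;
  }
}
```

With standard SQS (Q3) a unit message can be delivered more than once, so both properties matter: the marker keeps a redelivered unit from double-counting, and the guard keeps a redelivered last unit from running finalize twice.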

BT-2 — Action item (open). Confirm autri's Anthropic API tier in the Console; it sets the Extract Lambda's reserved concurrency (the rate ceiling). Blocks Phase-1 sizing only, not Phase 0.

v1.1 deferrals (unchanged): multipart >100MB (RT-15), distributed token bucket (RT-16 — moot under all-Lambda unless tier ceilings change), SSE progress, retry UI, mid-LLM-call cancel, cache-bucket janitor, prompt caching, render fan-out per-page-range.

Carried-forward awareness items: per-doc extraction_cost_cents (EPIC-5); org-key S3 namespacing is advisory, not IAM-enforced (RT-5 hardening).

Known Issues / Tech Debt

Issue	Severity	Notes
Local-FS-vs-S3 handled by env-conditional at each of ~5-6 write sites (RT-20)	Low	`if (env.CACHE_BUCKET) putS3() else writeLocal()` in parse / render / parse-docx / extractor / stage-files + `/api/cache`. Corrects the earlier "single-file switch" overstatement — it's a per-site boundary check, not one switch. No new abstraction layer (deliberate).
Worker image cold-start (Fargate) dominated by `pnpm deploy --prod` bundle (~1 GB)	Medium	Layer caching helps repeat deploys. Source the real number from the `mcp.autri.ai` container's measured cold start (D39) rather than the "30-60s" estimate. Pre-compile to `dist/` if telemetry shows it matters. Moot if all-Lambda wins (BT-1).
Multipart upload for files >100MB deferred to v1.1 (RT-15)	Low	Single-PUT covers all known beta files (~9MB max). Add Initiate/per-part/Complete presign API only when a real user hits the cap.
Distributed Anthropic token bucket deferred to v1.1 (RT-16)	Low	Only needed if Fargate wins BT-1 AND concurrent-doc volume climbs. Free under all-Lambda via reserved concurrency.
Per-doc cost tracking needs a `documents.extraction_cost_cents` column	Low	EPIC-5 cost dashboard depends on it; plumbed there, not here.
Cache-bucket janitor for deleted-doc caches (90-day)	Low	Needs a DB-join or S3 inventory query. v1.1.
Org-key namespacing on cache/uploads is advisory, not IAM-enforced per-org	Medium	A single worker role can read any org's keys. Fine for a single-worker beta; per-org IAM scoping is a hardening item if the RCE blast radius (RT-5) becomes a concern.
No retry UI for failed docs	Low	Dan's existing "delete + re-upload" flow covers beta. v1.1 adds a single "retry" button that re-enqueues the SQS message.
Mid-LLM-call cancel (vs mid-unit cancel)	Low	D27 already deferred this; the worker boundary doesn't change the answer. Per-unit cancel granularity is fine for beta.

Red-teamed 2026-05-27; blue-teamed 2026-05-28. Data plane settled; compute runtime (BT-1) resolved to all-Lambda per-unit fan-out. Open action item: BT-2.
