Implementation contract for EPIC-4.5 — the cloud-native rewrite of autri's ingestion + file-upload pipeline. Derived from sub-systems/cloud-native-ingestion via /hl:red-team (2026-05-27) + /hl:blue-team (2026-05-28). The sub-system doc holds the full rationale + rejected alternatives + cost model; THIS doc is the contract: MVP scope, the locked architecture, data-model + interface changes, the phased plan, and the definition of done.
Compute runtime locked to all-Lambda per-unit fan-out (BT-1, 2026-05-28). Rationale: serverless-native once (no Fargate→Lambda migration later), and reserved concurrency gives a free global Anthropic rate-limit ceiling. The accepted cost is fan-in coordination complexity — specified precisely below because it's the load-bearing risk.
Scope
MVP — must ship (EPIC-4.5 cannot unblock beta without these)
- Browser-to-S3 presigned upload; presign enforces
Content-Length-Range [0,100MB]+ content-type pin (RT-14, RT-15 single-PUT). POST /api/kb/[kbId]/upload-urlminter that also creates thedocumentsrow (RT-11), D13-scoped.- Three SQS queues (ingest / extract / finalize) + 3 DLQs; native SQS event-source mappings (RT-10 resolved — no EventBridge /
ecs:RunTask). - Three Lambda roles (Prep / Extract / Finalize) from one container image; reserved concurrency on Extract = the rate ceiling.
- Anthropic SDK extractor replacing
spawn("claude")(Q4) — the CLI cannot run in the deployed worker. - S3 cache bucket +
/api/cache/*CloudFront → S3 origin (Lambda out of the read path). - Env-conditional FS writes at each site (RT-20);
chunksUNIQUE(document_id, content_hash)(RT-12). document_unitsfan-in markers + single-winner finalize guard (see Architecture).INGESTION_RUNTIME=worker|inlinefeature flag (Q6); stuck-pending + stuck-ingesting janitor (RT-18).- Prompt caching on the SDK extractor (amended 2026-05-28 — pulled from v1.1; see Amendments). Mark the static grouping-prompt prefix (instructions + JSON tool schema + system) with
cache_control: {type:"ephemeral"}, ordered prefix-first. At Tier 1 this is a throughput lever (ITPM excludes cache reads), not just the ~10–20% cost cut; the per-unit Extract fan-out reuses the prefix within the 5-min TTL. Verify the prefix clears Haiku's 2048-token cache floor viacache_creation_input_tokens/cache_read_input_tokens; pad with static schema if short. Does NOT relieve the OTPM ceiling.
Should — in scope if time, not blocking
- DLQ replay runbook one-liner (RT-17).
- Minimal observability: structured stdout JSON logs + two alarms (any-DLQ depth >0; doc stuck in
ingesting>40 min — the janitor doubles as the detector). - Extract Lambda drains available
extract-jobsbefore exiting where the runtime allows, to amortize cold-start across bulk uploads. - Per-doc extraction cost logged (not a DB column yet — that's EPIC-5).
Deferred to v1.1
- Multipart upload >100MB (RT-15); distributed Anthropic token bucket (RT-16 — moot under all-Lambda's reserved concurrency unless tier ceilings change); full alarm matrix + trace-ID propagation; cache-bucket janitor for deleted-doc caches; retry UI; render fan-out per-page-range (only if a doc's render approaches the 15-min Prep cap).
Out of scope / other epic
- Agent-review / validation stage — REMOVED entirely (2026-05-28). The pipeline is
render → parse → structure → units → load → extract → embed → finalize → readywith no validation pass; thevalidatingstatus is dropped. No product loss:confidence_tieris set byfinalize(not validation), and the user-facing confidence-tiered review queue (D24, the inspector) is a separate surface that's unaffected. Removing it makes every stage a pure, independently testable function — which the Lambda decomposition wants anyway. - Bedrock extraction (D16, week 3); multi-region;
documents.extraction_cost_centscolumn (EPIC-5); per-org IAM-enforced S3 namespacing (hardening, v1.1); SSE real-time progress.
Locked Architecture — all-Lambda per-unit fan-out
The data plane is unchanged from the design doc (browser→S3 presigned → SQS → worker → RDS + cache S3; CloudFront serves /api/cache from S3). BT-1 resolved the compute plane to all-Lambda: the pipeline is decomposed into three Lambda roles connected by SQS, with per-unit fan-out on extract.
S3 ObjectCreated ─► SQS(ingest-jobs) ─► [PREP Lambda]
render → parse → structure → units → load
writes documents{status, units_total}
enqueues 1 extract-job PER UNIT
│
▼
SQS(extract-jobs)
(reserved concurrency = R ◄── global rate ceiling)
│
┌────────────┴────────────┐ (N units, fanned out)
▼ ▼
[EXTRACT Lambda] [EXTRACT Lambda] ...
group atoms → chunks (SDK)
dedup via UNIQUE(doc,content_hash)
upsert document_units marker (idempotent)
if remaining==0 → CAS-guard → enqueue finalize
│
▼
SQS(finalize-jobs)
│
▼
[FINALIZE Lambda]
link-sections → embed → finalize → status='ready'
One container image, three handlers (Prep / Extract / Finalize) — mirrors the D40 MCP-container Dockerfile pattern (pnpm deploy --prod + tsx).
PREP — consumes ingest-jobs (from the S3 event). Runs render → parse → structure → units → load. Resolves the documents row by object key (created at presign, RT-11), sets units_total = N, enqueues one extract-job per unit. Fits the 15-min Lambda cap for beta doc sizes (≤100MB; ≤~800pp render < ~10 min). If render ever exceeds, fan render out per-page-range (v1.1).
EXTRACT — consumes extract-jobs, reserved concurrency = R (the free global rate ceiling; R derived from BT-2's API tier). Processes ONE unit: groups atoms → chunks via the SDK extractor, writes chunks (dedup on UNIQUE(document_id, content_hash)). Then runs the fan-in step (below).
FINALIZE — consumes finalize-jobs (enqueued exactly once by the extract invocation that completes the last unit). Runs link-sections → embed → finalize → status='ready'.
Fan-in — the load-bearing mechanism (spec precisely)
SQS is at-least-once, so a blind units_done++ over-counts on redelivery and finalizes early. Required design:
- Per-unit idempotent marker, not a counter. New table
document_units (document_id, unit_id, extracted_at, UNIQUE(document_id, unit_id)). The Extract Lambda upserts its unit's markerON CONFLICT DO NOTHINGafter writing chunks. - Completion check by count of distinct markers:
remaining = units_total − (SELECT count(*) FROM document_units WHERE document_id = X AND extracted_at IS NOT NULL). - Single-winner finalize guard (CAS): when
remaining = 0, runUPDATE documents SET finalize_enqueued_at = now() WHERE id = X AND finalize_enqueued_at IS NULL RETURNING id. Only the row-returning invocation enqueues the finalize-job. This makes finalize exactly-once even if two units finish concurrently or a unit redelivers. - Test obligation: a unit test MUST simulate a duplicated/redelivered extract-job and assert (a) no duplicate chunks, (b) finalize enqueued exactly once.
Settled decisions (reference — rationale in the sub-system doc)
| ID | Decision |
|---|---|
| BT-1 | Compute runtime = all-Lambda per-unit fan-out |
| RT-10 | SQS→worker trigger = native SQS event-source mapping (no EventBridge/ECS) |
| RT-11 | Main Lambda creates documents row at presign; worker resolves by object key |
| RT-12 | chunks UNIQUE(document_id, content_hash) for idempotent retry |
| RT-14 | Presign enforces content-length ≤100MB + content-type allowlist (pdf/docx) |
| RT-15 | Single-PUT only for v1; multipart → v1.1 |
| RT-18 | Stuck-pending janitor ships in v1 |
| RT-20 | Env-conditional FS at each write site; prod=S3, dev=local |
| Q3 | Standard SQS + documentId idempotency |
| Q4 | Anthropic SDK extractor in v1 (Bedrock = D16, week 3) |
| Q5 | Single shared cache S3 bucket, org-namespaced keys |
| Q6 | Feature-flag inline→worker cutover |
Data model changes
chunks— addcontent_hashcolumn if not present (SHA-256 of chunk content, computed on write) +UNIQUE(document_id, content_hash)(RT-12).documents— addunits_total INT,finalize_enqueued_at TIMESTAMPTZ(fan-in). status/progress columns already exist.document_units(new) —(document_id UUID, unit_id TEXT, extracted_at TIMESTAMPTZ, UNIQUE(document_id, unit_id)).
Interfaces / contracts
| Producer | Consumer | Interface | Shape |
|---|---|---|---|
| Browser | Main Lambda | POST /api/kb/[kbId]/upload-url | { filename, fileSize } → { uploadUrl, objectKey, documentId } |
| Browser | S3 uploads | Presigned PUT (≤100MB, content-type pinned) | binary |
| S3 | SQS ingest-jobs | ObjectCreated | { documentId, organizationId, kbId, sourceObjectKey, sourceExt, extractorModel } |
| Prep | SQS extract-jobs | enqueue per unit | { documentId, organizationId, kbId, unitId, extractorModel } |
| Extract | SQS finalize-jobs | enqueue once (CAS-guarded) | { documentId, organizationId, kbId } |
| Worker | RDS | INSERT/UPDATE documents, pages, chunks, document_units | Drizzle schema (+ new cols) |
| Worker | S3 cache | PutObject | page PNGs + parse + groupings |
CloudFront /api/cache/* | S3 cache | GET origin | binary / JSON |
Implementation plan (phased — each independently deployable + reversible)
All pipeline infra lives in a dedicated AutriIngestion CDK stack inside autri-infra (not a separate app — see § Deployment Topology & Reversibility), deployable in isolation via cdk deploy AutriIngestion.
- Phase 0 (CDK + migrations): the
AutriIngestionstack — 3 SQS queues + 3 DLQs, IAM roles (Extract: SDK secret + cache write + RDS; Prep: uploads read + cache write + RDS; Finalize: RDS + cache + OpenAI egress), the janitor Lambda + its schedule — plus a small touch on the web stack for the presign IAM grant on Main Lambda. Schema migrations run through the existingdb/migration path (not a new CDK custom resource): chunkscontent_hash+ UNIQUE,documentscolumns,document_unitstable. (Native SQS event-source mappings move to Phase 1 — they require the Prep/Extract/Finalize functions, which don't exist until the worker image; see Amendments.) - Phase 1 (worker image + functions): build + push the container image; register Prep/Extract/Finalize functions + native SQS event-source mappings + Extract reserved concurrency = R. Implement the SDK extractor (lift
cli-client.tsinterface; replacespawn("claude")withclient.messages.create+ tool-use loop; port the D31 circuit breaker to SDK error codes; mark the static prompt prefix withcache_controlper the MVP prompt-caching item). - Phase 2 (cache read path):
/api/cache/*CloudFront behavior → S3 origin. This change lives in the web stack (it owns the distribution), pointing at the cache bucket (in NetworkAndData). One invalidation; empty bucket regenerates from worker output. - Phase 3 (cutover): replace Main Lambda's inline
stageFiles+ fire-and-forgetrunIngestionPipelinewith presign + row-create + the S3-event path, behindINGESTION_RUNTIME=worker|inline. Flip-to-inline is the rollback.
Deployment Topology & Reversibility
Pipeline infra = a dedicated AutriIngestion stack inside the autri-infra CDK app — not a separate app/repo. The pipeline shares VPC, RDS, the uploads + cache buckets, and the Anthropic secret (all in NetworkAndData), and the cache-read path touches the web CloudFront distribution. A separate CDK app would force brittle cross-app ARN export/import wiring for every one of those dependencies; a separate stack gives the same wins without it:
- Isolated, fast iteration —
cdk deploy AutriIngestionsynthesizes + deploys only the pipeline. - Blast-radius safety — a bad pipeline deploy can't break the web or auth stacks (separate CloudFormation stacks).
- Type-safe shared refs — VPC / RDS / buckets / secret come in as construct references, not hand-wired ARNs.
- Matches how
autri-infrais already organized (per-concern stacks: NetworkAndData, web, auth-and-compute).
Ownership seams (each stack owns its own resources):
AutriIngestion: SQS queues + DLQs, the 3 Lambda functions + event-source mappings + reserved concurrency, the janitor, pipeline IAM roles.- NetworkAndData: VPC, RDS, uploads + cache buckets, Anthropic secret.
- Web stack: the
/api/cache/*CloudFront behavior (it owns the distribution) + the presign IAM grant on Main Lambda. - Schema migrations: the existing
db/path, not CDK custom resources.
Stages are pure functions; the runtime is a thin shell (this is what keeps the Fargate pivot cheap). Each stage is a pure function — runRender(doc), runParse(doc), … runExtract(unit), runFinalize(doc) — with no Lambda/SQS coupling inside it. The Lambda handlers (and the janitor) are thin shells: SQS wiring, the fan-in marker writes, and status updates live in the shell, not the stage. Two payoffs:
- Testability — every stage is unit-testable with no AWS in the loop (and the fan-in test in the DoD lives at the shell layer).
- Cheap Fargate pivot (pre-authorized fallback) — if the all-Lambda fan-in proves too painful in practice, pivot back to a single Fargate worker by swapping the shell for an in-process loop that calls the same stage functions. No stage rewrite; the data plane (S3 + SQS + RDS + cache) is unchanged;
INGESTION_RUNTIME=worker|inlineis the cutover seam. Take the Lambda path, and if it fights us, fall back without losing the stage work.
Definition of Done
- A PDF and a DOCX uploaded via the browser both reach
status='ready'; chunks are queryable; page renders serve from S3 in the inspector. - D13: presign rejects a
kbIdthe caller's org doesn't own. - Idempotency test passes: a redelivered extract-job produces no duplicate chunks and finalize fires exactly once.
- A forced extract failure DLQs after 3 receives + alarms; the runbook one-liner replays it.
- A doc whose units never complete is re-surfaced by the janitor within ~40 min.
- Per-doc extraction cost is visible in logs (feeds EPIC-5 cost telemetry).
Open action items
- BT-2 — RESOLVED (2026-05-28): autri's Anthropic org is Tier 1. Haiku limits: 50 RPM, 50K ITPM, 10K OTPM. The binding constraint for extract is tokens-per-minute, not requests — and OTPM (10K) is the harder ceiling (each call emits a groupings JSON; caching can't relieve output). So Extract Lambda reserved concurrency
Rstarts low (~3) and tunes up from telemetry, nowhere near 50 RPM. The all-Lambda reserved-concurrency design absorbs this: overflow waits in SQS, no rate-limit errors blow up docs. - Tier 2 (auto-unlocks at $40 cumulative spend — beta crosses it fast) lifts both ceilings materially. Re-tune
Rupward when it does. - Credit top-up (new): org balance ~$2.30 ≈ one Bible-sized doc ($2–3). Top up / enable auto-reload before EPIC-4.5 Phase-1 ingestion testing. Phase 0 (CDK + migrations) needs no credits.
- Prompt caching for MVP — RESOLVED (2026-05-28): pulled from v1.1 into MVP (Dan's call at kickoff). The Tier-1 ITPM limit excludes cache reads, so caching the static grouping-prompt prefix is a throughput lever (relieves the 50K ITPM ceiling), on top of the ~10–20% cost cut; it does NOT relieve the 10K OTPM ceiling. Implemented in Phase 1 when the SDK extractor is written fresh (cheap to add then — ~1–2 lines). Caveat: verify the static prefix clears Haiku's 2048-token cache floor. See § Amendments + MVP scope.
Blue-teamed 2026-05-28. Compute locked (all-Lambda). BT-2 resolved (Tier 1). Prompt caching amended into MVP. Ready for EPIC-4.5 implementation, Phase 0 first.
Amendments
Scope changes after the contract was first locked. Logged here per methodology (amendments are visible, not silent).
| Date | Amendment | Rationale |
|---|---|---|
| 2026-05-28 | Prompt caching on the SDK extractor pulled from v1.1 → MVP. | At Tier 1, ITPM excludes cache reads, so caching the static grouping-prompt prefix is a throughput lever (relieves the 50K ITPM ceiling), not just the ~10–20% cost cut. The SDK extractor is written fresh in Phase 1, so cache_control is ~1–2 lines added then — cheap now, not a separate pass later. Caveat: verify the static prefix clears Haiku's 2048-token cache floor (cache_creation_input_tokens / cache_read_input_tokens); does NOT relieve OTPM. Dan's call at EPIC-4.5 kickoff. |
| 2026-05-28 | Native SQS event-source mappings moved Phase 0 → Phase 1. | An event-source mapping needs a target Lambda; the Prep/Extract/Finalize functions don't exist until the Phase 1 worker image. Phase 0 ships the queues + DLQs + IAM roles + janitor; Phase 1 attaches the mappings when the functions register. Sequencing only, not a scope change. |
| 2026-05-28 | RT-12 chunk idempotency: content_hash UNIQUE → per-unit replace (migration 009). | content_hash as the idempotency key is doubly wrong: (1) a redelivered extract-job re-runs the non-deterministic LLM, so the re-extracted chunks differ → the UNIQUE never catches the duplicate (it fails at its job); (2) legitimate duplicate-content chunks (refrains, boilerplate, identical table rows) false-collide → silent drop or doc failure. Fix: drop the UNIQUE, add chunks.unit_index, and have the Extract worker DELETE a unit's chunks before re-inserting — idempotent regardless of LLM output, no false collisions. content_hash stays as a non-unique column for embed change-detection. The related document-level versioning + incremental re-ingestion concern is drafted in sub-systems/incremental-re-ingestion (v1.1). |