Foundry Foundry

Implementation contract for EPIC-4.5 — the cloud-native rewrite of autri's ingestion + file-upload pipeline. Derived from sub-systems/cloud-native-ingestion via /hl:red-team (2026-05-27) + /hl:blue-team (2026-05-28). The sub-system doc holds the full rationale + rejected alternatives + cost model; THIS doc is the contract: MVP scope, the locked architecture, data-model + interface changes, the phased plan, and the definition of done.

Compute runtime locked to all-Lambda per-unit fan-out (BT-1, 2026-05-28). Rationale: serverless-native once (no Fargate→Lambda migration later), and reserved concurrency gives a free global Anthropic rate-limit ceiling. The accepted cost is fan-in coordination complexity — specified precisely below because it's the load-bearing risk.


Scope

MVP — must ship (EPIC-4.5 cannot unblock beta without these)

  • Browser-to-S3 presigned upload; presign enforces Content-Length-Range [0,100MB] + content-type pin (RT-14, RT-15 single-PUT).
  • POST /api/kb/[kbId]/upload-url minter that also creates the documents row (RT-11), D13-scoped.
  • Three SQS queues (ingest / extract / finalize) + 3 DLQs; native SQS event-source mappings (RT-10 resolved — no EventBridge / ecs:RunTask).
  • Three Lambda roles (Prep / Extract / Finalize) from one container image; reserved concurrency on Extract = the rate ceiling.
  • Anthropic SDK extractor replacing spawn("claude") (Q4) — the CLI cannot run in the deployed worker.
  • S3 cache bucket + /api/cache/* CloudFront → S3 origin (Lambda out of the read path).
  • Env-conditional FS writes at each site (RT-20); chunks UNIQUE(document_id, content_hash) (RT-12).
  • document_units fan-in markers + single-winner finalize guard (see Architecture).
  • INGESTION_RUNTIME=worker|inline feature flag (Q6); stuck-pending + stuck-ingesting janitor (RT-18).
  • Prompt caching on the SDK extractor (amended 2026-05-28 — pulled from v1.1; see Amendments). Mark the static grouping-prompt prefix (instructions + JSON tool schema + system) with cache_control: {type:"ephemeral"}, ordered prefix-first. At Tier 1 this is a throughput lever (ITPM excludes cache reads), not just the ~10–20% cost cut; the per-unit Extract fan-out reuses the prefix within the 5-min TTL. Verify the prefix clears Haiku's 2048-token cache floor via cache_creation_input_tokens / cache_read_input_tokens; pad with static schema if short. Does NOT relieve the OTPM ceiling.

Should — in scope if time, not blocking

  • DLQ replay runbook one-liner (RT-17).
  • Minimal observability: structured stdout JSON logs + two alarms (any-DLQ depth >0; doc stuck in ingesting >40 min — the janitor doubles as the detector).
  • Extract Lambda drains available extract-jobs before exiting where the runtime allows, to amortize cold-start across bulk uploads.
  • Per-doc extraction cost logged (not a DB column yet — that's EPIC-5).

Deferred to v1.1

  • Multipart upload >100MB (RT-15); distributed Anthropic token bucket (RT-16 — moot under all-Lambda's reserved concurrency unless tier ceilings change); full alarm matrix + trace-ID propagation; cache-bucket janitor for deleted-doc caches; retry UI; render fan-out per-page-range (only if a doc's render approaches the 15-min Prep cap).

Out of scope / other epic

  • Agent-review / validation stage — REMOVED entirely (2026-05-28). The pipeline is render → parse → structure → units → load → extract → embed → finalize → ready with no validation pass; the validating status is dropped. No product loss: confidence_tier is set by finalize (not validation), and the user-facing confidence-tiered review queue (D24, the inspector) is a separate surface that's unaffected. Removing it makes every stage a pure, independently testable function — which the Lambda decomposition wants anyway.
  • Bedrock extraction (D16, week 3); multi-region; documents.extraction_cost_cents column (EPIC-5); per-org IAM-enforced S3 namespacing (hardening, v1.1); SSE real-time progress.

Locked Architecture — all-Lambda per-unit fan-out

The data plane is unchanged from the design doc (browser→S3 presigned → SQS → worker → RDS + cache S3; CloudFront serves /api/cache from S3). BT-1 resolved the compute plane to all-Lambda: the pipeline is decomposed into three Lambda roles connected by SQS, with per-unit fan-out on extract.

S3 ObjectCreated ─► SQS(ingest-jobs) ─► [PREP Lambda]
                                          render → parse → structure → units → load
                                          writes documents{status, units_total}
                                          enqueues 1 extract-job PER UNIT
                                                    │
                                                    ▼
                                         SQS(extract-jobs)
                                          (reserved concurrency = R  ◄── global rate ceiling)
                                                    │
                                       ┌────────────┴────────────┐  (N units, fanned out)
                                       ▼                         ▼
                                 [EXTRACT Lambda]          [EXTRACT Lambda] ...
                                  group atoms → chunks (SDK)
                                  dedup via UNIQUE(doc,content_hash)
                                  upsert document_units marker (idempotent)
                                  if remaining==0 → CAS-guard → enqueue finalize
                                                    │
                                                    ▼
                                         SQS(finalize-jobs)
                                                    │
                                                    ▼
                                          [FINALIZE Lambda]
                                           link-sections → embed → finalize → status='ready'

One container image, three handlers (Prep / Extract / Finalize) — mirrors the D40 MCP-container Dockerfile pattern (pnpm deploy --prod + tsx).

PREP — consumes ingest-jobs (from the S3 event). Runs render → parse → structure → units → load. Resolves the documents row by object key (created at presign, RT-11), sets units_total = N, enqueues one extract-job per unit. Fits the 15-min Lambda cap for beta doc sizes (≤100MB; ≤~800pp render < ~10 min). If render ever exceeds, fan render out per-page-range (v1.1).

EXTRACT — consumes extract-jobs, reserved concurrency = R (the free global rate ceiling; R derived from BT-2's API tier). Processes ONE unit: groups atoms → chunks via the SDK extractor, writes chunks (dedup on UNIQUE(document_id, content_hash)). Then runs the fan-in step (below).

FINALIZE — consumes finalize-jobs (enqueued exactly once by the extract invocation that completes the last unit). Runs link-sections → embed → finalize → status='ready'.

Fan-in — the load-bearing mechanism (spec precisely)

SQS is at-least-once, so a blind units_done++ over-counts on redelivery and finalizes early. Required design:

  1. Per-unit idempotent marker, not a counter. New table document_units (document_id, unit_id, extracted_at, UNIQUE(document_id, unit_id)). The Extract Lambda upserts its unit's marker ON CONFLICT DO NOTHING after writing chunks.
  2. Completion check by count of distinct markers: remaining = units_total − (SELECT count(*) FROM document_units WHERE document_id = X AND extracted_at IS NOT NULL).
  3. Single-winner finalize guard (CAS): when remaining = 0, run UPDATE documents SET finalize_enqueued_at = now() WHERE id = X AND finalize_enqueued_at IS NULL RETURNING id. Only the row-returning invocation enqueues the finalize-job. This makes finalize exactly-once even if two units finish concurrently or a unit redelivers.
  4. Test obligation: a unit test MUST simulate a duplicated/redelivered extract-job and assert (a) no duplicate chunks, (b) finalize enqueued exactly once.

Settled decisions (reference — rationale in the sub-system doc)

IDDecision
BT-1Compute runtime = all-Lambda per-unit fan-out
RT-10SQS→worker trigger = native SQS event-source mapping (no EventBridge/ECS)
RT-11Main Lambda creates documents row at presign; worker resolves by object key
RT-12chunks UNIQUE(document_id, content_hash) for idempotent retry
RT-14Presign enforces content-length ≤100MB + content-type allowlist (pdf/docx)
RT-15Single-PUT only for v1; multipart → v1.1
RT-18Stuck-pending janitor ships in v1
RT-20Env-conditional FS at each write site; prod=S3, dev=local
Q3Standard SQS + documentId idempotency
Q4Anthropic SDK extractor in v1 (Bedrock = D16, week 3)
Q5Single shared cache S3 bucket, org-namespaced keys
Q6Feature-flag inline→worker cutover

Data model changes

  • chunks — add content_hash column if not present (SHA-256 of chunk content, computed on write) + UNIQUE(document_id, content_hash) (RT-12).
  • documents — add units_total INT, finalize_enqueued_at TIMESTAMPTZ (fan-in). status/progress columns already exist.
  • document_units (new) — (document_id UUID, unit_id TEXT, extracted_at TIMESTAMPTZ, UNIQUE(document_id, unit_id)).

Interfaces / contracts

ProducerConsumerInterfaceShape
BrowserMain LambdaPOST /api/kb/[kbId]/upload-url{ filename, fileSize } → { uploadUrl, objectKey, documentId }
BrowserS3 uploadsPresigned PUT (≤100MB, content-type pinned)binary
S3SQS ingest-jobsObjectCreated{ documentId, organizationId, kbId, sourceObjectKey, sourceExt, extractorModel }
PrepSQS extract-jobsenqueue per unit{ documentId, organizationId, kbId, unitId, extractorModel }
ExtractSQS finalize-jobsenqueue once (CAS-guarded){ documentId, organizationId, kbId }
WorkerRDSINSERT/UPDATE documents, pages, chunks, document_unitsDrizzle schema (+ new cols)
WorkerS3 cachePutObjectpage PNGs + parse + groupings
CloudFront /api/cache/*S3 cacheGET originbinary / JSON

Implementation plan (phased — each independently deployable + reversible)

All pipeline infra lives in a dedicated AutriIngestion CDK stack inside autri-infra (not a separate app — see § Deployment Topology & Reversibility), deployable in isolation via cdk deploy AutriIngestion.

  • Phase 0 (CDK + migrations): the AutriIngestion stack — 3 SQS queues + 3 DLQs, IAM roles (Extract: SDK secret + cache write + RDS; Prep: uploads read + cache write + RDS; Finalize: RDS + cache + OpenAI egress), the janitor Lambda + its schedule — plus a small touch on the web stack for the presign IAM grant on Main Lambda. Schema migrations run through the existing db/ migration path (not a new CDK custom resource): chunks content_hash + UNIQUE, documents columns, document_units table. (Native SQS event-source mappings move to Phase 1 — they require the Prep/Extract/Finalize functions, which don't exist until the worker image; see Amendments.)
  • Phase 1 (worker image + functions): build + push the container image; register Prep/Extract/Finalize functions + native SQS event-source mappings + Extract reserved concurrency = R. Implement the SDK extractor (lift cli-client.ts interface; replace spawn("claude") with client.messages.create + tool-use loop; port the D31 circuit breaker to SDK error codes; mark the static prompt prefix with cache_control per the MVP prompt-caching item).
  • Phase 2 (cache read path): /api/cache/* CloudFront behavior → S3 origin. This change lives in the web stack (it owns the distribution), pointing at the cache bucket (in NetworkAndData). One invalidation; empty bucket regenerates from worker output.
  • Phase 3 (cutover): replace Main Lambda's inline stageFiles + fire-and-forget runIngestionPipeline with presign + row-create + the S3-event path, behind INGESTION_RUNTIME=worker|inline. Flip-to-inline is the rollback.

Deployment Topology & Reversibility

Pipeline infra = a dedicated AutriIngestion stack inside the autri-infra CDK app — not a separate app/repo. The pipeline shares VPC, RDS, the uploads + cache buckets, and the Anthropic secret (all in NetworkAndData), and the cache-read path touches the web CloudFront distribution. A separate CDK app would force brittle cross-app ARN export/import wiring for every one of those dependencies; a separate stack gives the same wins without it:

  • Isolated, fast iterationcdk deploy AutriIngestion synthesizes + deploys only the pipeline.
  • Blast-radius safety — a bad pipeline deploy can't break the web or auth stacks (separate CloudFormation stacks).
  • Type-safe shared refs — VPC / RDS / buckets / secret come in as construct references, not hand-wired ARNs.
  • Matches how autri-infra is already organized (per-concern stacks: NetworkAndData, web, auth-and-compute).

Ownership seams (each stack owns its own resources):

  • AutriIngestion: SQS queues + DLQs, the 3 Lambda functions + event-source mappings + reserved concurrency, the janitor, pipeline IAM roles.
  • NetworkAndData: VPC, RDS, uploads + cache buckets, Anthropic secret.
  • Web stack: the /api/cache/* CloudFront behavior (it owns the distribution) + the presign IAM grant on Main Lambda.
  • Schema migrations: the existing db/ path, not CDK custom resources.

Stages are pure functions; the runtime is a thin shell (this is what keeps the Fargate pivot cheap). Each stage is a pure function — runRender(doc), runParse(doc), … runExtract(unit), runFinalize(doc) — with no Lambda/SQS coupling inside it. The Lambda handlers (and the janitor) are thin shells: SQS wiring, the fan-in marker writes, and status updates live in the shell, not the stage. Two payoffs:

  • Testability — every stage is unit-testable with no AWS in the loop (and the fan-in test in the DoD lives at the shell layer).
  • Cheap Fargate pivot (pre-authorized fallback) — if the all-Lambda fan-in proves too painful in practice, pivot back to a single Fargate worker by swapping the shell for an in-process loop that calls the same stage functions. No stage rewrite; the data plane (S3 + SQS + RDS + cache) is unchanged; INGESTION_RUNTIME=worker|inline is the cutover seam. Take the Lambda path, and if it fights us, fall back without losing the stage work.

Definition of Done

  • A PDF and a DOCX uploaded via the browser both reach status='ready'; chunks are queryable; page renders serve from S3 in the inspector.
  • D13: presign rejects a kbId the caller's org doesn't own.
  • Idempotency test passes: a redelivered extract-job produces no duplicate chunks and finalize fires exactly once.
  • A forced extract failure DLQs after 3 receives + alarms; the runbook one-liner replays it.
  • A doc whose units never complete is re-surfaced by the janitor within ~40 min.
  • Per-doc extraction cost is visible in logs (feeds EPIC-5 cost telemetry).

Open action items

  • BT-2 — RESOLVED (2026-05-28): autri's Anthropic org is Tier 1. Haiku limits: 50 RPM, 50K ITPM, 10K OTPM. The binding constraint for extract is tokens-per-minute, not requests — and OTPM (10K) is the harder ceiling (each call emits a groupings JSON; caching can't relieve output). So Extract Lambda reserved concurrency R starts low (~3) and tunes up from telemetry, nowhere near 50 RPM. The all-Lambda reserved-concurrency design absorbs this: overflow waits in SQS, no rate-limit errors blow up docs.
  • Tier 2 (auto-unlocks at $40 cumulative spend — beta crosses it fast) lifts both ceilings materially. Re-tune R upward when it does.
  • Credit top-up (new): org balance ~$2.30 ≈ one Bible-sized doc ($2–3). Top up / enable auto-reload before EPIC-4.5 Phase-1 ingestion testing. Phase 0 (CDK + migrations) needs no credits.
  • Prompt caching for MVP — RESOLVED (2026-05-28): pulled from v1.1 into MVP (Dan's call at kickoff). The Tier-1 ITPM limit excludes cache reads, so caching the static grouping-prompt prefix is a throughput lever (relieves the 50K ITPM ceiling), on top of the ~10–20% cost cut; it does NOT relieve the 10K OTPM ceiling. Implemented in Phase 1 when the SDK extractor is written fresh (cheap to add then — ~1–2 lines). Caveat: verify the static prefix clears Haiku's 2048-token cache floor. See § Amendments + MVP scope.

Blue-teamed 2026-05-28. Compute locked (all-Lambda). BT-2 resolved (Tier 1). Prompt caching amended into MVP. Ready for EPIC-4.5 implementation, Phase 0 first.

Amendments

Scope changes after the contract was first locked. Logged here per methodology (amendments are visible, not silent).

DateAmendmentRationale
2026-05-28Prompt caching on the SDK extractor pulled from v1.1 → MVP.At Tier 1, ITPM excludes cache reads, so caching the static grouping-prompt prefix is a throughput lever (relieves the 50K ITPM ceiling), not just the ~10–20% cost cut. The SDK extractor is written fresh in Phase 1, so cache_control is ~1–2 lines added then — cheap now, not a separate pass later. Caveat: verify the static prefix clears Haiku's 2048-token cache floor (cache_creation_input_tokens / cache_read_input_tokens); does NOT relieve OTPM. Dan's call at EPIC-4.5 kickoff.
2026-05-28Native SQS event-source mappings moved Phase 0 → Phase 1.An event-source mapping needs a target Lambda; the Prep/Extract/Finalize functions don't exist until the Phase 1 worker image. Phase 0 ships the queues + DLQs + IAM roles + janitor; Phase 1 attaches the mappings when the functions register. Sequencing only, not a scope change.
2026-05-28RT-12 chunk idempotency: content_hash UNIQUE → per-unit replace (migration 009).content_hash as the idempotency key is doubly wrong: (1) a redelivered extract-job re-runs the non-deterministic LLM, so the re-extracted chunks differ → the UNIQUE never catches the duplicate (it fails at its job); (2) legitimate duplicate-content chunks (refrains, boilerplate, identical table rows) false-collide → silent drop or doc failure. Fix: drop the UNIQUE, add chunks.unit_index, and have the Extract worker DELETE a unit's chunks before re-inserting — idempotent regardless of LLM output, no false collisions. content_hash stays as a non-unique column for embed change-detection. The related document-level versioning + incremental re-ingestion concern is drafted in sub-systems/incremental-re-ingestion (v1.1).

Review

🔒

Enter your access token to view annotations