Foundry

Epic: Batch Ingestion (async grouping via the Message Batches API)

DRAFT — for red-team next session. Routes llm-grouped ingestion units through the Anthropic Message Batches API to cut ingest cost ~50% and (at our current rate-limit tier) reduce latency. Validated in principle by the S6 A/B; this epic builds the path. Drafted 2026-06-06.

Overview

What we're building, why it's worth it, and what we're deliberately leaving out.

Goals & Non-Goals

Goals:

  • Send llm-routed units (prose/verse grouping, and figure-vision) through the Message Batches API instead of synchronous per-unit calls.
  • Realize the S6-measured win: ~50% cost cut, and roughly 2–3× lower latency for grouping-heavy docs (201s → 91s and 192s → 61s in S6). The latency win is structural: batch runs on a separate rate-limit bucket, bypassing our Tier-1 ITPM ceiling.
  • Local/CLI path first (validate the full ingest-through-batch flow on real docs end-to-end), then production async orchestration.
  • Keep chunking output byte-identical — this is a scheduling/billing change, not a quality change.

Non-Goals:

  • Not batching code-routed units (deterministic, instant, ~$0 — nothing to gain).
  • Not prompt caching (S6: inert below the 4096 floor, doesn't stack with batch — skipped).
  • Not the small-batches-for-progress experiment (future; this epic does one batch per doc).
  • Not changing the router's code vs llm decision — batch is a new sink for llm units, not a new routing rule.

Problem Statement

S6 proved batch is ~50% cheaper and (at Tier 1) faster for llm-grouped docs — but the win requires a different control flow than today's synchronous per-unit fan-out. Today each unit is its own SQS job that makes a blocking LLM call inline. Batch is submit → poll → process-all: you can't make a blocking call, and results aren't retrievable until the whole batch ends. This epic builds that async path without regressing the synchronous one (which stays the right choice for code-routed and small docs).

What Is This Epic?

A new asynchronous ingestion lane: collect a doc's llm-routed units, submit them as one Message Batch, poll for completion, then run the existing post-LLM processing (parse grouping → write chunks) over the results, then finalize. The cost-accounting groundwork already shipped in the cost-observability epic (pricing.ts BATCH_MULT + computeIngestCost({batch:true}) + the per-doc cost columns), so the dollars side is ready; the work is the orchestration.

Context

How this sits on top of the current pipeline and what it touches.

Dependents

  • Pricing & plan design (Lever C, north-star): lower ingest COGS directly changes how many docs a tier can profitably allow. Batch is a primary COGS lever.

Dependencies

  • Cost-observability epic (SHIPPED): provides batch-rate cost accounting (computeIngestCost({batch:true})) and the per-doc cost columns the batch path writes.
  • route.ts code vs llm (shipped): determines which units are batch-eligible.
  • The ingestion worker (prep → extract → finalize) and extractStructuralUnit — the synchronous path batch must compose with, not replace.

Current State

  • Synchronous only: prep enqueues one SQS extract-job per unit; extract calls extractStructuralUnit (blocking LLM call inline, writes chunks, atomic cost+progress increments, fan-in marker → tryClaimFinalize); finalize links + embeds + computes cost.
  • S6 A/B (real-path replay, 2026-06-06): docx novel, 26 calls, $0.683/201s → batch $0.358/91s; Genesis, 45 calls, $0.780/192s → $0.386/61s; SRWF26 structured, 10 calls, $0.093/23s → $0.046/45s. Cost drops ~50% across the board. Latency improves on the two grouping-heavy docs (about 2.2× and 3.1× faster) but roughly doubles on the small structured doc (23s → 45s). See docs/s6-batch-caching-spike.md.
  • Cost columns + BATCH_MULT already shipped — the batch path records computeIngestCost({batch:true}).
  • Progress today: progress_done/progress_total bumped per unit; PipelineLive polls every 2.5s and renders a filling bar. Batch can't drive an incremental bar (see Edge Cases).
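
The S6 figures above reduce to a per-doc cost cut and speedup. A small sketch of that arithmetic, using only the numbers quoted in this section; the type and function names are illustrative and not taken from the codebase:

```typescript
// S6 A/B figures as quoted above. All names here are illustrative only.
interface AbRun {
  doc: string;
  syncCost: number;  // USD
  syncSecs: number;
  batchCost: number; // USD
  batchSecs: number;
}

const s6Runs: AbRun[] = [
  { doc: "docx novel", syncCost: 0.683, syncSecs: 201, batchCost: 0.358, batchSecs: 91 },
  { doc: "Genesis", syncCost: 0.78, syncSecs: 192, batchCost: 0.386, batchSecs: 61 },
  { doc: "SRWF26 structured", syncCost: 0.093, syncSecs: 23, batchCost: 0.046, batchSecs: 45 },
];

// Fraction of the sync cost saved by batch (0.5 means half price).
function costCut(r: AbRun): number {
  return 1 - r.batchCost / r.syncCost;
}

// Speedup > 1 means batch finished sooner; < 1 means batch was slower.
function speedup(r: AbRun): number {
  return r.syncSecs / r.batchSecs;
}

for (const r of s6Runs) {
  console.log(
    `${r.doc}: cost cut ${(costCut(r) * 100).toFixed(1)}%, speedup ${speedup(r).toFixed(2)}x`
  );
}
```

On these inputs the cost cut lands between roughly 48% and 51% for all three docs, while the speedup is above 2× for the two grouping-heavy docs and about 0.5× (slower) for the 10-call structured doc.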

Affected Systems

| System / Layer | How It's Affected |
| --- | --- |
| extractStructuralUnit (extractor.ts) | Refactor into build-request / make-call / process-result so batch results can drive process-result (the chunk-writing) without re-calling the model |
| sdk-client.ts | Expose the request-params builder (already mostly factored) so the batch submitter reuses the exact production request |
| ingestion/cli.ts | New --batch mode for extract (Phase 1 local validation) |
| ingestion-worker/ | New async orchestration: a submit step, a poller, a process-results step; finalize fan-in must wait on a pending batch + any code units |
| New: a batch poller | EventBridge Scheduler / Step Functions / SQS-delay: the key design choice (red-team) |
| app/ progress UI | Indeterminate "Processing…" state for batched docs |

Design

The shape of the change — local-first, then the async orchestration that's the real work.

Data Model Changes

  • documents: batch_id text, batch_submitted_at timestamptz, and a status value (or column) for batch_pending. (Red-team: reuse status enum vs a separate column.)
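
As a rough illustration of the "reuse the status enum" option, the proposed columns could be modeled as below. Only batch_id, batch_submitted_at, and batch_pending come from this plan; the other status values, the type names, and both helper functions are hypothetical placeholders.

```typescript
// Hypothetical row shape for the proposed columns. Only "batch_pending" comes from
// the plan; the other status values stand in for whatever the enum holds today.
type DocStatus = "processing" | "batch_pending" | "done" | "failed";

interface DocumentBatchFields {
  status: DocStatus;
  batch_id: string | null;         // text
  batch_submitted_at: Date | null; // timestamptz
}

// Returns the row as it would look right after a successful batch submit.
function markBatchSubmitted(
  row: DocumentBatchFields,
  batchId: string,
  now: Date
): DocumentBatchFields {
  return { ...row, status: "batch_pending", batch_id: batchId, batch_submitted_at: now };
}

// Invariants this option implies: the two batch columns are set together, and a
// batch_pending doc always has them set (a poller needs both to do its job).
function isConsistent(row: DocumentBatchFields): boolean {
  const hasId = row.batch_id !== null;
  const hasTime = row.batch_submitted_at !== null;
  if (hasId !== hasTime) return false;
  if (row.status === "batch_pending" && !hasId) return false;
  return true;
}
```

With a separate batch_status column instead, the same invariants would apply to that column, and the existing status enum would stay untouched.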

Approach

  1. Refactor extractStructuralUnit into three seams: buildGroupingRequest(unit) → params, callModel(params) → response (sync today), processGroupingResult(unit, result) → chunks (the existing post-LLM logic). Both the sync path and the batch path share build + process; only the call differs.
  2. Phase 1 — local/CLI batch (pnpm ingest extract <slug> --batch): collect all llm units → one batches.create → poll (retrieve until ended) → processGroupingResult per result → finalize. In-process polling is fine locally. This proves the full flow on real docs (and re-confirms the S6 numbers end-to-end, not just via replay).
  3. Phase 2 — production async orchestration: the hard part. prep splits units; code units process immediately (sync, instant); llm units are submitted as one batch and the doc enters a batch_pending state. A poller checks the batch; on ended, a process step runs processGroupingResult for each unit, then finalize (which now waits on: all code units done and the batch processed). Poller mechanism is the keystone red-team question — EventBridge Scheduler firing a poll Lambda, a Step Functions wait-state, or an SQS message with DelaySeconds self-redrive.
  4. Cost: record computeIngestCost({batch:true}) for batched units (already supported).
  5. Progress: batched docs show an indeterminate state, not a filling bar (research-confirmed: request_counts keeps every request under processing until the whole batch ends, so it cannot drive incremental progress).
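
The Phase 1 flow in step 2 (create, poll until ended, process each result) can be sketched as below. The BatchClient interface is a hypothetical narrow adapter over the Message Batches API, not the SDK's exact signatures; the field names custom_id, params, processing_status, and the four result types follow the API, while Unit, runLocalBatch, and the timing defaults are assumptions made for the example.

```typescript
// Hypothetical narrow adapter over the Message Batches API (create / retrieve / results).
// These are NOT the SDK's exact signatures; a real adapter would wrap the SDK client.
type BatchOutcome =
  | { type: "succeeded"; text: string }
  | { type: "errored" | "canceled" | "expired" };

interface BatchClient {
  create(requests: { custom_id: string; params: unknown }[]): Promise<{ id: string }>;
  retrieve(batchId: string): Promise<{ processing_status: "in_progress" | "canceling" | "ended" }>;
  results(batchId: string): Promise<{ custom_id: string; result: BatchOutcome }[]>;
}

interface Unit { id: string; text: string }

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runLocalBatch(
  client: BatchClient,
  units: Unit[],
  buildGroupingRequest: (u: Unit) => unknown,
  processGroupingResult: (u: Unit, text: string) => void,
  pollMs = 5000,             // placeholder poll interval
  maxWaitMs = 60 * 60 * 1000 // placeholder ceiling
): Promise<{ processed: string[]; failed: string[] }> {
  // One batch per doc: every llm unit becomes one request keyed by its unit id.
  const { id } = await client.create(
    units.map((u) => ({ custom_id: u.id, params: buildGroupingRequest(u) }))
  );

  // In-process polling: acceptable locally, not inside a Lambda.
  const deadline = Date.now() + maxWaitMs;
  while ((await client.retrieve(id)).processing_status !== "ended") {
    if (Date.now() > deadline) throw new Error(`batch ${id} exceeded max wait`);
    await sleep(pollMs);
  }

  // Results can come back in any order, so match them to units by custom_id.
  const byId: { [id: string]: Unit } = {};
  for (const u of units) byId[u.id] = u;

  const processed: string[] = [];
  const failed: string[] = [];
  for (const line of await client.results(id)) {
    const unit = byId[line.custom_id];
    if (unit && line.result.type === "succeeded") {
      processGroupingResult(unit, line.result.text);
      processed.push(unit.id);
    } else {
      failed.push(line.custom_id);
    }
  }
  return { processed, failed };
}
```

Passing the client in as an interface keeps the flow testable with a fake and leaves the production poller choice open, since only the polling loop would change in Phase 2.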

API / Interface Changes

  • extractStructuralUnit decomposed (above); buildGroupingRequest + processGroupingResult exported.
  • A submitBatch(units) → batchId + processBatchResults(batchId) pair in the worker.
  • documents gains batch-lifecycle state (batch_id, batch_submitted_at, plus either a batch_pending status value or a separate batch_status column); migration TBD.
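
A minimal sketch of the three-seam split, to show how both lanes can share build and process while only the call differs. Only the names buildGroupingRequest, callModel, and processGroupingResult come from this plan; every type, the model id, the prompt text, and the parsing logic are placeholders.

```typescript
// Only the three seam names come from the plan; every type and body here is a placeholder.
interface Unit { id: string; text: string }
interface GroupingParams {
  model: string;
  max_tokens: number;
  messages: { role: "user"; content: string }[];
}
interface GroupingResult { text: string }
interface Chunk { unitId: string; body: string }

// Seam 1: pure request construction, shared by both lanes.
function buildGroupingRequest(unit: Unit): GroupingParams {
  return {
    model: "MODEL_ID_PLACEHOLDER", // assumption: the real id lives in sdk-client.ts
    max_tokens: 1024,              // placeholder value
    messages: [{ role: "user", content: `Group this unit:\n${unit.text}` }],
  };
}

// Seam 3: pure post-LLM logic, shared by both lanes. Placeholder parse: one chunk
// per non-empty line of model output.
function processGroupingResult(unit: Unit, result: GroupingResult): Chunk[] {
  return result.text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((body) => ({ unitId: unit.id, body }));
}

// Sync lane: seam 2 (callModel) is a blocking call made inline.
async function extractSync(
  unit: Unit,
  callModel: (params: GroupingParams) => Promise<GroupingResult>
): Promise<Chunk[]> {
  return processGroupingResult(unit, await callModel(buildGroupingRequest(unit)));
}

// Batch lane: the call already happened elsewhere; results arrive keyed by unit id.
function extractFromBatch(
  units: Unit[],
  resultsByUnitId: { [unitId: string]: GroupingResult | undefined }
): Chunk[] {
  const chunks: Chunk[] = [];
  for (const unit of units) {
    const result = resultsByUnitId[unit.id];
    if (result) chunks.push(...processGroupingResult(unit, result));
  }
  return chunks;
}
```

Because both lanes run the same processGroupingResult, identical model output yields identical chunks, which is what the byte-identical goal depends on.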

Edge Cases & Gotchas

The async lane introduces failure modes the synchronous path doesn't have.

| Scenario | Expected Behavior | Why It's Tricky |
| --- | --- | --- |
| Batch stuck / exceeds SLA | Max-latency fallback (e.g. after 30–60 min: cancel + re-run those units synchronously, or flag the doc) | Most batches finish in under an hour, but that is not guaranteed (a batch can run up to 24 hours before it expires); a stuck batch must never strand a doc in processing forever |
| Partial batch failure (some requests errored) | Continue-on-error per the ingestion ethos; failed units retried sync or flagged | Batch reports per-request errors only at the end |
| Mixed-routing doc (code + llm units) | code units finish instantly; llm units batch; finalize waits for both | Fan-in must track two different completion signals |
| Doc with only 1–2 llm units | Maybe skip batch (sync is already fast; batch adds queue overhead) | Need a threshold: batch isn't worth it below N units (S6: the 10-call structured doc lost to sync on latency, though it was still ~50% cheaper) |
| Figure-vision units | Decide: batch them too, or keep vision sync? | Vision is llm and batchable, but it's a smaller surface; may not be worth the added complexity |
| Progress during batch | Indeterminate "Processing…" + bell on done | request_counts can't drive an incremental bar (contractual) |
| Lambda fire-and-forget death | The poller must be durable (not an in-Lambda loop) | A blocking poll in a Lambda will time out / die, hence the external poller |
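
The stuck-batch and partial-failure rows can be expressed as two small pure functions, sketched here under stated assumptions. The processing_status values and result types mirror the Message Batches API; the function names, the PollAction values, and the 45-minute ceiling are hypothetical.

```typescript
// Result types and processing_status values mirror the Message Batches API;
// everything else (names, the 45-minute ceiling) is a hypothetical illustration.
type ResultType = "succeeded" | "errored" | "canceled" | "expired";
interface BatchLine { custom_id: string; type: ResultType }

type PollAction = "process_results" | "keep_waiting" | "cancel";

// Decide what one poll tick should do. Pure, so it can run inside any poller.
function decidePollAction(
  processingStatus: "in_progress" | "canceling" | "ended",
  submittedAtMs: number,
  nowMs: number,
  maxLatencyMs: number = 45 * 60 * 1000
): PollAction {
  if (processingStatus === "ended") return "process_results";
  // A cancel is already under way; wait for the batch to reach "ended".
  if (processingStatus === "canceling") return "keep_waiting";
  if (nowMs - submittedAtMs > maxLatencyMs) return "cancel";
  return "keep_waiting";
}

// Continue-on-error: write what succeeded, hand the rest to sync retry (or flagging).
// Canceled and expired requests land in retrySync, which is how a cancel turns
// into the synchronous fallback.
function partitionResults(lines: BatchLine[]): { succeeded: string[]; retrySync: string[] } {
  const succeeded: string[] = [];
  const retrySync: string[] = [];
  for (const line of lines) {
    (line.type === "succeeded" ? succeeded : retrySync).push(line.custom_id);
  }
  return { succeeded, retrySync };
}
```

Keeping the decision separate from the poller mechanism means the same logic would apply whichever of the three poller options is chosen.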

Stories

Draft slate — red-team will reshape. Phased so we validate locally before the expensive async build.

| Story | Summary | Status |
| --- | --- | --- |
| S1 | Refactor extractStructuralUnit into build / call / process seams (no behavior change; gate-green) | Not started |
| S2 | Local/CLI batch path (--batch) + validate on the 3 S6 corpus docs end-to-end (confirm ~50% + correctness) | Not started |
| S3 | Production async orchestration (poller + submit/process steps + mixed-routing fan-in); the keystone and red-team target | Not started |
| S4 | Batch-eligibility policy (threshold on llm-unit count; figure-vision batch yes/no) | Not started |
| S5 | Max-latency fallback + partial-failure handling | Not started |
| S6 | Indeterminate progress UI state for batched docs | Not started |

Decisions Log

Provisional — confirm/revise in red-team.

| Date | Decision | Rationale | Alternatives |
| --- | --- | --- | --- |
| 2026-06-06 | Batch only llm-routed units | code units are instant/$0, so there is nothing to gain | Batch everything (no benefit, more complexity) |
| 2026-06-06 | Local/CLI path first, then prod async | Cheap end-to-end validation before the expensive orchestration ([[methodology]] manual-first) | Build prod orchestration directly (higher risk, unvalidated) |
| 2026-06-06 | Accept indeterminate progress for batched docs | request_counts is contractually pinned; bell covers "done" | Block batch on a progress solution (the small-batches experiment is the future option) |
| 2026-06-06 | Skip prompt caching | S6: inert below 4096 floor; doesn't stack with batch | Grow the prefix for caching (low $ value, prompt-quality risk) |

Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Async orchestration complexity (serverless poller) | High | Slow/buggy build | Local path first; pick the simplest durable poller; lean on Step Functions if hand-rolled polling gets hairy |
| Stuck/expired batch strands a doc | Medium | Doc never finalizes | Max-latency fallback (S5): cancel + sync-retry or flag |
| Latency win is tier-dependent | Medium | Win shrinks if we raise the ITPM tier | Cost win is unconditional; treat latency as a bonus, not the justification |
| Batch cost accounting drifts from sync | Low | Wrong displayed cost | Reuse computeIngestCost({batch:true}), already the single source |

Open Questions (red-team targets)

The things to attack next session, before any code:

  1. Poller mechanism — EventBridge Scheduler poll-Lambda vs Step Functions wait-state vs SQS DelaySeconds self-redrive? (Cost, complexity, the Lambda fire-and-forget trap.)
  2. Mixed-routing fan-in — how does finalize cleanly wait on both the code units and the pending batch?
  3. Batch-eligibility threshold — below how many llm units is sync still better? (S6 says small docs lose.)
  4. Figure-vision — batch the vision calls too, or keep them sync?
  5. Progress UX — is an indeterminate spinner acceptable for beta, or do we need the small-batches coarse-progress approach sooner than "future"?
  6. Does this block on, or compose with, the existing progress-bar work (#41)?
  7. Genesis routing surprise — should PDF scripture route deterministic instead? If so, that's a separate win that shrinks the batch surface for verse (handle independently).
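
To make questions 2 and 3 concrete for red-team, here is one possible shape for the lane split and the fan-in check. It is a sketch of an option, not a decision: the types, the function names, and especially MIN_BATCH_UNITS are hypothetical, and the threshold value is exactly what question 3 leaves open.

```typescript
// Hypothetical shapes; MIN_BATCH_UNITS is a placeholder because the threshold
// is still an open question.
type Route = "code" | "llm";
interface RoutedUnit { id: string; route: Route }

const MIN_BATCH_UNITS = 5; // placeholder, to be set from measurements

interface LanePlan { syncUnits: RoutedUnit[]; batchUnits: RoutedUnit[] }

// code units always run sync; llm units batch only when there are enough of them.
function planLanes(units: RoutedUnit[], minBatchUnits: number = MIN_BATCH_UNITS): LanePlan {
  const code = units.filter((u) => u.route === "code");
  const llm = units.filter((u) => u.route === "llm");
  return llm.length >= minBatchUnits
    ? { syncUnits: code, batchUnits: llm }
    : { syncUnits: code.concat(llm), batchUnits: [] };
}

interface FanInState {
  syncTotal: number;       // units processed on the synchronous lane
  syncDone: number;
  batchSubmitted: boolean; // false when the doc had no batch at all
  batchProcessed: boolean; // true once batch results have been written
}

// Finalize may be claimed only when both completion signals are satisfied.
function readyToFinalize(s: FanInState): boolean {
  const syncComplete = s.syncDone >= s.syncTotal;
  const batchComplete = !s.batchSubmitted || s.batchProcessed;
  return syncComplete && batchComplete;
}
```

A doc that never submits a batch reduces to today's single-signal fan-in, so the synchronous path would not need to change under this option.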
