Epic: Batch Ingestion (async grouping via the Message Batches API)
DRAFT — for red-team next session. Routes llm-grouped ingestion units through the Anthropic Message Batches API to cut ingest cost ~50% and (at our current rate-limit tier) reduce latency. Validated in principle by the S6 A/B; this epic builds the path. Drafted 2026-06-06.
Overview
What we're building, why it's worth it, and what we're deliberately leaving out.
Goals & Non-Goals
Goals:
- Send `llm`-routed units (prose/verse grouping, and figure-vision) through the Message Batches API instead of synchronous per-unit calls.
- Realize the S6-measured win: ~50% cost cut, and ~2× lower latency for grouping-heavy docs (the latency win is structural: batch runs on a separate rate-limit bucket, bypassing our Tier-1 ITPM ceiling).
- Local/CLI path first (validate the full ingest-through-batch flow on real docs end-to-end), then production async orchestration.
- Keep chunking output byte-identical — this is a scheduling/billing change, not a quality change.
Non-Goals:
- Not batching `code`-routed units (deterministic, instant, ~$0: nothing to gain).
- Not prompt caching (S6: inert below the 4096 floor, doesn't stack with batch; skipped).
- Not the small-batches-for-progress experiment (future; this epic does one batch per doc).
- Not changing the router's `code` vs `llm` decision: batch is a new sink for `llm` units, not a new routing rule.
Problem Statement
S6 proved batch is ~50% cheaper and (at Tier 1) faster for llm-grouped docs — but the win requires a different control flow than today's synchronous per-unit fan-out. Today each unit is its own SQS job that makes a blocking LLM call inline. Batch is submit → poll → process-all: you can't make a blocking call, and results aren't retrievable until the whole batch ends. This epic builds that async path without regressing the synchronous one (which stays the right choice for code-routed and small docs).
What Is This Epic?
A new asynchronous ingestion lane: collect a doc's llm-routed units, submit them as one Message Batch, poll for completion, then run the existing post-LLM processing (parse grouping → write chunks) over the results, then finalize. The cost-accounting groundwork already shipped in the cost-observability epic (pricing.ts BATCH_MULT + computeIngestCost({batch:true}) + the per-doc cost columns), so the dollars side is ready; the work is the orchestration.
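
To make the "dollars side" concrete, here is a minimal sketch of how a batch multiplier could apply to token usage. It is not the code in pricing.ts: `sketchIngestCost`, the `Usage` and `Rates` shapes, and the example rates are illustrative assumptions, and the 0.5 multiplier is assumed from the ~50% figure above.

```typescript
// Illustrative only: names, shapes, and rates are assumptions, not pricing.ts.
interface Usage { inputTokens: number; outputTokens: number; }
interface Rates { inputPerMTok: number; outputPerMTok: number; } // USD per million tokens

// Assumed multiplier: batch requests billed at half the synchronous rate.
const BATCH_MULT_SKETCH = 0.5;

function sketchIngestCost(usage: Usage, rates: Rates, opts: { batch: boolean }): number {
  const syncCost =
    (usage.inputTokens / 1_000_000) * rates.inputPerMTok +
    (usage.outputTokens / 1_000_000) * rates.outputPerMTok;
  return opts.batch ? syncCost * BATCH_MULT_SKETCH : syncCost;
}

// Placeholder rates chosen for easy arithmetic, not real prices.
const exampleRates: Rates = { inputPerMTok: 2, outputPerMTok: 10 };
const exampleUsage: Usage = { inputTokens: 500_000, outputTokens: 250_000 };

const syncUsd = sketchIngestCost(exampleUsage, exampleRates, { batch: false }); // 1 + 2.5 = 3.5
const batchUsd = sketchIngestCost(exampleUsage, exampleRates, { batch: true }); // 3.5 * 0.5 = 1.75
```

Keeping the multiplier behind a single flag is what lets the sync and batch paths share one cost function.
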
Context
How this sits on top of the current pipeline and what it touches.
Dependents
- Pricing & plan design (Lever C, north-star): lower ingest COGS directly changes how many docs a tier can profitably allow. Batch is a primary COGS lever.
Dependencies
- Cost-observability epic (SHIPPED): provides batch-rate cost accounting (`computeIngestCost({batch:true})`) and the per-doc cost columns the batch path writes.
- `route.ts` `code` vs `llm` (shipped): determines which units are batch-eligible.
- The ingestion worker (`prep → extract → finalize`) and `extractStructuralUnit`: the synchronous path batch must compose with, not replace.
Current State
- Synchronous only: `prep` enqueues one SQS `extract-job` per unit; `extract` calls `extractStructuralUnit` (blocking LLM call inline, writes chunks, atomic cost+progress increments, fan-in marker → `tryClaimFinalize`); `finalize` links + embeds + computes cost.
- S6 A/B (real-path replay, 2026-06-06): docx novel 26 calls $0.683/201s → batch $0.358/91s; Genesis 45 calls $0.780/192s → $0.386/61s; SRWF26 structured 10 calls $0.093/23s → $0.046/45s. Cost drops ~50% across the board; latency improves on the two grouping-heavy docs and roughly doubles (23s → 45s) on the small structured doc. See `docs/s6-batch-caching-spike.md`.
- Cost columns + `BATCH_MULT` already shipped: the batch path records `computeIngestCost({batch:true})`.
- Progress today: `progress_done`/`progress_total` bumped per unit; `PipelineLive` polls every 2.5s and renders a filling bar. Batch can't drive an incremental bar (see Edge Cases).
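
As a worked check on the S6 figures quoted above, the per-doc cost saving and speedup can be computed directly. The numbers are copied from the A/B bullet; the `AbRow` shape and `summarize` helper are illustrative names, not project code.

```typescript
// Figures copied from the S6 A/B bullet; shape and helper names are illustrative.
interface AbRow { doc: string; syncUsd: number; syncSec: number; batchUsd: number; batchSec: number; }

const s6Rows: AbRow[] = [
  { doc: "docx novel", syncUsd: 0.683, syncSec: 201, batchUsd: 0.358, batchSec: 91 },
  { doc: "Genesis", syncUsd: 0.78, syncSec: 192, batchUsd: 0.386, batchSec: 61 },
  { doc: "SRWF26 structured", syncUsd: 0.093, syncSec: 23, batchUsd: 0.046, batchSec: 45 },
];

// Saving as a percentage (one decimal); speedup as sync time / batch time
// (two decimals), so a value above 1 means batch was faster.
function summarize(row: AbRow): { doc: string; savingPct: number; speedup: number } {
  return {
    doc: row.doc,
    savingPct: Math.round((1 - row.batchUsd / row.syncUsd) * 1000) / 10,
    speedup: Math.round((row.syncSec / row.batchSec) * 100) / 100,
  };
}

const s6Summary = s6Rows.map(summarize);
// docx novel:        47.6% saved, 2.21x
// Genesis:           50.5% saved, 3.15x
// SRWF26 structured: 50.5% saved, 0.51x (batch was slower)
```

The cost saving holds on all three docs; only the latency result flips sign on the small structured doc.
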
Affected Systems
| System / Layer | How It's Affected |
|---|---|
| `extractStructuralUnit` (`extractor.ts`) | Refactor into build-request / make-call / process-result so batch results can drive process-result (the chunk-writing) without re-calling the model |
| `sdk-client.ts` | Expose the request-params builder (already mostly factored) so the batch submitter reuses the exact production request |
| `ingestion/cli.ts` | New `--batch` mode for `extract` (Phase 1 local validation) |
| `ingestion-worker/` | New async orchestration: a submit step, a poller, a process-results step; `finalize` fan-in must wait on a pending batch + any `code` units |
| New: a batch poller | EventBridge Scheduler / Step Functions / SQS-delay: the key design choice (red-team) |
| `app/` progress UI | Indeterminate "Processing…" state for batched docs |
Design
The shape of the change — local-first, then the async orchestration that's the real work.
Data Model Changes
- `documents`: `batch_id text`, `batch_submitted_at timestamptz`, and a status value (or column) for `batch_pending`. (Red-team: reuse the `status` enum vs a separate column.)
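
One way to picture the lifecycle these columns support is a row type plus a fan-in check. This is a sketch under assumptions: `batch_id` and `batch_submitted_at` come from the text above, while the `DocStatus` values other than `batch_pending`, the `code_units_*` counters, and `canFinalize` are hypothetical.

```typescript
// Hypothetical row shape. Only batch_id, batch_submitted_at, and the
// batch_pending status come from the draft; the rest is illustrative.
type DocStatus = "processing" | "batch_pending" | "batch_processed" | "finalized";

interface DocRow {
  status: DocStatus;
  batch_id: string | null;
  batch_submitted_at: Date | null;
  code_units_total: number;
  code_units_done: number;
}

// Finalize may run only when both completion signals are in:
// every code unit is done, and no batch is still outstanding.
function canFinalize(doc: DocRow): boolean {
  const codeDone = doc.code_units_done >= doc.code_units_total;
  const batchDone = doc.batch_id === null || doc.status === "batch_processed";
  return codeDone && batchDone;
}
```

A doc with no `llm` units never gets a `batch_id`, so it finalizes on the code signal alone; whether the batch state lives in `status` or a separate column does not change the check.
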
Approach
- Refactor `extractStructuralUnit` into three seams: `buildGroupingRequest(unit) → params`, `callModel(params) → response` (sync today), `processGroupingResult(unit, result) → chunks` (the existing post-LLM logic). Both the sync path and the batch path share build + process; only the call differs.
- Phase 1, local/CLI batch (`pnpm ingest extract <slug> --batch`): collect all `llm` units → one `batches.create` → poll (`retrieve` until `ended`) → `processGroupingResult` per result → finalize. In-process polling is fine locally. This proves the full flow on real docs (and re-confirms the S6 numbers end-to-end, not just via replay).
- Phase 2, production async orchestration: the hard part. `prep` splits units; `code` units process immediately (sync, instant); `llm` units are submitted as one batch and the doc enters a `batch_pending` state. A poller checks the batch; on `ended`, a process step runs `processGroupingResult` for each unit, then `finalize` (which now waits on: all `code` units done and the batch processed). Poller mechanism is the keystone red-team question: EventBridge Scheduler firing a poll Lambda, a Step Functions wait-state, or an SQS message with `DelaySeconds` self-redrive.
- Cost: record `computeIngestCost({batch:true})` for batched units (already supported).
- Progress: batched docs show an indeterminate state, not a filling bar (research-confirmed: `request_counts` is pinned until the batch ends).
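
The three-seam refactor can be sketched as follows. The seam names come from the list above; the type shapes, the placeholder model id, the prompt, and the blank-line "grouping" logic are stand-in assumptions, not the real extractor.

```typescript
// Hypothetical shapes; the real types live in extractor.ts and sdk-client.ts.
interface GroupingUnit { id: string; text: string; }
interface GroupingParams { model: string; max_tokens: number; prompt: string; }
interface ModelResult { text: string; }
interface ChunkRow { unitId: string; body: string; }

// Seam 1: pure request construction, shared by both paths.
function buildGroupingRequest(unit: GroupingUnit): GroupingParams {
  return { model: "model-id-placeholder", max_tokens: 1024, prompt: `Group this text:\n${unit.text}` };
}

// Seam 3: pure post-LLM processing, shared by both paths. As a stand-in for
// the real grouping parser, emit one chunk per blank-line-separated block.
function processGroupingResult(unit: GroupingUnit, result: ModelResult): ChunkRow[] {
  return result.text
    .split(/\n\s*\n/)
    .map((body) => body.trim())
    .filter((body) => body.length > 0)
    .map((body) => ({ unitId: unit.id, body }));
}

// Seam 2 is the only part that differs. The sync path calls the model inline...
type CallModel = (params: GroupingParams) => Promise<ModelResult>;

async function extractSync(unit: GroupingUnit, callModel: CallModel): Promise<ChunkRow[]> {
  const result = await callModel(buildGroupingRequest(unit));
  return processGroupingResult(unit, result);
}

// ...while the batch path is handed a result that was produced earlier.
function extractFromBatchResult(unit: GroupingUnit, result: ModelResult): ChunkRow[] {
  return processGroupingResult(unit, result);
}
```

Because both paths funnel through the same `processGroupingResult`, identical model output yields identical chunks, which is what the byte-identical goal relies on.
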
API / Interface Changes
- `extractStructuralUnit` decomposed (above); `buildGroupingRequest` + `processGroupingResult` exported.
- A `submitBatch(units) → batchId` + `processBatchResults(batchId)` pair in the worker.
- `documents` gains a batch-lifecycle state (e.g. `batch_id`, `batch_status`/`batch_submitted_at`); migration TBD.
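
A minimal sketch of the `submitBatch` / `processBatchResults` pair, with the Phase 1 in-process poll between them. To keep it runnable without credentials it takes a narrow `BatchApi` interface, which is a hypothetical wrapper modeled on the Message Batches create / retrieve / results operations; the simplified result shape (a `text` field in place of a full message) and the `waitUntilEnded` helper are assumptions.

```typescript
// Hypothetical narrow interface modeled on the Message Batches operations.
// A real implementation would wrap the SDK client and its streamed results.
interface BatchRequest { custom_id: string; params: unknown; }
type BatchOutcome =
  | { type: "succeeded"; text: string } // simplified: the real result carries a message
  | { type: "errored" | "canceled" | "expired" };
interface BatchEntry { custom_id: string; result: BatchOutcome; }
interface BatchApi {
  create(requests: BatchRequest[]): Promise<{ id: string }>;
  retrieve(id: string): Promise<{ processing_status: "in_progress" | "canceling" | "ended" }>;
  results(id: string): Promise<BatchEntry[]>;
}

// One request per unit; custom_id carries the unit id so results match back.
async function submitBatch(api: BatchApi, units: { id: string; params: unknown }[]): Promise<string> {
  const batch = await api.create(units.map((u) => ({ custom_id: u.id, params: u.params })));
  return batch.id;
}

// Phase 1 only: in-process polling. Not suitable inside a Lambda.
async function waitUntilEnded(api: BatchApi, batchId: string, intervalMs: number): Promise<void> {
  while ((await api.retrieve(batchId)).processing_status !== "ended") {
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Split results into units ready for processGroupingResult and units that
// need a synchronous retry or a flag (continue-on-error).
async function processBatchResults(
  api: BatchApi,
  batchId: string,
): Promise<{ succeeded: Map<string, string>; failed: string[] }> {
  const succeeded = new Map<string, string>();
  const failed: string[] = [];
  for (const entry of await api.results(batchId)) {
    if (entry.result.type === "succeeded") succeeded.set(entry.custom_id, entry.result.text);
    else failed.push(entry.custom_id);
  }
  return { succeeded, failed };
}
```

Results are matched by `custom_id` rather than position, so the sketch does not depend on the order in which they come back.
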
Edge Cases & Gotchas
The async lane introduces failure modes the synchronous path doesn't have.
| Scenario | Expected Behavior | Why It's Tricky |
|---|---|---|
| Batch stuck / exceeds the ~1h norm | Max-latency fallback (e.g. after 30–60 min: cancel + re-run those units synchronously, or flag the doc) | Most batches finish in under 1h, but that is typical rather than guaranteed, and a batch can run up to 24h before it expires; a stuck batch must never strand a doc in processing forever |
| Partial batch failure (some requests errored) | Continue-on-error per the ingestion ethos; failed units retried sync or flagged | Batch reports per-request errors only at the end |
| Mixed-routing doc (`code` + `llm` units) | `code` units finish instantly; `llm` units batch; `finalize` waits for both | Fan-in must track two different completion signals |
| Doc with only 1–2 `llm` units | Maybe skip batch (sync is already fast; batch adds queue overhead) | Need a threshold: batch isn't worth the latency below N units (S6: the 10-call structured doc ran slower in batch, 45s vs 23s, though it still cost ~50% less) |
| Figure-vision units | Decide: batch them too, or keep vision sync? | Vision is llm and batchable, but it's a smaller surface; may not be worth the added complexity |
| Progress during batch | Indeterminate "Processing…" + bell on done | request_counts can't drive an incremental bar (contractual) |
| Lambda fire-and-forget death | The poller must be durable (not an in-Lambda loop) | A blocking poll in a Lambda will time out / die — hence the external poller |
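
The mixed-routing and small-doc rows above both come down to choosing a sink per unit. A minimal sketch, assuming a hypothetical `planSinks` helper and a `minBatchUnits` threshold whose value is still an open question:

```typescript
// Hypothetical policy sketch; the threshold value is not a decision.
type UnitRoute = "code" | "llm";
interface RoutedUnit { id: string; route: UnitRoute; }
interface SinkPlan { sync: RoutedUnit[]; batch: RoutedUnit[]; }

function planSinks(units: RoutedUnit[], minBatchUnits: number): SinkPlan {
  const llmUnits = units.filter((u) => u.route === "llm");
  // Too few llm units: queue overhead likely outweighs the latency win,
  // so the whole doc stays on the synchronous path.
  if (llmUnits.length < minBatchUnits) {
    return { sync: units, batch: [] };
  }
  // Otherwise code units run sync and llm units go to the batch.
  return { sync: units.filter((u) => u.route === "code"), batch: llmUnits };
}
```

The router's `code` vs `llm` decision is read, not changed. Since the small S6 doc still cost about half as much in batch, any threshold trades cost for latency rather than avoiding a pure loss.
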
Stories
Draft slate — red-team will reshape. Phased so we validate locally before the expensive async build.
| Story | Summary | Status |
|---|---|---|
| S1 | Refactor extractStructuralUnit into build / call / process seams (no behavior change; gate-green) | Not started |
| S2 | Local/CLI batch path (--batch) + validate on the 3 S6 corpus docs end-to-end (confirm ~50% + correctness) | Not started |
| S3 | Production async orchestration (poller + submit/process steps + mixed-routing fan-in) — the keystone; red-team target | Not started |
| S4 | Batch-eligibility policy (threshold on llm-unit count; figure-vision batch yes/no) | Not started |
| S5 | Max-latency fallback + partial-failure handling | Not started |
| S6 | Indeterminate progress UI state for batched docs | Not started |
Decisions Log
Provisional — confirm/revise in red-team.
| Date | Decision | Rationale | Alternatives |
|---|---|---|---|
| 2026-06-06 | Batch only llm-routed units | code units are instant/$0 — nothing to gain | Batch everything (no benefit, more complexity) |
| 2026-06-06 | Local/CLI path first, then prod async | Cheap end-to-end validation before the expensive orchestration ([[methodology]] manual-first) | Build prod orchestration directly (higher risk, unvalidated) |
| 2026-06-06 | Accept indeterminate progress for batched docs | request_counts is contractually pinned; bell covers "done" | Block batch on a progress solution (the small-batches experiment is the future option) |
| 2026-06-06 | Skip prompt caching | S6: inert below 4096 floor; doesn't stack with batch | Grow the prefix for caching (low $ value, prompt-quality risk) |
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Async orchestration complexity (serverless poller) | High | Slow/buggy build | Local path first; pick the simplest durable poller; lean on Step Functions if hand-rolled polling gets hairy |
| Stuck/expired batch strands a doc | Medium | Doc never finalizes | Max-latency fallback (S5) — cancel + sync-retry or flag |
| Latency win is tier-dependent | Medium | Win shrinks if we raise the ITPM tier | Cost win is unconditional; treat latency as a bonus, not the justification |
| Batch cost accounting drifts from sync | Low | Wrong displayed cost | Reuse computeIngestCost({batch:true}) — already the single source |
Open Questions (red-team targets)
The things to attack next session, before any code:
- Poller mechanism: EventBridge Scheduler poll-Lambda vs Step Functions wait-state vs SQS `DelaySeconds` self-redrive? (Cost, complexity, the Lambda fire-and-forget trap.)
- Mixed-routing fan-in: how does `finalize` cleanly wait on both the `code` units and the pending batch?
- Batch-eligibility threshold: below how many `llm` units is sync still better? (S6 says small docs lose on latency.)
- Figure-vision: batch the vision calls too, or keep them sync?
- Progress UX: is an indeterminate spinner acceptable for beta, or do we need the small-batches coarse-progress approach sooner than "future"?
- Does this block on, or compose with, the existing progress-bar work (#41)?
- Genesis routing surprise: should PDF scripture route deterministic instead? If so, that's a separate win that shrinks the batch surface for verse (handle independently).
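
For the poller question, whichever mechanism wins, each poll tick can be a stateless decision. The sketch below assumes the SQS self-redrive option and hypothetical names (`decidePoll`, `PollAction`); the 900-second cap reflects the SQS limit on `DelaySeconds`.

```typescript
// Hypothetical sketch of one poll tick; all names here are illustrative.
type PollAction =
  | { kind: "process" } // batch ended: run the process-results step
  | { kind: "fallback" } // past the deadline: cancel + sync retry, or flag the doc
  | { kind: "redrive"; delaySeconds: number }; // still running: re-enqueue the poll message

const SQS_MAX_DELAY_SECONDS = 900; // SQS caps DelaySeconds at 15 minutes

function decidePoll(args: {
  processingStatus: "in_progress" | "canceling" | "ended";
  submittedAtMs: number;
  nowMs: number;
  maxLatencyMs: number;
  pollIntervalSeconds: number;
}): PollAction {
  if (args.processingStatus === "ended") return { kind: "process" };
  if (args.nowMs - args.submittedAtMs >= args.maxLatencyMs) return { kind: "fallback" };
  return {
    kind: "redrive",
    delaySeconds: Math.min(args.pollIntervalSeconds, SQS_MAX_DELAY_SECONDS),
  };
}
```

Each invocation checks the batch once and exits, so nothing blocks inside a Lambda; the same decision function would fit behind an EventBridge schedule or a Step Functions choice state.
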