Epic: Batch Ingestion (async grouping via the Message Batches API)

DRAFT — for red-team next session. Routes llm-grouped ingestion units through the Anthropic Message Batches API to cut ingest cost ~50% and (at our current rate-limit tier) reduce latency. Validated in principle by the S6 A/B; this epic builds the path. Drafted 2026-06-06.

Overview

What we're building, why it's worth it, and what we're deliberately leaving out.

Goals & Non-Goals

Goals:

Send llm-routed units (prose/verse grouping, and figure-vision) through the Message Batches API instead of synchronous per-unit calls.
Realize the S6-measured win: ~50% cost cut, and roughly 2-3× lower latency for grouping-heavy docs (the latency win is structural: batch runs on a separate rate-limit bucket, bypassing our Tier-1 ITPM ceiling).
Local/CLI path first (validate the full ingest-through-batch flow on real docs end-to-end), then production async orchestration.
Keep chunking output byte-identical for the same model response: this is a scheduling/billing change, not a quality change.

Non-Goals:

Not batching code-routed units (deterministic, instant, ~$0 — nothing to gain).
Not prompt caching (S6: inert below the 4096 floor, doesn't stack with batch — skipped).
Not the small-batches-for-progress experiment (future; this epic does one batch per doc).
Not changing the router's code vs llm decision — batch is a new sink for llm units, not a new routing rule.

Problem Statement

S6 proved batch is ~50% cheaper and (at Tier 1) faster for grouping-heavy docs, but the win requires a different control flow than today's synchronous per-unit fan-out. Today each unit is its own SQS job that makes a blocking LLM call inline. Batch is submit → poll → process-all: there is no blocking call to wait on, and results aren't retrievable until the whole batch ends. This epic builds that async path without regressing the synchronous one (which stays the right choice for code-routed and small docs).

What Is This Epic?

A new asynchronous ingestion lane: collect a doc's llm-routed units, submit them as one Message Batch, poll for completion, then run the existing post-LLM processing (parse grouping → write chunks) over the results, then finalize. The cost-accounting groundwork already shipped in the cost-observability epic (pricing.ts BATCH_MULT + computeIngestCost({batch:true}) + the per-doc cost columns), so the dollars side is ready; the work is the orchestration.

Context

How this sits on top of the current pipeline and what it touches.

Dependents

Pricing & plan design (Lever C, north-star): lower ingest COGS directly changes how many docs a tier can profitably allow. Batch is a primary COGS lever.

Dependencies

Cost-observability epic (SHIPPED): provides batch-rate cost accounting (computeIngestCost({batch:true})) and the per-doc cost columns the batch path writes.
route.ts code vs llm (shipped): determines which units are batch-eligible.
The ingestion worker (prep → extract → finalize) and extractStructuralUnit — the synchronous path batch must compose with, not replace.

Current State

Synchronous only: prep enqueues one SQS extract-job per unit; extract calls extractStructuralUnit (blocking LLM call inline, writes chunks, atomic cost+progress increments, fan-in marker → tryClaimFinalize); finalize links + embeds + computes cost.
S6 A/B (real-path replay, 2026-06-06): docx novel 26 calls $0.683/201s → batch $0.358/91s; Genesis 45 calls $0.780/192s → $0.386/61s; SRWF26 structured 10 calls $0.093/23s → $0.046/45s. Cost drops ~50% across the board; latency improves on the two grouping-heavy docs (about 2.2× and 3.1× faster) and regresses on the small structured doc (about 2× slower). See docs/s6-batch-caching-spike.md.
Cost columns + BATCH_MULT already shipped — the batch path records computeIngestCost({batch:true}).
Progress today: progress_done/progress_total bumped per unit; PipelineLive polls every 2.5s and renders a filling bar. Batch can't drive an incremental bar (see Edge Cases).
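The S6 figures above reduce to two ratios per doc: batch cost as a fraction of sync cost, and the sync-over-batch latency speedup. A minimal sketch that recomputes them from the quoted numbers (the `S6Row` type and `ratios` helper are illustrative names, not part of the codebase):

```typescript
// Illustrative only: recomputes ratios from the S6 A/B numbers quoted in this doc.
type S6Row = {
  doc: string;
  calls: number;
  syncUsd: number;
  syncSec: number;
  batchUsd: number;
  batchSec: number;
};

const s6: S6Row[] = [
  { doc: "docx novel", calls: 26, syncUsd: 0.683, syncSec: 201, batchUsd: 0.358, batchSec: 91 },
  { doc: "Genesis", calls: 45, syncUsd: 0.78, syncSec: 192, batchUsd: 0.386, batchSec: 61 },
  { doc: "SRWF26 structured", calls: 10, syncUsd: 0.093, syncSec: 23, batchUsd: 0.046, batchSec: 45 },
];

// costRatio < 1 means batch is cheaper; speedup > 1 means batch is faster.
function ratios(row: S6Row): { costRatio: number; speedup: number } {
  return {
    costRatio: row.batchUsd / row.syncUsd,
    speedup: row.syncSec / row.batchSec,
  };
}

for (const row of s6) {
  const { costRatio, speedup } = ratios(row);
  console.log(`${row.doc}: cost x${costRatio.toFixed(2)}, speedup x${speedup.toFixed(2)}`);
}
```

The cost ratio lands near 0.5 for all three docs, while the speedup is above 1 only for the two grouping-heavy docs; the 10-call doc comes out below 1, which is what motivates an eligibility threshold.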

Affected Systems

System / Layer	How It's Affected
`extractStructuralUnit` (`extractor.ts`)	Refactor into build-request / make-call / process-result so batch results can drive `process-result` (the chunk-writing) without re-calling the model
`sdk-client.ts`	Expose the request-params builder (already mostly factored) so the batch submitter reuses the exact production request
`ingestion/cli.ts`	New `--batch` mode for `extract` (Phase 1 local validation)
`ingestion-worker/`	New async orchestration: a submit step, a poller, a process-results step; `finalize` fan-in must wait on a pending batch + any `code` units
New: a batch poller	EventBridge Scheduler / Step Functions / SQS-delay — the key design choice (red-team)
`app/` progress UI	Indeterminate "Processing…" state for batched docs

Design

The shape of the change — local-first, then the async orchestration that's the real work.

Data Model Changes

documents: batch_id text, batch_submitted_at timestamptz, and a status value (or column) for batch_pending. (Red-team: reuse status enum vs a separate column.)

Approach

Refactor extractStructuralUnit into three seams: buildGroupingRequest(unit) → params, callModel(params) → response (sync today), processGroupingResult(unit, result) → chunks (the existing post-LLM logic). Both the sync path and the batch path share build + process; only the call differs.
Phase 1 — local/CLI batch (pnpm ingest extract <slug> --batch): collect all llm units → one batches.create → poll (retrieve until ended) → processGroupingResult per result → finalize. In-process polling is fine locally. This proves the full flow on real docs (and re-confirms the S6 numbers end-to-end, not just via replay).
Phase 2 — production async orchestration: the hard part. prep splits units; code units process immediately (sync, instant); llm units are submitted as one batch and the doc enters a batch_pending state. A poller checks the batch; on ended, a process step runs processGroupingResult for each unit, then finalize (which now waits on: all code units done and the batch processed). Poller mechanism is the keystone red-team question — EventBridge Scheduler firing a poll Lambda, a Step Functions wait-state, or an SQS message with DelaySeconds self-redrive.
Cost: record computeIngestCost({batch:true}) for batched units (already supported).
Progress: batched docs show an indeterminate state, not a filling bar (research-confirmed: request_counts is pinned until the batch ends).
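The Phase 1 flow (one submit, poll until ended, then process every result) might be sketched as below. `BatchClient` is a narrow, hypothetical interface that only mirrors the general create / retrieve / results shape of the Message Batches API; it is not the SDK's type, and the real SDK's results call streams entries rather than returning an array. `runLocalBatch`, `pollMs`, and `maxWaitMs` are likewise illustrative names.

```typescript
// Hypothetical, narrowed client shape. Verify names and signatures against the
// installed SDK before relying on any of this.
type BatchRequest = { custom_id: string; params: unknown };
type BatchStatus = { id: string; processing_status: "in_progress" | "canceling" | "ended" };
type BatchResultEntry = {
  custom_id: string;
  result: { type: "succeeded" | "errored" | "canceled" | "expired"; message?: unknown };
};

interface BatchClient {
  create(body: { requests: BatchRequest[] }): Promise<BatchStatus>;
  retrieve(id: string): Promise<BatchStatus>;
  results(id: string): Promise<BatchResultEntry[]>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Submit once, poll until the batch ends or the deadline passes, then fetch results.
// In-process polling like this is only meant for the local/CLI path.
async function runLocalBatch(
  client: BatchClient,
  requests: BatchRequest[],
  opts: { pollMs: number; maxWaitMs: number },
): Promise<BatchResultEntry[]> {
  const batch = await client.create({ requests });
  const deadline = Date.now() + opts.maxWaitMs;
  let status = batch;
  while (status.processing_status !== "ended") {
    if (Date.now() >= deadline) {
      throw new Error(`batch ${batch.id} not ended within ${opts.maxWaitMs}ms`);
    }
    await sleep(opts.pollMs);
    status = await client.retrieve(batch.id);
  }
  return client.results(batch.id);
}
```

Each returned entry would then be matched back to its unit by `custom_id` (result order is not guaranteed to follow request order) and passed to processGroupingResult.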

API / Interface Changes

extractStructuralUnit decomposed (above); buildGroupingRequest + processGroupingResult exported.
A submitBatch(units) → batchId + processBatchResults(batchId) pair in the worker.
documents gains a batch-lifecycle state (e.g. batch_id, batch_status/batch_submitted_at) — migration TBD.
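The decomposition above could look roughly like this. Only the function names come from this doc; the `Unit`, `GroupingParams`, `GroupingResult`, and `Chunk` shapes, the placeholder model name, and the line-splitting parser are stand-ins, and the model call is kept synchronous here purely for brevity.

```typescript
// Placeholder shapes: the real unit/params/result/chunk types live in the codebase.
type Unit = { id: string; text: string };
type GroupingParams = { model: string; max_tokens: number; prompt: string };
type GroupingResult = { text: string };
type Chunk = { unitId: string; index: number; text: string };

// Seam 1: pure request construction, shared by the sync and batch lanes.
function buildGroupingRequest(unit: Unit): GroupingParams {
  return { model: "placeholder-model", max_tokens: 1024, prompt: `Group:\n${unit.text}` };
}

// Seam 2: the only step that differs between lanes. The sync lane calls it inline;
// the batch lane skips it and takes results from the finished batch instead.
// (The real call is async; it is modeled as sync here to keep the sketch short.)
type CallModel = (params: GroupingParams) => GroupingResult;

// Seam 3: stand-in for the existing post-LLM logic (parse grouping, produce chunks).
// Placeholder parser: one chunk per non-empty line of the model output.
function processGroupingResult(unit: Unit, result: GroupingResult): Chunk[] {
  return result.text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((text, index) => ({ unitId: unit.id, index, text }));
}

// Sync lane: build -> call -> process, intended to behave like today's function.
function extractStructuralUnit(unit: Unit, callModel: CallModel): Chunk[] {
  return processGroupingResult(unit, callModel(buildGroupingRequest(unit)));
}
```

With this split, the batch lane feeds each batch result into the same `processGroupingResult`, so identical model output yields identical chunks in either lane.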

Edge Cases & Gotchas

The async lane introduces failure modes the synchronous path doesn't have.

Scenario	Expected Behavior	Why It's Tricky
Batch stuck / exceeds SLA	Max-latency fallback (e.g. after 30–60 min: cancel + re-run those units synchronously, or flag the doc)	Most batches finish in under 1h, but that is typical rather than guaranteed (a batch only expires after 24h); a stuck batch must never strand a doc in `processing` forever
Partial batch failure (some requests `errored`)	Continue-on-error per the ingestion ethos; failed units retried sync or flagged	Batch reports per-request errors only at the end
Mixed-routing doc (`code` + `llm` units)	`code` units finish instantly; `llm` units batch; finalize waits for both	Fan-in must track two different completion signals
Doc with only 1–2 `llm` units	Maybe skip batch (sync is already fast; batch adds queue overhead)	Need a threshold: batch isn't worth it below N units (S6: the 10-call structured doc was ~50% cheaper in batch but about 2× slower, 23s → 45s)
Figure-vision units	Decide: batch them too, or keep vision sync?	Vision is `llm` and batchable, but it's a smaller surface; may not be worth the added complexity
Progress during batch	Indeterminate "Processing…" + bell on done	`request_counts` can't drive an incremental bar (requests stay counted as processing until the whole batch ends)
Lambda fire-and-forget death	The poller must be durable (not an in-Lambda loop)	A blocking poll in a Lambda will hit the function timeout (15 minutes at most) and die, hence the external poller
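Two of the rows above (partial failure and the max-latency fallback) can be expressed as small pure functions that a durable poller would call on each tick. This is a sketch under assumptions: `partitionResults`, `nextPollAction`, and `maxLatencyMs` are hypothetical names, and the result types are modeled on the per-request outcomes the Message Batches API reports.

```typescript
// Hypothetical shapes, modeled on per-request batch outcomes.
type ResultType = "succeeded" | "errored" | "canceled" | "expired";
type Entry = { custom_id: string; result: { type: ResultType } };

// Continue-on-error: split finished results into units to process now and
// units to retry synchronously (or flag). Nothing here aborts the whole doc.
function partitionResults(entries: Entry[]): { ok: string[]; retry: string[] } {
  const ok: string[] = [];
  const retry: string[] = [];
  for (const entry of entries) {
    (entry.result.type === "succeeded" ? ok : retry).push(entry.custom_id);
  }
  return { ok, retry };
}

type PollAction = "process" | "wait" | "fallback";

// One poll tick: decide from stored state, then exit. No loop lives inside the
// Lambda; whatever scheduler or queue is chosen re-invokes it for the next tick.
function nextPollAction(
  processingStatus: "in_progress" | "canceling" | "ended",
  submittedAtMs: number,
  nowMs: number,
  maxLatencyMs: number,
): PollAction {
  if (processingStatus === "ended") return "process";
  if (nowMs - submittedAtMs >= maxLatencyMs) return "fallback";
  return "wait";
}
```

Here `submittedAtMs` would come from batch_submitted_at, so the decision needs no state beyond what the documents row already holds.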

Stories

Draft slate; red-team will reshape. Phased so we validate locally before the expensive async build. Story IDs are local to this epic: story S6 below is unrelated to the S6 A/B cited elsewhere.

Story	Summary	Status
S1	Refactor `extractStructuralUnit` into build / call / process seams (no behavior change; gate-green)	Not started
S2	Local/CLI batch path (`--batch`) + validate on the 3 S6 corpus docs end-to-end (confirm ~50% + correctness)	Not started
S3	Production async orchestration (poller + submit/process steps + mixed-routing fan-in) — the keystone; red-team target	Not started
S4	Batch-eligibility policy (threshold on `llm`-unit count; figure-vision batch yes/no)	Not started
S5	Max-latency fallback + partial-failure handling	Not started
S6	Indeterminate progress UI state for batched docs	Not started

Decisions Log

Provisional — confirm/revise in red-team.

Date	Decision	Rationale	Alternatives
2026-06-06	Batch only `llm`-routed units	`code` units are instant/$0 — nothing to gain	Batch everything (no benefit, more complexity)
2026-06-06	Local/CLI path first, then prod async	Cheap end-to-end validation before the expensive orchestration ([[methodology]] manual-first)	Build prod orchestration directly (higher risk, unvalidated)
2026-06-06	Accept indeterminate progress for batched docs	`request_counts` is contractually pinned; bell covers "done"	Block batch on a progress solution (the small-batches experiment is the future option)
2026-06-06	Skip prompt caching	S6: inert below 4096 floor; doesn't stack with batch	Grow the prefix for caching (low $ value, prompt-quality risk)

Risks

Risk	Likelihood	Impact	Mitigation
Async orchestration complexity (serverless poller)	High	Slow/buggy build	Local path first; pick the simplest durable poller; lean on Step Functions if hand-rolled polling gets hairy
Stuck/expired batch strands a doc	Medium	Doc never finalizes	Max-latency fallback (S5) — cancel + sync-retry or flag
Latency win is tier-dependent	Medium	Win shrinks if we raise the ITPM tier	Cost win is unconditional; treat latency as a bonus, not the justification
Batch cost accounting drifts from sync	Low	Wrong displayed cost	Reuse `computeIngestCost({batch:true})` — already the single source

Open Questions (red-team targets)

The things to attack next session, before any code:

Poller mechanism — EventBridge Scheduler poll-Lambda vs Step Functions wait-state vs SQS DelaySeconds self-redrive? (Cost, complexity, the Lambda fire-and-forget trap.)
Mixed-routing fan-in — how does finalize cleanly wait on both the code units and the pending batch?
Batch-eligibility threshold — below how many llm units is sync still better? (S6 says small docs lose.)
Figure-vision — batch the vision calls too, or keep them sync?
Progress UX — is an indeterminate spinner acceptable for beta, or do we need the small-batches coarse-progress approach sooner than "future"?
Does this block on, or compose with, the existing progress-bar work (#41)?
Genesis routing surprise: should PDF scripture be routed to code (deterministic) instead of llm? If so, that's a separate win that shrinks the batch surface for verse (handle independently).
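Two of the questions above (the eligibility threshold and the mixed-routing fan-in) can be framed as pure predicates for the red-team to argue over. Everything here is hypothetical: `shouldBatch`, `canFinalize`, the `FanInState` shape, and especially the threshold value, which S6 does not pin down.

```typescript
type Route = "code" | "llm";
type RoutedUnit = { id: string; route: Route };

// Hypothetical policy: batch only when a doc has at least `minLlmUnits` llm units.
// The right value for the threshold is an open question.
function shouldBatch(units: RoutedUnit[], minLlmUnits: number): boolean {
  return units.filter((u) => u.route === "llm").length >= minLlmUnits;
}

// Fan-in over two completion signals: per-unit counters for code units and a
// single batch state for the llm units.
type FanInState = {
  codeDone: number;
  codeTotal: number;
  batch: "none" | "pending" | "processed";
};

// Finalize only when every code unit is done AND no batch is still pending.
function canFinalize(state: FanInState): boolean {
  return state.codeDone >= state.codeTotal && state.batch !== "pending";
}
```

A doc with no llm units carries `batch: "none"` and finalizes on the code counters alone, which keeps today's synchronous path unchanged.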
