Epic: Ingestion Cost Observability (+ Batch Spike)

Make ingestion (and query) cost visible and persisted, so the beta is a self-measuring pricing instrument. Drafted 2026-06-06; refined 2026-06-06 (pre-ship code-audit pass — see Decisions Log 2026-06-06b rows).

Testing Strategy

How each layer is verified — the pure math is gated; the DB/UI parts are manual by design.

Design

The shape of the change: bottom-up instrumentation, one pricing module, atomic per-doc accumulation.

Context

How this fits the surrounding system and what it touches.

Overview

What we're building and why, plus what we're deliberately leaving out.

Goals & Non-Goals

Goals:

Persist per-document ingest cost with a stage breakdown (Haiku grouping / figure vision / embeddings), computed from recorded tokens.
Show the cost in the inspector UI after a doc is ingested — the user (and we) can see exactly what an upload cost.
Track per-query cost so the beta accumulates the other half of COGS (ingest + query). The real query path is the production chat (/api/chat, AI SDK streamText, runs on Sonnet) — instrument there, not the dev-only playground.
Spike the Batch API and prompt caching together: measure the batch latency/discount delta AND whether growing the static prefix past Haiku's 4096-token cacheable floor yields net savings + ITPM relief, plus how the two interact. Produce a go/no-go recommendation from measurements rather than asserting savings.

Non-Goals:

No usage caps or tiers for beta. We measure first; limits come later from real beta data.
Not adopting the Batch API in prod this epic — S6 is a spike that produces a recommendation, not a wired-in batch path.
Not turning caching into a claimed cost lever this epic — at the current ~3K prefix it's a confirmed no-op (below the 4096 floor); whether to grow the prefix is what S6 measures.
Not building the deterministic-fallback "overflow" route — it already exists (route code); wiring it as a quota fallback is future work once caps exist.

Problem Statement

We are about to put the beta in front of real users and we cannot price what we cannot see. Today ingest cost is logged to stdout and never persisted — there is no per-doc cost, no breakdown, nothing in the UI, no way to model a $10/mo plan's limits. Worse, a recent investigation corrected a load-bearing assumption: real prose uploads are chunked by Haiku (LLM grouping), not the cheap embedding/deterministic path the eval corpus used (see [[unified-chunking-markdown]]). So prose ingest carries a real per-doc LLM cost we have never measured in situ. The beta should be the measuring stick.

What Is This Epic?

A thin observability layer over the existing ingestion pipeline: a single pricing module, per-doc cost persistence (stage-broken-down, computed from tokens so it's correct whether dev runs the Max CLI or prod runs the SDK), an inspector cost display, query-cost tracking, and a combined batch+caching spike. It adds no new chunking behavior — it instruments what already runs.

Dependents

Pricing & plan design (post-beta): the $10/mo individual plan's word/page quota and any premium/cheap tiering will be set from the cost data this epic collects. Blocked until we have real numbers.
Usage caps / deterministic-overflow routing (future): depends on per-doc cost existing to enforce against.

Dependencies

The ingestion worker (prep → extract → finalize) and the extractor clients (SDK / CLI) — already shipped.
route.ts routing (code vs llm) — already shipped; determines whether a doc pays Haiku grouping at all.

Current State

(Verified against the code 2026-06-06.)

Cost is logged to stdout only, never persisted. documents has no cost columns (latest migration 012). chat_queries stores response_tokens (output only) + latency_ms but no cost and no input tokens.
extractStructuralUnit returns recordedCostUsd per unit (the grouping call). sdk-client.ts already captures input_tokens / output_tokens / cache_creation_input_tokens / cache_read_input_tokens from the SDK usage object and computes cost via an inline PRICE constant (to be consolidated into pricing.ts). The CLI path returns only total_cost_usd (no per-token breakdown). embedDocument returns recordedTokens.
Figure-vision cost is currently discarded. The figure pass (figure-pass.ts detectFigures) returns only FigureRefT[]; in extractor.ts its unit result is hard-coded recordedCostUsd: 0 ("logged in the client usage line"). So vision tokens have no persisted source today — S2 must change this.
Prod cost reality (the corrected model): structured docs route to code → ~$0 text grouping (only figure-vision + embeddings); prose routes to llm → Haiku grouping (the real variable cost). Haiku 4.5 = $1/M in, $5/M out ($0.50/$2.50 on the Batch API). text-embedding-3-small = $0.02/M (negligible, <2%). Chat/query runs on a different model — Sonnet (claude-sonnet-4-5) via the AI SDK — so query cost needs the chat model's rates, not Haiku's.
Prompt caching is in the code but inert. cache_control: ephemeral is already on the grouping system block; Haiku 4.5's minimum cacheable prefix is 4096 tokens and ours is only ~2.6–3K, so cache_read is always 0. Not a cost lever until the prefix grows past the floor (S6 measures whether that's worth it; its original rationale was ITPM rate-limit relief, not cost).
Measured cost model (reference): prose ≈ $0.0024/1k words standard ($0.0008 projected with batch+cache, unconfirmed until S6). Short story ≈ $0.015; ~40k-word doc ≈ $0.12; full ~170k-word novel ≈ $0.34. A 50-page structured rule book ≈ $0.05 all-in. At $10/mo breakeven that's ~28 full novels / ~81 medium docs / hundreds of structured docs — ingest is not the cost threat; bulk prose upload + query volume are.

Affected Systems

System / Layer	How It's Affected
`ingestion/` extractor + embed	Surface tokens from `sdk-client`/`cli-client`/`embed` up through results; change `figure-pass`/`detectFigures` to return its usage so vision is a real stage
`ingestion-worker/` (`extract`, `finalize`)	Aggregate stage tokens (grouping and vision, separately) and write per-doc cost via atomic increments; finalize adds embed tokens + computes the final `ingest_cost_usd`
`db/migrations/` + `app/lib/db/schema.ts`	New migration `013` — per-doc cost columns + `chat_queries` cost columns. Must update BOTH the raw-SQL migration AND the Drizzle schema (both `documents` and `chat_queries` are defined in Drizzle) or typed inserts won't see the columns
`app/` inspector	Render the ingest cost + stage breakdown on the document
`app/api/chat` (the prod query path)	In the existing `onFinish({usage})`, compute query cost from Sonnet rates via `pricing.ts` and persist `input_tokens` + `cost_usd` to `chat_queries`
`ingestion/pricing.ts` (new)	Single source of truth for all rates — Haiku in/out + cache/batch multipliers, the chat model (Sonnet) in/out rates for query cost, and embedding rate

Approach

Instrument bottom-up:
(1) pricing.ts + tests, consolidating the inline sdk-client PRICE.
(2) Thread tokens through the extractor/embed results (including surfacing figure-pass vision tokens as a separate stage), write them via atomic increments in the worker, and compute the doc total at finalize.
(3) Expose + render in the inspector.
(4) Persist query cost from the real chat path (/api/chat onFinish, Sonnet rates).
(5) Run the combined Batch+caching spike (S6), which is investigative and gated behind its own measurement (see Stories).
Verify by ingesting the demo pair via the SDK path locally (real ANTHROPIC_API_KEY, small spend) so the stage breakdown actually populates; the dev Max-CLI path only yields a lump total_cost_usd, with no per-stage split. The demo pair is one prose doc (route llm, shows Haiku grouping tokens) plus one structured PDF (route code, ~$0 grouping + figure-vision only); the contrast is the demo.

API / Interface Changes

ingestion/pricing.ts (new) — the single source of truth:
- Rate constants: Haiku in/out, cached-read (0.1×) / cache-write (1.25×), batch (0.5×), the chat model (Sonnet) in/out rates, and the embedding rate.
- computeIngestCost(tokens: IngestTokens): number — pure, deterministic, unit-tested.
- computeQueryCost(tokens: QueryTokens): number — pure, Sonnet-rated, unit-tested.
- Consolidates the existing inline PRICE constant in sdk-client.ts (which sdk-client then imports) so there is exactly one rate table.
ExtractUnitResult gains a grouping token set (inputTokens / cachedInputTokens / outputTokens, SDK path) and a separate vision token set (visionInputTokens / visionOutputTokens) sourced from the (newly token-returning) figure pass. CLI path falls back to recordedCostUsd only when token counts aren't available.
detectFigures() return type gains its usage (input/output tokens) so the worker can attribute vision cost.
Document detail API response gains ingestCostUsd + the stage breakdown.
chat_queries insert in /api/chat onFinish gains input_tokens + cost_usd (computed via computeQueryCost).
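A minimal sketch of what the pricing module described above could look like, assuming all rates are expressed in USD per million tokens. The Haiku and embedding rates and the cache/batch multipliers are the ones quoted in this document. The Sonnet rates ($3/M in, $15/M out) are not stated here and are an assumption to verify against current pricing. The `RATES` name and the optional `batch` flag are illustrative, not the shipped signature, and applying the batch multiplier to both Haiku stages is also an assumption.

```typescript
// Hypothetical sketch of ingestion/pricing.ts. All rates are USD per 1M tokens.
interface IngestTokens {
  haikuInputTokens: number;
  haikuCachedInputTokens: number; // cached-read, billed at 0.1x
  haikuOutputTokens: number;
  visionInputTokens: number;
  visionOutputTokens: number;
  embedTokens: number;
}

interface QueryTokens {
  inputTokens: number; // Sonnet-rated
  outputTokens: number;
}

const RATES = {
  haiku: { inputPerM: 1, outputPerM: 5 }, // from this document
  sonnet: { inputPerM: 3, outputPerM: 15 }, // ASSUMPTION: verify before shipping
  embedPerM: 0.02, // text-embedding-3-small, from this document
  cacheReadMultiplier: 0.1,
  cacheWriteMultiplier: 1.25, // listed for completeness; unused until cache-write tokens are tracked
  batchMultiplier: 0.5,
} as const;

const PER_MILLION = 1_000_000;

// Pure and deterministic: the same token counts always give the same USD figure.
function computeIngestCost(tokens: IngestTokens, opts: { batch?: boolean } = {}): number {
  const { haiku } = RATES;
  const grouping =
    tokens.haikuInputTokens * haiku.inputPerM +
    tokens.haikuCachedInputTokens * haiku.inputPerM * RATES.cacheReadMultiplier +
    tokens.haikuOutputTokens * haiku.outputPerM;
  const vision =
    tokens.visionInputTokens * haiku.inputPerM +
    tokens.visionOutputTokens * haiku.outputPerM;
  // The batch discount only touches the Haiku stages; embeddings are a different provider.
  const llm = (grouping + vision) * (opts.batch ? RATES.batchMultiplier : 1);
  const embed = tokens.embedTokens * RATES.embedPerM;
  return (llm + embed) / PER_MILLION;
}

function computeQueryCost(tokens: QueryTokens): number {
  const { sonnet } = RATES;
  return (tokens.inputTokens * sonnet.inputPerM + tokens.outputTokens * sonnet.outputPerM) / PER_MILLION;
}
```

Keeping both functions free of I/O is what lets them run in the gated unit-test layer; the worker and `/api/chat` would only pass token counts in.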

Key Algorithms / Logic

Cost from tokens, not from the envelope. We persist token counts per stage and compute USD via pricing.ts. This makes the displayed cost (a) identical across the dev Max-CLI and prod SDK, and (b) auto-correct if we flip caching/batch on. CLI-only recordedCostUsd is stored as a fallback when tokens are coarse.
Figure-vision is its own stage. The grouping call and the figure-vision call are both Haiku but are separate API calls within one unit's extraction. Their tokens are kept apart (vision_* columns) so the structured-doc breakdown — which is almost entirely vision — is legible and not conflated with grouping.
Atomic accumulation across parallel units. Extraction runs as one worker invocation per structural unit, in parallel (one SQS extract-job per unit). Per-doc totals must accumulate without lost updates, so each unit handler issues a single atomic increment keyed by stage, with its own delta per column: UPDATE documents SET haiku_input_tokens = haiku_input_tokens + $haikuIn, vision_input_tokens = vision_input_tokens + $visionIn, ... finalize then adds embed_tokens and writes the computed ingest_cost_usd.
Query cost on the real path. /api/chat already runs streamText (Sonnet) and writes a chat_queries row in onFinish({usage}). Extend that same write to compute cost_usd from usage.inputTokens/usage.outputTokens at Sonnet rates and persist input_tokens. The dev playground (route(), CLI envelope) is secondary — leave as-is.
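The atomic increment described above could be built as one parameterized statement per unit. This is a sketch under assumptions: `buildCostIncrement` is a hypothetical helper, the `documents.id` key column and the `$1`-style placeholders (as used by node-postgres) are assumed, and the column names are the ones proposed for migration 013.

```typescript
// Columns a unit handler may increment; a closed union keeps arbitrary
// strings out of the generated SQL.
type StageColumn =
  | "haiku_input_tokens"
  | "haiku_cached_input_tokens"
  | "haiku_output_tokens"
  | "vision_input_tokens"
  | "vision_output_tokens";

type UnitTokenDeltas = Partial<Record<StageColumn, number>>;

interface SqlQuery {
  text: string;
  values: (string | number)[];
}

// Builds `UPDATE ... SET col = col + $n` so the database performs the
// read and the write in one step. Returns null when there is nothing to add.
function buildCostIncrement(documentId: string, deltas: UnitTokenDeltas): SqlQuery | null {
  const entries = (Object.entries(deltas) as [StageColumn, number | undefined][]).filter(
    (e): e is [StageColumn, number] => typeof e[1] === "number" && e[1] > 0,
  );
  if (entries.length === 0) return null;

  const assignments = entries.map(([col], i) => `${col} = ${col} + $${i + 1}`);
  const values: (string | number)[] = entries.map(([, delta]) => delta);
  values.push(documentId); // last placeholder is the document key

  return {
    text: `UPDATE documents SET ${assignments.join(", ")} WHERE id = $${values.length}`,
    values,
  };
}
```

Because every column is incremented relative to its current value inside the statement, two unit handlers finishing at the same moment cannot overwrite each other's totals.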

Data Model Changes

// pricing.ts
interface IngestTokens {
  haikuInputTokens: number;
  haikuCachedInputTokens: number; // cached-read, billed at 0.1×
  haikuOutputTokens: number;
  visionInputTokens: number;
  visionOutputTokens: number;
  embedTokens: number;
}

interface QueryTokens {
  inputTokens: number;   // Sonnet-rated
  outputTokens: number;
}

Migration 013 — update both the raw-SQL migration and the Drizzle schema (app/lib/db/schema.ts):

On documents (or a 1:1 document_ingest_cost): ingest_cost_usd numeric, plus the token counters haiku_input_tokens, haiku_cached_input_tokens, haiku_output_tokens, vision_input_tokens, vision_output_tokens, embed_tokens (each bigint default 0).
On chat_queries: input_tokens int, cost_usd numeric.
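The two new chat_queries columns could be derived from the `onFinish` usage object roughly as follows. This is a sketch, not the shipped handler: `toChatQueryCostFields` is a hypothetical name, the usage shape is modeled on the `inputTokens` / `outputTokens` fields mentioned above with the assumption that either may be missing, and the Sonnet rates are assumed values that would come from pricing.ts in practice.

```typescript
// Subset of the usage object we read. ASSUMPTION: a provider may omit
// either count, so both are optional here.
interface ChatUsage {
  inputTokens?: number;
  outputTokens?: number;
}

// Values destined for the new chat_queries columns.
interface ChatQueryCostFields {
  input_tokens: number | null;
  cost_usd: number | null;
}

// ASSUMED Sonnet rates in USD per 1M tokens; the real values belong in pricing.ts.
const SONNET_INPUT_PER_M = 3;
const SONNET_OUTPUT_PER_M = 15;

function toChatQueryCostFields(usage: ChatUsage): ChatQueryCostFields {
  const { inputTokens, outputTokens } = usage;
  if (inputTokens === undefined || outputTokens === undefined) {
    // Store NULL rather than a misleading $0 when usage is incomplete.
    return { input_tokens: inputTokens ?? null, cost_usd: null };
  }
  const costUsd =
    (inputTokens * SONNET_INPUT_PER_M + outputTokens * SONNET_OUTPUT_PER_M) / 1_000_000;
  return { input_tokens: inputTokens, cost_usd: costUsd };
}
```

Writing NULL for an unknown cost keeps later COGS aggregates honest: a missing value can be excluded, whereas a zero would silently pull the average down.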

Edge Cases & Gotchas

Scenario	Expected Behavior	Why It's Tricky
Parallel unit extraction writing the same doc's cost	No lost updates	Must use atomic `+=` increments, not read-modify-write
Dev runs Max CLI (cost "absorbed", no token breakdown)	UI total is prod-accurate; stage split only populates on the SDK path	Compute from tokens via `pricing.ts`; CLI gives a lump `total_cost_usd` only → verify the breakdown by running the SDK path locally
CLI path lacks per-token counts	Store `recordedCostUsd` fallback for the total	SDK gives tokens; CLI envelope gives cost — total still displays, breakdown does not
Figure-vision call vs grouping call	Counted as separate stages	Both are Haiku, but separate API calls; figure-pass must be changed to return usage (today it's discarded as `recordedCostUsd: 0`)
Structured doc (route `code`)	Shows ~$0 grouping + figure-vision only	Cost source differs by route — the breakdown must make that legible; the vision stage carries it
Query runs on Sonnet, ingest on Haiku	Query cost uses Sonnet rates	Two models, two rate sets in `pricing.ts`; don't price a query at Haiku rates
Prompt caching cache-read vs write	Inert today (sub-4096 prefix)	`cache_read` is always 0 until the prefix grows past Haiku's 4096 floor — S6 decides whether to grow it
Migration touches Drizzle-managed tables	New columns appear in both raw SQL and `schema.ts`	`documents` + `chat_queries` are Drizzle-defined; a raw-SQL-only migration leaves typed inserts blind to the columns
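The SDK-versus-CLI rows in the table above amount to a small precedence rule: prefer the token-derived cost, fall back to the CLI envelope's lump cost, and tell the UI whether a stage breakdown exists. A minimal sketch, with `resolveDisplayedCost` and its field names being hypothetical:

```typescript
type CostSource = "tokens" | "cli-envelope" | "unknown";

interface ResolvedCost {
  totalUsd: number | null;
  source: CostSource;
  hasBreakdown: boolean; // true only when per-stage tokens were recorded
}

// tokenCostUsd: cost computed from persisted token counts (SDK path), or null.
// recordedCostUsd: lump total_cost_usd from the CLI envelope, or null.
function resolveDisplayedCost(
  tokenCostUsd: number | null,
  recordedCostUsd: number | null,
): ResolvedCost {
  if (tokenCostUsd !== null) {
    return { totalUsd: tokenCostUsd, source: "tokens", hasBreakdown: true };
  }
  if (recordedCostUsd !== null) {
    // CLI path: a total is available, the per-stage split is not.
    return { totalUsd: recordedCostUsd, source: "cli-envelope", hasBreakdown: false };
  }
  return { totalUsd: null, source: "unknown", hasBreakdown: false };
}
```

Surfacing `source` alongside the total would let the inspector label a CLI-derived figure instead of rendering an empty breakdown as if every stage cost $0.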

Stories

Story	Summary	Status
S1	Pricing module — `ingestion/pricing.ts` rate constants (Haiku + Sonnet/chat + embedding + cache/batch multipliers) + `computeIngestCost` and `computeQueryCost` (pure, unit-tested); consolidate the inline `sdk-client` `PRICE`	Not started
S2	Persist per-doc ingest cost — migration `013` (raw SQL + Drizzle schema); thread stage tokens through extractor/embed; change `figure-pass`/`detectFigures` to return usage so vision is a separate stage; atomic-increment per stage in `extract`, finalize-compute in `finalize`	Not started
S3	Inspector cost display — doc API + UI cost line with stage breakdown (Haiku grouping / figure vision / embed) on hover	Not started
S4	Query cost tracking — in `/api/chat` `onFinish`, compute query `cost_usd` from Sonnet rates via `pricing.ts` and persist it + `input_tokens` to `chat_queries` (the prod path; playground `route()` left as-is)	Not started
S6	Batch API + prompt-caching spike — measure batch latency/discount AND grow-prefix-past-4096 + caching savings + ITPM relief, and how they interact; produce a go/no-go (see below)	Not started

(Former S5 "turn caching on" is folded into S6 — at the current prefix size it's a no-op, so it becomes a measured lever in the spike rather than a standalone story.)

S6 — Batch + caching spike (detail). Questions to answer, not code to ship:

Batch latency: real turnaround for a doc's units submitted as one batch vs the current synchronous per-unit path (median + worst case). Minutes or hours in practice?
Progress UX: the app tracks progress_done / progress_total / units_total and renders a progress bar (per [[autri-progress-bar]] work). Batch returns units all-at-once rather than incrementally — how do we represent progress? Options: poll batch status → synthetic progress; "queued/processing/done" states; hybrid (sync the first N units for motion, batch the tail).
Caching savings: does growing the static prefix past Haiku 4.5's 4096-token cacheable floor net out positive? Prefix repeats per unit, so cache-read (0.1×) replaces full input (1×) on units 2..N within the 5-min TTL — worth ~70% off the prefix portion (~10–20% off the grouping call, scaling with unit count). Caveat: grow the prefix with content that also helps extraction (clearer schema docs / a few-shot example) and re-run the chunking eval to confirm no regression — don't pad with filler. The larger prize may be ITPM rate-limit relief (cache reads are excluded from the input-TPM ceiling — the cause of the prod 429s under concurrent uploads), not the dollars.
Interaction: does caching survive batch mode? The cache TTL is 5 minutes but batch jobs process asynchronously over minutes-to-hours — the first unit's cache-write may expire before the tail units run, so cache reads might not land in batch at all. Measure whether batch + caching actually stack or are mutually exclusive in practice.
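The caching question above reduces to simple arithmetic on the prefix portion of each call. The sketch below assumes the first unit pays the 1.25× cache-write and every later unit lands inside the TTL and pays the 0.1× cache-read; `prefixSavingsFraction` is a hypothetical helper for the spike, not existing code.

```typescript
const CACHE_WRITE_MULTIPLIER = 1.25; // first unit writes the prefix to the cache
const CACHE_READ_MULTIPLIER = 0.1; // later units read it back

// Fraction saved on the prefix portion across `unitCount` grouping calls,
// ASSUMING every call after the first is a cache hit.
function prefixSavingsFraction(unitCount: number): number {
  if (!Number.isInteger(unitCount) || unitCount < 1) {
    throw new RangeError("unitCount must be a positive integer");
  }
  const uncached = unitCount; // each call pays 1x for the prefix
  const cached = CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (unitCount - 1);
  return 1 - cached / uncached;
}
```

Under these assumptions a 5-unit doc saves about 67% on the prefix portion, in line with the ~70% estimate, and the figure approaches 90% as the unit count grows. A single-unit doc comes out 25% more expensive on the prefix, since it pays the write premium with nothing to read it back. If batch processing outlasts the TTL, the cache-hit assumption fails and none of this applies.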

Success criteria: a one-page recommendation with measured numbers (batch latency, batch discount realized, caching savings + whether it stacks with batch, ITPM-relief observed) and a progress-UX proposal — enough to decide whether to wire batch and/or grow-the-prefix-caching into prod (likely a later epic).

Decisions Log

Date	Decision	Rationale	Alternatives Considered
2026-06-06	No usage caps/tiers for beta; measure instead	Can't price what we can't see; beta = measuring stick	Premium/cheap tiers now (cost savings too small to justify the surface)
2026-06-06	Cost shown in the UI per doc, stage-broken-down	Direct ask — Dan wants to literally see upload cost; feeds pricing model	Log-only / dashboard-only
2026-06-06	Per-doc total + stage breakdown granularity	Enough for the UI badge + pricing analysis without per-unit write volume	Per-doc only (less insight); per-unit + per-doc (more schema/writes)
2026-06-06	Compute cost from tokens via a pricing module	Consistent across Max-CLI (dev) and SDK (prod); auto-updates with caching/batch	Trust the CLI/SDK envelope cost directly (drifts across clients)
2026-06-06	Keep ingest synchronous for beta; batch is a spike only	Immediate cost feedback in the inspector; latency UX unknown	Adopt Batch API now (adds latency, muddies the "see cost after upload" UX)
2026-06-06	Track query cost too	Beta must measure total COGS (ingest + query), not just ingest	Ingest-only tracking
2026-06-06b	Query cost on the prod `/api/chat` path, Sonnet rates	That's the real product query path (`streamText`, already writes `chat_queries`); it runs on Sonnet, not Haiku	Instrument the dev `route()` playground (not real traffic; CLI envelope cost); both paths (second rate source, low value)
2026-06-06b	Figure-vision tokens surfaced as a separate stage	Structured docs (the pilot) are almost entirely figure-vision cost; today it's discarded (`recordedCostUsd: 0`) so the breakdown would read ~$0	Defer vision tracking (structured docs under-report); lump vision into the Haiku bucket (can't separate grouping from figure cost)
2026-06-06b	Caching folded into the batch spike (S6), not a standalone "free savings" story	It's a confirmed no-op below the 4096 floor; whether to grow the prefix is a measured trade-off (savings + ITPM relief vs prompt-quality risk), and it interacts with batch's TTL	Keep S5 as a separate "turn caching on" story (phantom savings); keep cache_control as a documented no-op (no action)
2026-06-06b	Verify the stage breakdown by running the SDK path locally	Dev Max-CLI returns only a lump cost with no per-token breakdown; the SDK path populates the real stage split	Accept lump cost in dev (breakdown unverified locally); back out synthetic tokens from CLI cost (approximate/hacky)
2026-06-06b	`pricing.ts` is the single rate table; consolidate the inline `sdk-client` PRICE	Avoid two rate sets drifting (the exact risk this log already flags)	Add `pricing.ts` alongside the existing inline `PRICE` (two sources of truth)
2026-06-06b	Migration 013 updates both raw SQL and the Drizzle schema	`documents` + `chat_queries` are Drizzle-defined; raw-SQL-only leaves typed inserts blind to the new columns	Raw-SQL migration only (typed inserts can't write the columns)

Test Layers

Layer	Applies?	Notes
Unit tests	Yes	`pricing.ts` `computeIngestCost` + `computeQueryCost` — rate math (Haiku, Sonnet, cached/batch flags), zero/edge inputs. Pure → runs in the gate.
Integration (DB)	Manual	Ingest a doc locally via the SDK path → assert per-doc cost rows, per-stage token columns (incl. vision), and atomic accumulation across parallel units. Needs Postgres + `ANTHROPIC_API_KEY` (not gated).
Integration (UI)	Manual	Inspector shows the cost line + breakdown; verify via headless preview (the `/hl:ship` QA backend).
End-to-end	Manual	Ingest one prose doc (route `llm`, Haiku grouping stage) + one structured PDF (route `code`, vision stage carries it) via the SDK path; confirm the cost contrast renders correctly.

Required Fixtures

Fixture Name	What It Tests	Priority
`pricing.computeIngestCost` cases	Rate math: standard, cached-read discount, batch discount, vision + grouping + embed mixed-stage totals	🔴 High
`pricing.computeQueryCost` cases	Sonnet in/out rate math; zero/edge inputs	🔴 High
prose-vs-structured ingest pair (SDK path)	Route-dependent cost source (Haiku grouping vs figure-vision-only) renders correctly	🟡 Medium

Verification Rules

computeIngestCost and computeQueryCost must have unit tests covering standard, cached, batch, and the Sonnet query path — these are the numbers users see.
Atomic-increment accumulation must be verified under parallel unit writes (no lost updates), including the separate vision stage.
Displayed cost must be computed from tokens, validated against hand-computed cost for a known doc ingested via the SDK path.
Cost-affecting changes (caching, batch, rate constants, model swaps) require a re-verification of a known doc's displayed cost.
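To make the lost-update rule concrete, here is an in-memory illustration of the worst-case interleaving. It is an analogy only: there is no database here, and the manual check still has to run real parallel unit handlers against Postgres. Function names are hypothetical.

```typescript
// Read-modify-write under the worst-case interleaving: every worker reads
// the stored total before any worker writes its result back.
function simulateReadModifyWrite(deltas: number[]): number {
  let stored = 0;
  const reads = deltas.map((delta) => ({ seen: stored, delta })); // all reads first
  for (const r of reads) {
    stored = r.seen + r.delta; // each write clobbers the previous one
  }
  return stored;
}

// Stand-in for `SET col = col + $delta`: read and write are a single step,
// so the order of workers cannot lose an update.
function simulateAtomicIncrement(deltas: number[]): number {
  let stored = 0;
  for (const delta of deltas) {
    stored += delta;
  }
  return stored;
}
```

With three units reporting 100, 200, and 300 tokens, the read-modify-write version ends at 300 (the last writer wins) while the atomic version ends at 600. A parallel-write verification is looking for exactly that gap.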

Risks

Risk	Likelihood	Impact	Mitigation
Lost updates accumulating per-doc cost across parallel units	Medium	Wrong cost shown	Atomic `+=` increments keyed by stage; never read-modify-write
Dev (Max CLI) shows no stage breakdown	High	Breakdown unverified locally	Verify via the SDK path locally; CLI path stores the lump cost fallback
Figure-vision cost stays at $0 if `figure-pass` isn't changed	High	Structured-doc breakdown reads ~$0 (misleading)	S2 explicitly changes `detectFigures` to return usage; vision gets its own stage/column
Query priced at the wrong (Haiku) rate	Medium	Wrong query COGS	`pricing.ts` carries the Sonnet rate; `computeQueryCost` is the only path
Batch latency breaks the progress-bar UX	Medium	Poor ingest UX if adopted	S6 spike measures it first; batch stays out of prod until resolved
Caching savings overstated / doesn't survive batch	Medium	Phantom savings in the model	S6 measures the 4096-floor savings AND the TTL-vs-batch-latency interaction before any claim
Query cost unbounded by any ingest quota	Medium	COGS surprise from chatty users	Track it in beta (this epic); set query limits later from data
Rate constants drift as Anthropic pricing changes	Low	Stale displayed cost	Centralized in `pricing.ts`; one edit updates everything

Known Issues / Tech Debt

Issue	Severity	Notes
Deterministic-overflow routing not wired	Low	Route `code` exists and performs on par with Haiku grouping except on dialogue; wire as a quota fallback once caps exist (future epic)
Batch API not in prod	Low	Deferred by decision; S6 produces the go/no-go
Prompt caching inert at current prefix size	Low	`cache_control` is in place but sub-4096; S6 decides whether to grow the prefix (cost + ITPM relief vs prompt-quality risk)
Dev playground query cost (`route()`) not tracked	Low	Intentional — playground isn't real traffic; only `/api/chat` query cost is persisted

S6 Results — Batch + Caching Spike (2026-06-06)

Verdict (REVISED after a 3-doc A/B): adopt batch for llm-routed docs — it's ~50% cheaper AND ~2× faster than the production sync path. Skip caching. Full writeup: docs/s6-batch-caching-spike.md (repo). Spend: ~$2.5 across the spike + A/B.

The 3-document batch-vs-sync A/B

Each doc ran through the real SDK pipeline at production concurrency (3) with every request captured, then those exact requests were replayed as one batch. Faithful replay, not reconstruction.

Doc (type)	LLM calls	Sync cost	Batch cost	Saved	Sync latency	Batch latency
docx novel (prose → Haiku grouping)	26	$0.683	$0.358	~48%	201s	91s
Genesis (PDF → Haiku grouping; see the "Surprise" finding below)	45	$0.780	$0.386	~50%	192s	61s
SRWF26 Technical (structured + figures)	2 + 8 vision	$0.093	$0.046	~51%	23s	45s
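The Saved column and the latency comparison can be recomputed from the table's own figures. The sketch below only restates the numbers above; the type and function names are hypothetical.

```typescript
interface AbRow {
  doc: string;
  syncCostUsd: number;
  batchCostUsd: number;
  syncLatencyS: number;
  batchLatencyS: number;
}

// Figures copied from the A/B table above.
const AB_ROWS: AbRow[] = [
  { doc: "docx novel", syncCostUsd: 0.683, batchCostUsd: 0.358, syncLatencyS: 201, batchLatencyS: 91 },
  { doc: "Genesis", syncCostUsd: 0.78, batchCostUsd: 0.386, syncLatencyS: 192, batchLatencyS: 61 },
  { doc: "SRWF26 Technical", syncCostUsd: 0.093, batchCostUsd: 0.046, syncLatencyS: 23, batchLatencyS: 45 },
];

// savedFraction: share of sync cost avoided by batch.
// speedup: sync latency divided by batch latency (>1 means batch was faster).
function summarizeAbRow(row: AbRow): { doc: string; savedFraction: number; speedup: number } {
  return {
    doc: row.doc,
    savedFraction: 1 - row.batchCostUsd / row.syncCostUsd,
    speedup: row.syncLatencyS / row.batchLatencyS,
  };
}

for (const summary of AB_ROWS.map(summarizeAbRow)) {
  console.log(
    `${summary.doc}: saved ${(summary.savedFraction * 100).toFixed(1)}%, speedup ${summary.speedup.toFixed(2)}x`,
  );
}
```

The two grouping-heavy docs come out roughly 2.2× and 3.1× faster in batch, while the small structured doc takes about twice as long (a 0.51× speedup), which matches the routing recommendation in the findings.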

Findings

Batch is faster for big docs, not slower (the assumption-flip). Batch requests run on a separate rate-limit bucket, sidestepping our Tier-1 ITPM ceiling (50K) that caps sync at ~3 concurrent. Cost win is unconditional; latency win holds while sync is ITPM-throttled.
Savings concentrate by routing. Prose/verse → Haiku grouping (large batchable surface, real $); structured → mostly code (tiny surface, ~$0.05 even at 50%). Route prose+verse to batch; keep structured on sync.
The only UX cost is the progress bar. request_counts stays pinned at {processing: all, succeeded: 0} until the whole batch ends, so there is no incremental status (and no webhooks; research-confirmed). Batched docs show an indeterminate "Processing…" state; the existing bell/notification model already covers "done."
Caching is the weaker lever — skip it. Static prefix is 3,014 tokens, below Haiku 4.5's 4,096 floor (cache_read=0, measured); growing past it saves ~$0.02/doc and doesn't stack with batch (cache TTL expires before async batch runs). Inert cache_control left in place as a documented no-op.
Surprise (separate follow-up): the PDF Bible (Genesis) routed to Haiku grouping, not the deterministic verse path — 45 LLM calls, $0.78. A raw PDF doesn't recover authored verse boundaries → falls to the LLM. If PDF scripture should route deterministic, that's a bigger win than batch for that type.

Follow-ups

Open a "batch ingestion" epic: new batch sink for llm-routed units; indeterminate progress state; batch max-latency fallback (batches usually finish in under an hour, but that is not guaranteed, so a stuck batch must not strand a doc).
Investigate PDF-scripture routing (Genesis) — deterministic verse path vs LLM grouping.
Future experiment (Dan): one-big-batch vs several-small-batches, to recover coarse progress granularity while keeping the discount.
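For the max-latency fallback in the first follow-up, the decision itself can be a small pure function that a poller calls on each tick. This is a sketch under assumptions: the status values are assumed to mirror the Message Batches processing states (`in_progress`, `canceling`, `ended`), and `nextBatchAction` and `maxWaitMs` are hypothetical names, not an existing API.

```typescript
type BatchStatus = "in_progress" | "canceling" | "ended";
type BatchAction = "wait" | "collect-results" | "fallback-to-sync";

// Decides what the poller should do next for one document's batch.
function nextBatchAction(
  batchStatus: BatchStatus,
  submittedAtMs: number,
  nowMs: number,
  maxWaitMs: number,
): BatchAction {
  if (batchStatus === "ended") {
    return "collect-results";
  }
  if (nowMs - submittedAtMs >= maxWaitMs) {
    // Deadline passed: re-run the units on the sync path so the doc is not
    // stranded. The caller should also cancel the batch to avoid paying twice.
    return "fallback-to-sync";
  }
  return "wait";
}
```

Keeping the rule pure makes the deadline behaviour unit-testable without submitting a real batch; the choice of `maxWaitMs` would come from the measured latencies above.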
