Epic: Ingestion Cost Observability (+ Batch Spike)
Make ingestion (and query) cost visible and persisted, so the beta is a self-measuring pricing instrument. Drafted 2026-06-06; refined 2026-06-06 (pre-ship code-audit pass — see Decisions Log 2026-06-06b rows).
Testing Strategy
How each layer is verified — the pure math is gated; the DB/UI parts are manual by design.
Design
The shape of the change: bottom-up instrumentation, one pricing module, atomic per-doc accumulation.
Context
How this fits the surrounding system and what it touches.
Overview
What we're building and why, plus what we're deliberately leaving out.
Goals & Non-Goals
Goals:
- Persist per-document ingest cost with a stage breakdown (Haiku grouping / figure vision / embeddings), computed from recorded tokens.
- Show the cost in the inspector UI after a doc is ingested — the user (and we) can see exactly what an upload cost.
- Track per-query cost so the beta accumulates the other half of COGS (ingest + query). The real query path is the production chat (`/api/chat`, AI SDK `streamText`, runs on Sonnet): instrument there, not the dev-only playground.
- Spike the Batch API and prompt caching together: measure the batch latency/discount delta AND whether growing the static prefix past Haiku's 4096-token cacheable floor yields net savings + ITPM relief, plus how the two interact. Produce a go/no-go, don't assert savings.
Non-Goals:
- No usage caps or tiers for beta. We measure first; limits come later from real beta data.
- Not adopting the Batch API in prod this epic — S6 is a spike that produces a recommendation, not a wired-in batch path.
- Not turning caching into a claimed cost lever this epic — at the current ~3K prefix it's a confirmed no-op (below the 4096 floor); whether to grow the prefix is what S6 measures.
- Not building the deterministic-fallback "overflow" route: it already exists (route `code`); wiring it as a quota fallback is future work once caps exist.
Problem Statement
We are about to put the beta in front of real users and we cannot price what we cannot see. Today ingest cost is logged to stdout and never persisted — there is no per-doc cost, no breakdown, nothing in the UI, no way to model a $10/mo plan's limits. Worse, a recent investigation corrected a load-bearing assumption: real prose uploads are chunked by Haiku (LLM grouping), not the cheap embedding/deterministic path the eval corpus used (see [[unified-chunking-markdown]]). So prose ingest carries a real per-doc LLM cost we have never measured in situ. The beta should be the measuring stick.
What Is This Epic?
A thin observability layer over the existing ingestion pipeline: a single pricing module, per-doc cost persistence (stage-broken-down, computed from tokens so it's correct whether dev runs the Max CLI or prod runs the SDK), an inspector cost display, query-cost tracking, and a combined batch+caching spike. It adds no new chunking behavior — it instruments what already runs.
Dependents
- Pricing & plan design (post-beta): the $10/mo individual plan's word/page quota and any premium/cheap tiering will be set from the cost data this epic collects. Blocked until we have real numbers.
- Usage caps / deterministic-overflow routing (future): depends on per-doc cost existing to enforce against.
Dependencies
- The ingestion worker (`prep → extract → finalize`) and the extractor clients (SDK / CLI): already shipped.
- `route.ts` routing (`code` vs `llm`): already shipped; determines whether a doc pays Haiku grouping at all.
Current State
(Verified against the code 2026-06-06.)
- Cost is logged to stdout only, never persisted. `documents` has no cost columns (latest migration `012`). `chat_queries` stores `response_tokens` (output only) + `latency_ms` but no cost and no input tokens.
- `extractStructuralUnit` returns `recordedCostUsd` per unit (the grouping call). `sdk-client.ts` already captures `input_tokens` / `output_tokens` / `cache_creation_input_tokens` / `cache_read_input_tokens` from the SDK `usage` object and computes cost via an inline `PRICE` constant (to be consolidated into `pricing.ts`). The CLI path returns only `total_cost_usd` (no per-token breakdown). `embedDocument` returns `recordedTokens`.
- Figure-vision cost is currently discarded. The figure pass (`figure-pass.ts` `detectFigures`) returns only `FigureRefT[]`; in `extractor.ts` its unit result is hard-coded `recordedCostUsd: 0` ("logged in the client usage line"). So vision tokens have no persisted source today; S2 must change this.
- Prod cost reality (the corrected model): structured docs route to `code` → ~$0 text grouping (only figure-vision + embeddings); prose routes to `llm` → Haiku grouping (the real variable cost). Haiku 4.5 = $1/M in, $5/M out ($0.50/$2.50 on the Batch API). `text-embedding-3-small` = $0.02/M (negligible, <2%). Chat/query runs on a different model, Sonnet (`claude-sonnet-4-5`) via the AI SDK, so query cost needs the chat model's rates, not Haiku's.
- Prompt caching is in the code but inert. `cache_control: ephemeral` is already on the grouping system block; Haiku 4.5's minimum cacheable prefix is 4096 tokens and ours is only ~2.6–3K, so `cache_read` is always 0. Not a cost lever until the prefix grows past the floor (S6 measures whether that's worth it; its original rationale was ITPM rate-limit relief, not cost).
- Measured cost model (reference): prose ≈ $0.0024/1k words standard ($0.0008 projected with batch+cache, unconfirmed until S6). Short story ≈ $0.015; ~40k-word doc ≈ $0.12; full ~170k-word novel ≈ $0.34. A 50-page structured rule book ≈ $0.05 all-in. At $10/mo breakeven that's ~28 full novels / ~81 medium docs / hundreds of structured docs. Ingest is not the cost threat; bulk prose upload + query volume are.
Affected Systems
| System / Layer | How It's Affected |
|---|---|
ingestion/ extractor + embed | Surface tokens from sdk-client/cli-client/embed up through results; change figure-pass/detectFigures to return its usage so vision is a real stage |
ingestion-worker/ (extract, finalize) | Aggregate stage tokens (grouping and vision, separately) and write per-doc cost via atomic increments; finalize adds embed tokens + computes the final ingest_cost_usd |
db/migrations/ + app/lib/db/schema.ts | New migration 013 — per-doc cost columns + chat_queries cost columns. Must update BOTH the raw-SQL migration AND the Drizzle schema (both documents and chat_queries are defined in Drizzle) or typed inserts won't see the columns |
app/ inspector | Render the ingest cost + stage breakdown on the document |
app/api/chat (the prod query path) | In the existing onFinish({usage}), compute query cost from Sonnet rates via pricing.ts and persist input_tokens + cost_usd to chat_queries |
ingestion/pricing.ts (new) | Single source of truth for all rates — Haiku in/out + cache/batch multipliers, the chat model (Sonnet) in/out rates for query cost, and embedding rate |
Approach
Instrument bottom-up: (1) pricing.ts + tests (consolidating the inline sdk-client PRICE); (2) thread tokens through the extractor/embed results — including surfacing figure-pass vision tokens as a separate stage — and write them via atomic increments in the worker, computing the doc total at finalize; (3) expose + render in the inspector; (4) persist query cost from the real chat path (/api/chat onFinish, Sonnet rates); (5) the combined Batch+caching spike (S6) is investigative and gated behind its own measurement — see Stories. Verify by ingesting the demo pair via the SDK path locally (real ANTHROPIC_API_KEY, small spend) so the stage breakdown actually populates — the dev Max-CLI path only yields a lump total_cost_usd, no per-stage split. One prose doc (route llm, shows Haiku grouping tokens) + one structured PDF (route code, ~$0 grouping + figure-vision only) — the contrast is the demo.
API / Interface Changes
- `ingestion/pricing.ts` (new), the single source of truth:
  - Rate constants: Haiku in/out, cached-read (`0.1×`) / cache-write (`1.25×`), batch (`0.5×`), the chat model (Sonnet) in/out rates, and the embedding rate.
  - `computeIngestCost(tokens: IngestTokens): number` (pure, deterministic, unit-tested).
  - `computeQueryCost(tokens: QueryTokens): number` (pure, Sonnet-rated, unit-tested).
  - Consolidates the existing inline `PRICE` constant in `sdk-client.ts` (which `sdk-client` then imports) so there is exactly one rate table.
- `ExtractUnitResult` gains a grouping token set (`inputTokens` / `cachedInputTokens` / `outputTokens`, SDK path) and a separate vision token set (`visionInputTokens` / `visionOutputTokens`) sourced from the (newly token-returning) figure pass. CLI path falls back to `recordedCostUsd` only when token counts aren't available.
- `detectFigures()` return type gains its usage (input/output tokens) so the worker can attribute vision cost.
- Document detail API response gains `ingestCostUsd` + the stage breakdown.
- `chat_queries` insert in `/api/chat` `onFinish` gains `input_tokens` + `cost_usd` (computed via `computeQueryCost`).
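A minimal sketch of what the `pricing.ts` surface listed above might look like. The Haiku 4.5 and embedding rates and the cache/batch multipliers are the ones quoted in this document. The Sonnet rates ($3/M in, $15/M out) are an assumption based on the public list price and should be verified before use. The `RATES` shape and the `batch` option are hypothetical, and pricing vision tokens at Haiku rates follows the note that both calls are Haiku.

```typescript
// Sketch only. Rates are USD per million tokens.
const PER_MILLION = 1_000_000;

const RATES = {
  haiku: { input: 1.0, output: 5.0 },   // Haiku 4.5, as quoted in Current State
  sonnet: { input: 3.0, output: 15.0 }, // ASSUMPTION: public list price, verify first
  embedding: 0.02,                      // text-embedding-3-small, as quoted
  cachedReadMultiplier: 0.1,
  cacheWriteMultiplier: 1.25,           // unused below: IngestTokens has no cache-write field
  batchMultiplier: 0.5,
} as const;

interface IngestTokens {
  haikuInputTokens: number;       // ASSUMPTION: excludes cached-read tokens
  haikuCachedInputTokens: number; // cached-read, billed at 0.1x
  haikuOutputTokens: number;
  visionInputTokens: number;
  visionOutputTokens: number;
  embedTokens: number;
}

interface QueryTokens {
  inputTokens: number;
  outputTokens: number;
}

// `batch` is a hypothetical option; the signature in this doc takes tokens only.
function computeIngestCost(t: IngestTokens, opts: { batch?: boolean } = {}): number {
  const { haiku } = RATES;
  const llmUsd =
    ((t.haikuInputTokens + t.visionInputTokens) * haiku.input +
      t.haikuCachedInputTokens * haiku.input * RATES.cachedReadMultiplier +
      (t.haikuOutputTokens + t.visionOutputTokens) * haiku.output) /
    PER_MILLION;
  // The batch discount is applied to the Haiku calls only, not to embeddings.
  const discountedLlmUsd = opts.batch ? llmUsd * RATES.batchMultiplier : llmUsd;
  const embedUsd = (t.embedTokens * RATES.embedding) / PER_MILLION;
  return discountedLlmUsd + embedUsd;
}

function computeQueryCost(t: QueryTokens): number {
  const { sonnet } = RATES;
  return (t.inputTokens * sonnet.input + t.outputTokens * sonnet.output) / PER_MILLION;
}
```

Under these rates a grouping call with 2,000 input and 400 output tokens prices at $0.004. Keeping both functions pure is what lets them run in the unit-test gate without Postgres or an API key.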
Key Algorithms / Logic
- Cost from tokens, not from the envelope. We persist token counts per stage and compute USD via `pricing.ts`. This makes the displayed cost (a) identical across the dev Max-CLI and prod SDK, and (b) auto-correct if we flip caching/batch on. CLI-only `recordedCostUsd` is stored as a fallback when tokens are coarse.
- Figure-vision is its own stage. The grouping call and the figure-vision call are both Haiku but are separate API calls within one unit's extraction. Their tokens are kept apart (`vision_*` columns) so the structured-doc breakdown, which is almost entirely vision, is legible and not conflated with grouping.
- Atomic accumulation across parallel units. Extraction runs as one worker invocation per structural unit, in parallel (one SQS `extract-job` per unit). Per-doc totals must accumulate without lost updates: each unit handler does `UPDATE documents SET haiku_input_tokens = haiku_input_tokens + $haikuDelta, vision_input_tokens = vision_input_tokens + $visionDelta, ...` (atomic increment, one delta per stage column). `finalize` adds `embed_tokens` and writes the computed `ingest_cost_usd`.
- Query cost on the real path. `/api/chat` already runs `streamText` (Sonnet) and writes a `chat_queries` row in `onFinish({usage})`. Extend that same write to compute `cost_usd` from `usage.inputTokens` / `usage.outputTokens` at Sonnet rates and persist `input_tokens`. The dev playground (`route()`, CLI envelope) is secondary; leave as-is.
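The atomic-increment write described above could be built roughly as follows. This is a sketch, not the worker's actual code: `buildIncrementQuery`, the `id` key column, and the `$n` placeholder style are assumptions for illustration, while the column names are the ones proposed for migration 013.

```typescript
// Stage columns from the proposed migration 013.
const STAGE_COLUMNS = [
  'haiku_input_tokens',
  'haiku_cached_input_tokens',
  'haiku_output_tokens',
  'vision_input_tokens',
  'vision_output_tokens',
] as const;

type StageColumn = (typeof STAGE_COLUMNS)[number];
type StageDeltas = Partial<Record<StageColumn, number>>;

interface ParamQuery {
  text: string;
  values: Array<string | number>;
}

// Hypothetical helper: builds one UPDATE that increments each stage column in
// place, so parallel unit handlers never overwrite each other's totals.
function buildIncrementQuery(documentId: string, deltas: StageDeltas): ParamQuery | null {
  const values: Array<string | number> = [];
  const sets: string[] = [];
  // Iterating the allowlist keeps column names out of caller control.
  for (const col of STAGE_COLUMNS) {
    const delta = deltas[col];
    if (delta === undefined || delta === 0) continue;
    values.push(delta);
    sets.push(`${col} = ${col} + $${values.length}`);
  }
  if (sets.length === 0) return null; // nothing to write for this unit
  values.push(documentId);
  // ASSUMPTION: the documents primary key column is named `id`.
  const text = `UPDATE documents SET ${sets.join(', ')} WHERE id = $${values.length}`;
  return { text, values };
}
```

Because each `SET col = col + $n` is evaluated by the database against the current row, no handler needs to read the running total first.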
Data Model Changes
// pricing.ts
interface IngestTokens {
haikuInputTokens: number;
haikuCachedInputTokens: number; // cached-read, billed at 0.1×
haikuOutputTokens: number;
visionInputTokens: number;
visionOutputTokens: number;
embedTokens: number;
}
interface QueryTokens {
inputTokens: number; // Sonnet-rated
outputTokens: number;
}
Migration 013 — update both the raw-SQL migration and the Drizzle schema (app/lib/db/schema.ts):
- On `documents` (or a 1:1 `document_ingest_cost`): `ingest_cost_usd numeric`, `haiku_input_tokens`, `haiku_cached_input_tokens`, `haiku_output_tokens`, `vision_input_tokens`, `vision_output_tokens`, `embed_tokens` (token columns all `bigint default 0`).
- On `chat_queries`: `input_tokens int`, `cost_usd numeric`.
Edge Cases & Gotchas
| Scenario | Expected Behavior | Why It's Tricky |
|---|---|---|
| Parallel unit extraction writing the same doc's cost | No lost updates | Must use atomic += increments, not read-modify-write |
| Dev runs Max CLI (cost "absorbed", no token breakdown) | UI total is prod-accurate; stage split only populates on the SDK path | Compute from tokens via pricing.ts; CLI gives a lump total_cost_usd only → verify the breakdown by running the SDK path locally |
| CLI path lacks per-token counts | Store recordedCostUsd fallback for the total | SDK gives tokens; CLI envelope gives cost — total still displays, breakdown does not |
| Figure-vision call vs grouping call | Counted as separate stages | Both are Haiku, but separate API calls; figure-pass must be changed to return usage (today it's discarded as recordedCostUsd: 0) |
| Structured doc (route code) | Shows ~$0 grouping + figure-vision only | Cost source differs by route, so the breakdown must make that legible; the vision stage carries it |
| Query runs on Sonnet, ingest on Haiku | Query cost uses Sonnet rates | Two models, two rate sets in pricing.ts; don't price a query at Haiku rates |
| Prompt caching cache-read vs write | Inert today (sub-4096 prefix) | cache_read is always 0 until the prefix grows past Haiku's 4096 floor — S6 decides whether to grow it |
| Migration touches Drizzle-managed tables | New columns appear in both raw SQL and schema.ts | documents + chat_queries are Drizzle-defined; a raw-SQL-only migration leaves typed inserts blind to the columns |
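The SDK-versus-CLI rows above amount to a small fallback rule: price from tokens when they exist, otherwise keep the envelope's lump cost. A hedged sketch follows; `UnitCostInput` and `resolveUnitCost` are hypothetical names, the field names mirror the `ExtractUnitResult` additions described earlier, and the Haiku rates are the ones quoted in this document.

```typescript
interface UnitCostInput {
  inputTokens?: number;
  cachedInputTokens?: number;
  outputTokens?: number;
  recordedCostUsd?: number; // lump total from the CLI envelope
}

type UnitCost =
  | { source: 'tokens'; usd: number }    // SDK path: breakdown available
  | { source: 'envelope'; usd: number }; // CLI path: total only

// Haiku 4.5 rates in USD per million tokens, as quoted in this doc.
const HAIKU_IN = 1.0;
const HAIKU_OUT = 5.0;
const CACHED_READ_MULTIPLIER = 0.1;

function resolveUnitCost(r: UnitCostInput): UnitCost {
  if (r.inputTokens !== undefined && r.outputTokens !== undefined) {
    const usd =
      (r.inputTokens * HAIKU_IN +
        (r.cachedInputTokens ?? 0) * HAIKU_IN * CACHED_READ_MULTIPLIER +
        r.outputTokens * HAIKU_OUT) /
      1_000_000;
    return { source: 'tokens', usd };
  }
  // No token counts: the total still displays, the stage breakdown does not.
  return { source: 'envelope', usd: r.recordedCostUsd ?? 0 };
}
```

Tagging the result with its `source` would let the inspector show the total in both cases while hiding the stage split on the CLI path.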
Stories
| Story | Summary | Status | PR |
|---|---|---|---|
| S1 | Pricing module — ingestion/pricing.ts rate constants (Haiku + Sonnet/chat + embedding + cache/batch multipliers) + computeIngestCost and computeQueryCost (pure, unit-tested); consolidate the inline sdk-client PRICE | Not started | |
| S2 | Persist per-doc ingest cost — migration 013 (raw SQL + Drizzle schema); thread stage tokens through extractor/embed; change figure-pass/detectFigures to return usage so vision is a separate stage; atomic-increment per stage in extract, finalize-compute in finalize | Not started | |
| S3 | Inspector cost display — doc API + UI cost line with stage breakdown (Haiku grouping / figure vision / embed) on hover | Not started | |
| S4 | Query cost tracking — in /api/chat onFinish, compute query cost_usd from Sonnet rates via pricing.ts and persist it + input_tokens to chat_queries (the prod path; playground route() left as-is) | Not started | |
| S6 | Batch API + prompt-caching spike: measure batch latency/discount AND grow-prefix-past-4096 + caching savings + ITPM relief, and how they interact; produce a go/no-go (see below) | Not started | |
(Former S5 "turn caching on" is folded into S6 — at the current prefix size it's a no-op, so it becomes a measured lever in the spike rather than a standalone story.)
S6 — Batch + caching spike (detail). Questions to answer, not code to ship:
- Batch latency: real turnaround for a doc's units submitted as one batch vs the current synchronous per-unit path (median + worst case). Minutes or hours in practice?
- Progress UX: the app tracks `progress_done / progress_total / units_total` and renders a progress bar (per [[autri-progress-bar]] work). Batch returns units all-at-once rather than incrementally, so how do we represent progress? Options: poll batch status → synthetic progress; "queued/processing/done" states; hybrid (sync the first N units for motion, batch the tail).
- Caching savings: does growing the static prefix past Haiku 4.5's 4096-token cacheable floor net out positive? Prefix repeats per unit, so cache-read (`0.1×`) replaces full input (`1×`) on units 2..N within the 5-min TTL, worth ~70% off the prefix portion (~10–20% off the grouping call, scaling with unit count). Caveat: grow the prefix with content that also helps extraction (clearer schema docs / a few-shot example) and re-run the chunking eval to confirm no regression; don't pad with filler. The larger prize may be ITPM rate-limit relief (cache reads are excluded from the input-TPM ceiling, the cause of the prod 429s under concurrent uploads), not the dollars.
- Interaction: does caching survive batch mode? The cache TTL is 5 minutes but batch jobs process asynchronously over minutes-to-hours, so the first unit's cache-write may expire before the tail units run and cache reads might not land in batch at all. Measure whether batch + caching actually stack or are mutually exclusive in practice.
Success criteria: a one-page recommendation with measured numbers (batch latency, batch discount realized, caching savings + whether it stacks with batch, ITPM-relief observed) and a progress-UX proposal — enough to decide whether to wire batch and/or grow-the-prefix-caching into prod (likely a later epic).
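For the caching question, a rough model of the prefix cost may help frame what the spike should measure. It assumes the first unit pays the cache-write multiplier (1.25×) and every later unit lands inside the TTL and pays the cache-read multiplier (0.1×). Real runs at concurrency 3 may see several first-wave misses, so treat this as an upper bound on savings. `prefixCostRatio` is a hypothetical helper.

```typescript
// Ratio of prefix cost with caching to prefix cost without caching,
// for a doc whose static prefix is sent once per unit.
function prefixCostRatio(units: number, writeMultiplier = 1.25, readMultiplier = 0.1): number {
  if (!Number.isInteger(units) || units < 1) {
    throw new RangeError('units must be a positive integer');
  }
  // Unit 1 writes the cache; units 2..N read it (ASSUMPTION: all within the TTL).
  const cachedCost = writeMultiplier + readMultiplier * (units - 1);
  const uncachedCost = units; // every unit pays the full 1x prefix
  return cachedCost / uncachedCost;
}

for (const units of [1, 5, 26]) {
  const ratio = prefixCostRatio(units);
  console.log(`${units} units: prefix costs ${(ratio * 100).toFixed(0)}% of uncached`);
}
```

With 5 units the ratio is 0.33, about two thirds off the prefix portion and in line with the ~70% quoted above. With a single unit it is 1.25, so a one-unit doc would pay more with caching than without.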
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-06-06 | No usage caps/tiers for beta; measure instead | Can't price what we can't see; beta = measuring stick | Premium/cheap tiers now (cost savings too small to justify the surface) |
| 2026-06-06 | Cost shown in the UI per doc, stage-broken-down | Direct ask — Dan wants to literally see upload cost; feeds pricing model | Log-only / dashboard-only |
| 2026-06-06 | Per-doc total + stage breakdown granularity | Enough for the UI badge + pricing analysis without per-unit write volume | Per-doc only (less insight); per-unit + per-doc (more schema/writes) |
| 2026-06-06 | Compute cost from tokens via a pricing module | Consistent across Max-CLI (dev) and SDK (prod); auto-updates with caching/batch | Trust the CLI/SDK envelope cost directly (drifts across clients) |
| 2026-06-06 | Keep ingest synchronous for beta; batch is a spike only | Immediate cost feedback in the inspector; latency UX unknown | Adopt Batch API now (adds latency, muddies the "see cost after upload" UX) |
| 2026-06-06 | Track query cost too | Beta must measure total COGS (ingest + query), not just ingest | Ingest-only tracking |
| 2026-06-06b | Query cost on the prod /api/chat path, Sonnet rates | That's the real product query path (streamText, already writes chat_queries); it runs on Sonnet, not Haiku | Instrument the dev route() playground (not real traffic; CLI envelope cost); both paths (second rate source, low value) |
| 2026-06-06b | Figure-vision tokens surfaced as a separate stage | Structured docs (the pilot) are almost entirely figure-vision cost; today it's discarded (recordedCostUsd: 0) so the breakdown would read ~$0 | Defer vision tracking (structured docs under-report); lump vision into the Haiku bucket (can't separate grouping from figure cost) |
| 2026-06-06b | Caching folded into the batch spike (S6), not a standalone "free savings" story | It's a confirmed no-op below the 4096 floor; whether to grow the prefix is a measured trade-off (savings + ITPM relief vs prompt-quality risk), and it interacts with batch's TTL | Keep S5 as a separate "turn caching on" story (phantom savings); keep cache_control as a documented no-op (no action) |
| 2026-06-06b | Verify the stage breakdown by running the SDK path locally | Dev Max-CLI returns only a lump cost with no per-token breakdown; the SDK path populates the real stage split | Accept lump cost in dev (breakdown unverified locally); back out synthetic tokens from CLI cost (approximate/hacky) |
| 2026-06-06b | pricing.ts is the single rate table; consolidate the inline sdk-client PRICE | Avoid two rate sets drifting (the exact risk this log already flags) | Add pricing.ts alongside the existing inline PRICE (two sources of truth) |
| 2026-06-06b | Migration 013 updates both raw SQL and the Drizzle schema | documents + chat_queries are Drizzle-defined; raw-SQL-only leaves typed inserts blind to the new columns | Raw-SQL migration only (typed inserts can't write the columns) |
Test Layers
| Layer | Applies? | Notes |
|---|---|---|
| Unit tests | Yes | pricing.ts computeIngestCost + computeQueryCost — rate math (Haiku, Sonnet, cached/batch flags), zero/edge inputs. Pure → runs in the gate. |
| Integration (DB) | Manual | Ingest a doc locally via the SDK path → assert per-doc cost rows, per-stage token columns (incl. vision), and atomic accumulation across parallel units. Needs Postgres + ANTHROPIC_API_KEY (not gated). |
| Integration (UI) | Manual | Inspector shows the cost line + breakdown; verify via headless preview (the /hl:ship QA backend). |
| End-to-end | Manual | Ingest one prose doc (route llm, Haiku grouping stage) + one structured PDF (route code, vision stage carries it) via the SDK path; confirm the cost contrast renders correctly. |
Required Fixtures
| Fixture Name | What It Tests | Priority |
|---|---|---|
| pricing.computeIngestCost cases | Rate math: standard, cached-read discount, batch discount, vision + grouping + embed mixed-stage totals | 🔴 High |
| pricing.computeQueryCost cases | Sonnet in/out rate math; zero/edge inputs | 🔴 High |
| prose-vs-structured ingest pair (SDK path) | Route-dependent cost source (Haiku grouping vs figure-vision-only) renders correctly | 🟡 Medium |
Verification Rules
- `computeIngestCost` and `computeQueryCost` must have unit tests covering standard, cached, batch, and the Sonnet query path; these are the numbers users see.
- Atomic-increment accumulation must be verified under parallel unit writes (no lost updates), including the separate vision stage.
- Displayed cost must be computed from tokens, validated against hand-computed cost for a known doc ingested via the SDK path.
- Cost-affecting changes (caching, batch, rate constants, model swaps) require a re-verification of a known doc's displayed cost.
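To make the lost-update rule concrete, here is a small in-memory model of the two write strategies. It is a sketch for intuition only and does not replace the manual Postgres check: the function names and token values are hypothetical, and the "worst case" is the interleaving where every handler reads before any handler writes.

```typescript
// Read-modify-write under the worst-case interleaving: all handlers SELECT
// the same starting value, then each UPDATE overwrites with stale + delta.
function readModifyWrite(start: number, deltas: number[]): number {
  const snapshot = start;
  let stored = start;
  for (const delta of deltas) {
    stored = snapshot + delta; // the last writer wins, earlier deltas are lost
  }
  return stored;
}

// Atomic increments: each UPDATE adds to whatever is currently stored,
// i.e. SET col = col + delta, so ordering does not matter.
function atomicIncrement(start: number, deltas: number[]): number {
  let stored = start;
  for (const delta of deltas) {
    stored = stored + delta;
  }
  return stored;
}

const unitTokens = [1200, 900, 1500]; // hypothetical per-unit grouping tokens
const expected = unitTokens.reduce((sum, n) => sum + n, 0);

console.log({
  expected,
  atomic: atomicIncrement(0, unitTokens),
  racy: readModifyWrite(0, unitTokens),
});
```

In this model the atomic total matches the sum of the deltas while the read-modify-write total keeps only the last unit's tokens, which is the failure the verification rule is meant to catch.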
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Lost updates accumulating per-doc cost across parallel units | Medium | Wrong cost shown | Atomic += increments keyed by stage; never read-modify-write |
| Dev (Max CLI) shows no stage breakdown | High | Breakdown unverified locally | Verify via the SDK path locally; CLI path stores the lump cost fallback |
| Figure-vision cost stays at $0 if figure-pass isn't changed | High | Structured-doc breakdown reads ~$0 (misleading) | S2 explicitly changes detectFigures to return usage; vision gets its own stage/column |
| Query priced at the wrong (Haiku) rate | Medium | Wrong query COGS | pricing.ts carries the Sonnet rate; computeQueryCost is the only path |
| Batch latency breaks the progress-bar UX | Medium | Poor ingest UX if adopted | S6 spike measures it first; batch stays out of prod until resolved |
| Caching savings overstated / doesn't survive batch | Medium | Phantom savings in the model | S6 measures the 4096-floor savings AND the TTL-vs-batch-latency interaction before any claim |
| Query cost unbounded by any ingest quota | Medium | COGS surprise from chatty users | Track it in beta (this epic); set query limits later from data |
| Rate constants drift as Anthropic pricing changes | Low | Stale displayed cost | Centralized in pricing.ts; one edit updates everything |
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| Deterministic-overflow routing not wired | Low | Route code exists and ties Haiku except on dialogue; wire as a quota fallback once caps exist (future epic) |
| Batch API not in prod | Low | Deferred by decision; S6 produces the go/no-go |
| Prompt caching inert at current prefix size | Low | cache_control is in place but sub-4096; S6 decides whether to grow the prefix (cost + ITPM relief vs prompt-quality risk) |
| Dev playground query cost (route()) not tracked | Low | Intentional: playground isn't real traffic; only /api/chat query cost is persisted |
S6 Results — Batch + Caching Spike (2026-06-06)
Verdict (REVISED after a 3-doc A/B): adopt batch for llm-routed docs — it's ~50% cheaper AND ~2× faster than the production sync path. Skip caching. Full writeup: docs/s6-batch-caching-spike.md (repo). Spend: ~$2.5 across the spike + A/B.
The 3-document batch-vs-sync A/B
Each doc ran through the real SDK pipeline at production concurrency (3) with every request captured, then those exact requests were replayed as one batch. Faithful replay, not reconstruction.
| Doc (type) | LLM calls | Sync cost | Batch cost | Saved | Sync latency | Batch latency |
|---|---|---|---|---|---|---|
| docx novel (prose → Haiku grouping) | 26 | $0.683 | $0.358 | ~48% | 201s | 91s |
| Genesis (PDF → Haiku grouping*) | 45 | $0.780 | $0.386 | ~50% | 192s | 61s |
| SRWF26 Technical (structured + figures) | 2 + 8 vision | $0.093 | $0.046 | ~51% | 23s | 45s |
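The derived columns can be sanity-checked from the raw numbers in the table. The sketch below recomputes the saving and the sync-to-batch speedup per doc; because the table's dollar figures are already rounded, the recomputed saving may differ from the Saved column by about a point. `AbRow` and `summarize` are hypothetical names.

```typescript
interface AbRow {
  doc: string;
  syncUsd: number;
  batchUsd: number;
  syncSec: number;
  batchSec: number;
}

// Values copied from the A/B table above.
const rows: AbRow[] = [
  { doc: 'docx novel', syncUsd: 0.683, batchUsd: 0.358, syncSec: 201, batchSec: 91 },
  { doc: 'Genesis', syncUsd: 0.78, batchUsd: 0.386, syncSec: 192, batchSec: 61 },
  { doc: 'SRWF26 Technical', syncUsd: 0.093, batchUsd: 0.046, syncSec: 23, batchSec: 45 },
];

function summarize(r: AbRow): { doc: string; savedPct: number; speedup: number } {
  return {
    doc: r.doc,
    // Share of the sync cost that batch avoids, as a whole percent.
    savedPct: Math.round((1 - r.batchUsd / r.syncUsd) * 100),
    // Above 1 means batch finished sooner than sync; one decimal place.
    speedup: Math.round((r.syncSec / r.batchSec) * 10) / 10,
  };
}

for (const s of rows.map(summarize)) {
  console.log(`${s.doc}: saved ~${s.savedPct}%, batch speedup ${s.speedup}x`);
}
```

The speedup is above 1 for the two Haiku-grouping docs and below 1 for the structured doc, where batch took longer than the 23s sync run.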
Findings
- Batch is faster for big docs, not slower (the assumption-flip). Batch requests run on a separate rate-limit bucket, sidestepping our Tier-1 ITPM ceiling (50K) that caps sync at ~3 concurrent. Cost win is unconditional; latency win holds while sync is ITPM-throttled.
- Savings concentrate by routing. Prose/verse → Haiku grouping (large batchable surface, real $); structured → mostly `code` (tiny surface, ~$0.05 even at 50%). Route prose+verse to batch; keep structured on sync.
- The only UX cost is the progress bar. `request_counts` is contractually pinned (`{processing: all, succeeded: 0}` until the batch ends; no incremental status, no webhooks; research-confirmed). Batched docs show an indeterminate "Processing…"; the bell/notification model already covers "done."
- Caching is the weaker lever; skip it. Static prefix is 3,014 tokens, below Haiku 4.5's 4,096 floor (`cache_read=0`, measured); growing past it saves ~$0.02/doc and doesn't stack with batch (cache TTL expires before async batch runs). Inert `cache_control` left in place as a documented no-op.
- Surprise (separate follow-up): the PDF Bible (Genesis) routed to Haiku grouping, not the deterministic verse path: 45 LLM calls, $0.78. A raw PDF doesn't recover authored verse boundaries → falls to the LLM. If PDF scripture should route deterministic, that's a bigger win than batch for that type.
Follow-ups
- Open a "batch ingestion" epic: new batch sink for `llm`-routed units; indeterminate progress state; batch max-latency fallback (SLA <1h but not guaranteed; a stuck batch must not strand a doc).
- Investigate PDF-scripture routing (Genesis): deterministic verse path vs LLM grouping.
- Future experiment (Dan): one-big-batch vs several-small-batches, to recover coarse progress granularity while keeping the discount.
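As a starting point for the batch-ingestion epic, the routing rule and the max-latency fallback could be sketched like this. Everything here is hypothetical (the `Sink` type, both function names, and the one-hour threshold in the usage line); it only encodes the two decisions stated in the findings and follow-ups.

```typescript
type Route = 'code' | 'llm';
type Sink = 'sync' | 'batch';

// Findings: llm-routed (prose/verse) units go to batch, structured docs stay on sync.
function chooseSink(route: Route): Sink {
  return route === 'llm' ? 'batch' : 'sync';
}

// Follow-up: a stuck batch must not strand a doc. Once the wait exceeds the
// configured ceiling, the caller would re-run the units on the sync path.
function shouldFallBackToSync(submittedAtMs: number, nowMs: number, maxWaitMs: number): boolean {
  return nowMs - submittedAtMs >= maxWaitMs;
}

// Usage with an assumed one-hour ceiling (the doc notes the SLA is not guaranteed).
const ONE_HOUR_MS = 60 * 60 * 1000;
const submittedAt = 0;
console.log(chooseSink('llm'), shouldFallBackToSync(submittedAt, 30 * 60 * 1000, ONE_HOUR_MS));
```

Taking `nowMs` as a parameter instead of reading the clock inside the function keeps the fallback check pure and easy to unit-test.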