Foundry

Epic: Ingestion Cost Observability (+ Batch Spike)

Make ingestion (and query) cost visible and persisted, so the beta is a self-measuring pricing instrument. Drafted 2026-06-06; refined 2026-06-06 (pre-ship code-audit pass — see Decisions Log 2026-06-06b rows).

Testing Strategy

How each layer is verified — the pure math is gated; the DB/UI parts are manual by design.

Design

The shape of the change: bottom-up instrumentation, one pricing module, atomic per-doc accumulation.

Context

How this fits the surrounding system and what it touches.

Overview

What we're building and why, plus what we're deliberately leaving out.

Goals & Non-Goals

Goals:

  • Persist per-document ingest cost with a stage breakdown (Haiku grouping / figure vision / embeddings), computed from recorded tokens.
  • Show the cost in the inspector UI after a doc is ingested — the user (and we) can see exactly what an upload cost.
  • Track per-query cost so the beta accumulates the other half of COGS (ingest + query). The real query path is the production chat (/api/chat, AI SDK streamText, runs on Sonnet) — instrument there, not the dev-only playground.
  • Spike the Batch API and prompt caching together — measure the batch latency/discount delta AND whether growing the static prefix past Haiku's 4096-token cacheable floor yields net savings + ITPM relief, plus how the two interact. Produce a go/no-go, don't assert savings.

Non-Goals:

  • No usage caps or tiers for beta. We measure first; limits come later from real beta data.
  • Not adopting the Batch API in prod this epic — S6 is a spike that produces a recommendation, not a wired-in batch path.
  • Not turning caching into a claimed cost lever this epic — at the current ~3K prefix it's a confirmed no-op (below the 4096 floor); whether to grow the prefix is what S6 measures.
  • Not building the deterministic-fallback "overflow" route — it already exists (route code); wiring it as a quota fallback is future work once caps exist.

Problem Statement

We are about to put the beta in front of real users and we cannot price what we cannot see. Today ingest cost is logged to stdout and never persisted — there is no per-doc cost, no breakdown, nothing in the UI, no way to model a $10/mo plan's limits. Worse, a recent investigation corrected a load-bearing assumption: real prose uploads are chunked by Haiku (LLM grouping), not the cheap embedding/deterministic path the eval corpus used (see [[unified-chunking-markdown]]). So prose ingest carries a real per-doc LLM cost we have never measured in situ. The beta should be the measuring stick.

What Is This Epic?

A thin observability layer over the existing ingestion pipeline: a single pricing module, per-doc cost persistence (stage-broken-down, computed from tokens so it's correct whether dev runs the Max CLI or prod runs the SDK), an inspector cost display, query-cost tracking, and a combined batch+caching spike. It adds no new chunking behavior — it instruments what already runs.

Dependents

  • Pricing & plan design (post-beta): the $10/mo individual plan's word/page quota and any premium/cheap tiering will be set from the cost data this epic collects. Blocked until we have real numbers.
  • Usage caps / deterministic-overflow routing (future): depends on per-doc cost existing to enforce against.

Dependencies

  • The ingestion worker (prep → extract → finalize) and the extractor clients (SDK / CLI) — already shipped.
  • route.ts routing (code vs llm) — already shipped; determines whether a doc pays Haiku grouping at all.

Current State

(Verified against the code 2026-06-06.)

  • Cost is logged to stdout only, never persisted. documents has no cost columns (latest migration 012). chat_queries stores response_tokens (output only) + latency_ms but no cost and no input tokens.
  • extractStructuralUnit returns recordedCostUsd per unit (the grouping call). sdk-client.ts already captures input_tokens / output_tokens / cache_creation_input_tokens / cache_read_input_tokens from the SDK usage object and computes cost via an inline PRICE constant (to be consolidated into pricing.ts). The CLI path returns only total_cost_usd (no per-token breakdown). embedDocument returns recordedTokens.
  • Figure-vision cost is currently discarded. The figure pass (figure-pass.ts detectFigures) returns only FigureRefT[]; in extractor.ts its unit result is hard-coded recordedCostUsd: 0 ("logged in the client usage line"). So vision tokens have no persisted source today — S2 must change this.
  • Prod cost reality (the corrected model): structured docs route to code → ~$0 text grouping (only figure-vision + embeddings); prose routes to llm → Haiku grouping (the real variable cost). Haiku 4.5 = $1/M in, $5/M out ($0.50/$2.50 on the Batch API). text-embedding-3-small = $0.02/M (negligible, <2%). Chat/query runs on a different model, Sonnet (claude-sonnet-4-5) via the AI SDK, so query cost needs the chat model's rates, not Haiku's.
  • Prompt caching is in the code but inert. cache_control: ephemeral is already on the grouping system block; Haiku 4.5's minimum cacheable prefix is 4096 tokens and ours is only ~2.6–3K, so cache_read is always 0. Not a cost lever until the prefix grows past the floor (S6 measures whether that's worth it; its original rationale was ITPM rate-limit relief, not cost).
  • Measured cost model (reference): prose ≈ $0.0024/1k words standard ($0.0008 projected with batch+cache, unconfirmed until S6). Short story ≈ $0.015; ~40k-word doc ≈ $0.12; full ~170k-word novel ≈ $0.34. A 50-page structured rule book ≈ $0.05 all-in. At $10/mo breakeven that's ~28 full novels / ~81 medium docs / hundreds of structured docs — ingest is not the cost threat; bulk prose upload + query volume are.

Affected Systems

| System / Layer | How It's Affected |
| --- | --- |
| ingestion/ extractor + embed | Surface tokens from sdk-client/cli-client/embed up through results; change figure-pass/detectFigures to return its usage so vision is a real stage |
| ingestion-worker/ (extract, finalize) | Aggregate stage tokens (grouping and vision, separately) and write per-doc cost via atomic increments; finalize adds embed tokens + computes the final ingest_cost_usd |
| db/migrations/ + app/lib/db/schema.ts | New migration 013: per-doc cost columns + chat_queries cost columns. Must update BOTH the raw-SQL migration AND the Drizzle schema (both documents and chat_queries are defined in Drizzle) or typed inserts won't see the columns |
| app/ inspector | Render the ingest cost + stage breakdown on the document |
| app/api/chat (the prod query path) | In the existing onFinish({usage}), compute query cost from Sonnet rates via pricing.ts and persist input_tokens + cost_usd to chat_queries |
| ingestion/pricing.ts (new) | Single source of truth for all rates: Haiku in/out + cache/batch multipliers, the chat model (Sonnet) in/out rates for query cost, and embedding rate |

Approach

Instrument bottom-up:

  1. pricing.ts + tests (consolidating the inline sdk-client PRICE).
  2. Thread tokens through the extractor/embed results, including surfacing figure-pass vision tokens as a separate stage, and write them via atomic increments in the worker, computing the doc total at finalize.
  3. Expose + render in the inspector.
  4. Persist query cost from the real chat path (/api/chat onFinish, Sonnet rates).
  5. The combined Batch+caching spike (S6) is investigative and gated behind its own measurement; see Stories.

Verify by ingesting the demo pair via the SDK path locally (real ANTHROPIC_API_KEY, small spend) so the stage breakdown actually populates; the dev Max-CLI path only yields a lump total_cost_usd, no per-stage split. The demo is the contrast between one prose doc (route llm, shows Haiku grouping tokens) and one structured PDF (route code, ~$0 grouping + figure-vision only).

API / Interface Changes

  • ingestion/pricing.ts (new) — the single source of truth:
    • Rate constants: Haiku in/out, cached-read (0.1×) / cache-write (1.25×), batch (0.5×), the chat model (Sonnet) in/out rates, and the embedding rate.
    • computeIngestCost(tokens: IngestTokens): number — pure, deterministic, unit-tested.
    • computeQueryCost(tokens: QueryTokens): number — pure, Sonnet-rated, unit-tested.
    • Consolidates the existing inline PRICE constant in sdk-client.ts (which sdk-client then imports) so there is exactly one rate table.
  • ExtractUnitResult gains a grouping token set (inputTokens / cachedInputTokens / outputTokens, SDK path) and a separate vision token set (visionInputTokens / visionOutputTokens) sourced from the (newly token-returning) figure pass. CLI path falls back to recordedCostUsd only when token counts aren't available.
  • detectFigures() return type gains its usage (input/output tokens) so the worker can attribute vision cost.
  • Document detail API response gains ingestCostUsd + the stage breakdown.
  • chat_queries insert in /api/chat onFinish gains input_tokens + cost_usd (computed via computeQueryCost).
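
A minimal sketch of what pricing.ts could look like, to make the interface above concrete. The Haiku, embedding, cache and batch figures are the ones quoted in this doc. The Sonnet rates, the `computeStageCosts` helper and the optional `batch` flag are assumptions added for illustration. The sketch also assumes haikuInputTokens excludes cached-read tokens, as in the SDK usage object.

```typescript
// Sketch of ingestion/pricing.ts. All rates are USD per million tokens.
const PER_MILLION = 1_000_000;

const RATES = {
  haiku: { input: 1.0, output: 5.0 },   // quoted in this doc
  sonnet: { input: 3.0, output: 15.0 }, // ASSUMPTION: not stated in this doc, verify before use
  embedding: 0.02,                      // text-embedding-3-small, quoted in this doc
  cacheReadMultiplier: 0.1,
  cacheWriteMultiplier: 1.25,           // unused below: IngestTokens carries no cache-write count
  batchMultiplier: 0.5,
} as const;

interface IngestTokens {
  haikuInputTokens: number;       // assumed to exclude cached-read tokens
  haikuCachedInputTokens: number; // cached-read, billed at 0.1x
  haikuOutputTokens: number;
  visionInputTokens: number;
  visionOutputTokens: number;
  embedTokens: number;
}

interface QueryTokens {
  inputTokens: number; // Sonnet-rated
  outputTokens: number;
}

// Hypothetical: per-stage USD, the shape the inspector breakdown would need.
interface StageCosts {
  groupingUsd: number;
  visionUsd: number;
  embedUsd: number;
}

// Hypothetical flag: batch=true applies the Batch API discount to the Haiku stages only.
interface IngestCostOptions {
  batch?: boolean;
}

function computeStageCosts(t: IngestTokens, opts: IngestCostOptions = {}): StageCosts {
  const discount = opts.batch ? RATES.batchMultiplier : 1;
  const haikuUsd = (input: number, cachedRead: number, output: number): number =>
    (discount *
      (input * RATES.haiku.input +
        cachedRead * RATES.haiku.input * RATES.cacheReadMultiplier +
        output * RATES.haiku.output)) /
    PER_MILLION;
  return {
    groupingUsd: haikuUsd(t.haikuInputTokens, t.haikuCachedInputTokens, t.haikuOutputTokens),
    visionUsd: haikuUsd(t.visionInputTokens, 0, t.visionOutputTokens),
    embedUsd: (t.embedTokens * RATES.embedding) / PER_MILLION,
  };
}

// Pure and deterministic: total ingest cost is the sum of the three stages.
function computeIngestCost(t: IngestTokens, opts: IngestCostOptions = {}): number {
  const s = computeStageCosts(t, opts);
  return s.groupingUsd + s.visionUsd + s.embedUsd;
}

// Pure: query cost at the (assumed) Sonnet rates.
function computeQueryCost(t: QueryTokens): number {
  return (t.inputTokens * RATES.sonnet.input + t.outputTokens * RATES.sonnet.output) / PER_MILLION;
}
```

Returning per-stage costs from one helper keeps the inspector breakdown and the persisted total on the same arithmetic, so they cannot disagree.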

Key Algorithms / Logic

  • Cost from tokens, not from the envelope. We persist token counts per stage and compute USD via pricing.ts. This makes the displayed cost (a) identical across the dev Max-CLI and prod SDK, and (b) auto-correct if we flip caching/batch on. CLI-only recordedCostUsd is stored as a fallback when tokens are coarse.
  • Figure-vision is its own stage. The grouping call and the figure-vision call are both Haiku but are separate API calls within one unit's extraction. Their tokens are kept apart (vision_* columns) so the structured-doc breakdown — which is almost entirely vision — is legible and not conflated with grouping.
  • Atomic accumulation across parallel units. Extraction runs as one worker invocation per structural unit, in parallel (one SQS extract-job per unit). Per-doc totals must accumulate without lost updates: each unit handler does UPDATE documents SET haiku_input_tokens = haiku_input_tokens + $delta, vision_input_tokens = vision_input_tokens + $delta, ... (atomic increment), keyed by stage. finalize adds embed_tokens and writes the computed ingest_cost_usd.
  • Query cost on the real path. /api/chat already runs streamText (Sonnet) and writes a chat_queries row in onFinish({usage}). Extend that same write to compute cost_usd from usage.inputTokens/usage.outputTokens at Sonnet rates and persist input_tokens. The dev playground (route(), CLI envelope) is secondary — leave as-is.
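
The atomic-accumulation rule can be sketched as a small query builder. This is an illustration under assumptions, not the worker's actual code: it assumes a node-postgres style `$n` placeholder syntax and a documents primary key named `id`, and `buildIncrementQuery` is a hypothetical helper name. The column names are the ones proposed for migration 013.

```typescript
// Token columns a unit handler may increment (embed_tokens is written by finalize).
const TOKEN_COLUMNS = [
  "haiku_input_tokens",
  "haiku_cached_input_tokens",
  "haiku_output_tokens",
  "vision_input_tokens",
  "vision_output_tokens",
] as const;

type TokenColumn = (typeof TOKEN_COLUMNS)[number];
type StageDeltas = Partial<Record<TokenColumn, number>>;

interface ParameterizedQuery {
  text: string;
  values: (string | number)[];
}

// Builds one UPDATE that adds each delta in place (col = col + $n).
// The database applies the addition under its row lock, so parallel unit
// handlers cannot overwrite each other the way read-modify-write can.
// Returns null when there is nothing to add.
function buildIncrementQuery(documentId: string, deltas: StageDeltas): ParameterizedQuery | null {
  const sets: string[] = [];
  const values: (string | number)[] = [];
  // Iterating the fixed column list (not the caller's keys) keeps column
  // names out of user control and makes the statement order stable.
  for (const column of TOKEN_COLUMNS) {
    const delta = deltas[column];
    if (delta === undefined || delta === 0) continue;
    if (!Number.isInteger(delta) || delta < 0) {
      throw new Error(`invalid token delta for ${column}: ${delta}`);
    }
    values.push(delta);
    sets.push(`${column} = ${column} + $${values.length}`);
  }
  if (sets.length === 0) return null;
  values.push(documentId);
  // ASSUMPTION: the primary key column is named "id".
  return {
    text: `UPDATE documents SET ${sets.join(", ")} WHERE id = $${values.length}`,
    values,
  };
}

// Example: a unit that made one grouping call and one figure-vision call.
const example = buildIncrementQuery("doc-1", {
  haiku_input_tokens: 1200,
  haiku_output_tokens: 300,
  vision_input_tokens: 800,
  vision_output_tokens: 50,
});
console.log(example?.text);
```

One thing this sketch does not handle: if the queue can redeliver an extract job, the same unit's increments would be applied twice, so retries may need a per-unit guard.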

Data Model Changes

// pricing.ts
interface IngestTokens {
  haikuInputTokens: number;
  haikuCachedInputTokens: number; // cached-read, billed at 0.1×
  haikuOutputTokens: number;
  visionInputTokens: number;
  visionOutputTokens: number;
  embedTokens: number;
}

interface QueryTokens {
  inputTokens: number;   // Sonnet-rated
  outputTokens: number;
}

Migration 013: update both the raw-SQL migration and the Drizzle schema (app/lib/db/schema.ts):

  • On documents (or a 1:1 document_ingest_cost): ingest_cost_usd numeric, plus the token counters haiku_input_tokens, haiku_cached_input_tokens, haiku_output_tokens, vision_input_tokens, vision_output_tokens, embed_tokens (each bigint default 0).
  • On chat_queries: input_tokens int, cost_usd numeric.
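
As an illustration of how the two new chat_queries columns might be filled from the onFinish usage object, here is a small pure mapper. Everything named here is hypothetical: `UsageLike` is a stand-in for the AI SDK usage type (token counts are treated as possibly undefined), the Sonnet rates are assumed rather than taken from this doc, and cost_usd is emitted as a fixed-precision string on the assumption that the numeric column is written as a string.

```typescript
// Stand-in for the usage object passed to onFinish; fields may be missing.
interface UsageLike {
  inputTokens?: number;
  outputTokens?: number;
}

// ASSUMED Sonnet rates in USD per million tokens; in the real design these
// would come from pricing.ts rather than being repeated here.
const SONNET_USD_PER_MILLION = { input: 3.0, output: 15.0 };

interface ChatQueryCostColumns {
  input_tokens: number;    // new column
  response_tokens: number; // existing column (output tokens)
  cost_usd: string;        // new column, numeric written as a string
}

// Maps usage to column values. Missing counts become 0, which under-reports
// rather than failing the chat_queries write.
function toChatQueryCostColumns(usage: UsageLike): ChatQueryCostColumns {
  const inputTokens = usage.inputTokens ?? 0;
  const outputTokens = usage.outputTokens ?? 0;
  const costUsd =
    (inputTokens * SONNET_USD_PER_MILLION.input +
      outputTokens * SONNET_USD_PER_MILLION.output) /
    1_000_000;
  return {
    input_tokens: inputTokens,
    response_tokens: outputTokens,
    cost_usd: costUsd.toFixed(6), // six decimals keeps sub-cent queries visible
  };
}

console.log(toChatQueryCostColumns({ inputTokens: 10_000, outputTokens: 1_000 }));
```

Storing 0 for a missing count is a trade-off: it keeps the row write simple but makes "unknown" indistinguishable from "free", so a nullable column may be the better choice if missing usage turns out to be common.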

Edge Cases & Gotchas

| Scenario | Expected Behavior | Why It's Tricky |
| --- | --- | --- |
| Parallel unit extraction writing the same doc's cost | No lost updates | Must use atomic += increments, not read-modify-write |
| Dev runs Max CLI (cost "absorbed", no token breakdown) | UI total is prod-accurate; stage split only populates on the SDK path | Compute from tokens via pricing.ts; CLI gives a lump total_cost_usd only → verify the breakdown by running the SDK path locally |
| CLI path lacks per-token counts | Store recordedCostUsd fallback for the total | SDK gives tokens; CLI envelope gives cost, so the total still displays but the breakdown does not |
| Figure-vision call vs grouping call | Counted as separate stages | Both are Haiku, but separate API calls; figure-pass must be changed to return usage (today it's discarded as recordedCostUsd: 0) |
| Structured doc (route code) | Shows ~$0 grouping + figure-vision only | Cost source differs by route; the breakdown must make that legible, and the vision stage carries it |
| Query runs on Sonnet, ingest on Haiku | Query cost uses Sonnet rates | Two models, two rate sets in pricing.ts; don't price a query at Haiku rates |
| Prompt caching cache-read vs write | Inert today (sub-4096 prefix) | cache_read is always 0 until the prefix grows past Haiku's 4096 floor; S6 decides whether to grow it |
| Migration touches Drizzle-managed tables | New columns appear in both raw SQL and schema.ts | documents + chat_queries are Drizzle-defined; a raw-SQL-only migration leaves typed inserts blind to the columns |

Stories

| Story | Summary | Status | PR |
| --- | --- | --- | --- |
| S1 | Pricing module: ingestion/pricing.ts rate constants (Haiku + Sonnet/chat + embedding + cache/batch multipliers) + computeIngestCost and computeQueryCost (pure, unit-tested); consolidate the inline sdk-client PRICE | Not started | |
| S2 | Persist per-doc ingest cost: migration 013 (raw SQL + Drizzle schema); thread stage tokens through extractor/embed; change figure-pass/detectFigures to return usage so vision is a separate stage; atomic-increment per stage in extract, finalize-compute in finalize | Not started | |
| S3 | Inspector cost display: doc API + UI cost line with stage breakdown (Haiku grouping / figure vision / embed) on hover | Not started | |
| S4 | Query cost tracking: in /api/chat onFinish, compute query cost_usd from Sonnet rates via pricing.ts and persist it + input_tokens to chat_queries (the prod path; playground route() left as-is) | Not started | |
| S6 | Batch API + prompt-caching spike: measure batch latency/discount AND grow-prefix-past-4096 + caching savings + ITPM relief, and how they interact; produce a go/no-go (see below) | Not started | |

(Former S5 "turn caching on" is folded into S6 — at the current prefix size it's a no-op, so it becomes a measured lever in the spike rather than a standalone story.)

S6 — Batch + caching spike (detail). Questions to answer, not code to ship:

  1. Batch latency: real turnaround for a doc's units submitted as one batch vs the current synchronous per-unit path (median + worst case). Minutes or hours in practice?
  2. Progress UX: the app tracks progress_done / progress_total / units_total and renders a progress bar (per [[autri-progress-bar]] work). Batch returns units all-at-once rather than incrementally — how do we represent progress? Options: poll batch status → synthetic progress; "queued/processing/done" states; hybrid (sync the first N units for motion, batch the tail).
  3. Caching savings: does growing the static prefix past Haiku 4.5's 4096-token cacheable floor net out positive? Prefix repeats per unit, so cache-read (0.1×) replaces full input (1×) on units 2..N within the 5-min TTL, worth ~70% off the prefix portion (~10–20% off the grouping call, scaling with unit count). Caveat: grow the prefix with content that also helps extraction (clearer schema docs / a few-shot example) and re-run the chunking eval to confirm no regression; don't pad with filler. The larger prize may be ITPM rate-limit relief (cache reads are excluded from the input-TPM ceiling, which is the cause of the prod 429s under concurrent uploads), not the dollars.
  4. Interaction: does caching survive batch mode? The cache TTL is 5 minutes but batch jobs process asynchronously over minutes-to-hours — the first unit's cache-write may expire before the tail units run, so cache reads might not land in batch at all. Measure whether batch + caching actually stack or are mutually exclusive in practice.
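
The prefix arithmetic behind question 3 can be written down as a rough model. It is idealized: it assumes unit 1 pays the cache-write rate (1.25×) and every later unit lands a cache read (0.1×) inside the TTL, which is exactly what question 4 puts in doubt for batch. `prefixSavingsFraction` is a hypothetical helper name.

```typescript
// Multipliers relative to the normal input price, as quoted in this doc.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Fraction saved on the prefix portion of input cost for a doc with `units`
// grouping calls, under the idealized assumption that every call after the
// first hits the cache. Negative means caching costs more than it saves.
function prefixSavingsFraction(units: number): number {
  if (!Number.isInteger(units) || units < 1) {
    throw new Error(`units must be a positive integer, got ${units}`);
  }
  const uncached = units; // N calls, each paying 1x for the prefix
  const cached = CACHE_WRITE_MULTIPLIER + (units - 1) * CACHE_READ_MULTIPLIER;
  return 1 - cached / uncached;
}

for (const units of [1, 2, 5, 26]) {
  console.log(units, prefixSavingsFraction(units).toFixed(3));
}
```

Under this model a single-unit doc loses 25% on the prefix (it pays the write premium and never reads), two units already come out ahead, and five units save about 67%, in the neighborhood of the ~70% quoted above. Real savings will be lower whenever parallel units miss the cache or the TTL lapses, which is what the spike is meant to measure.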

Success criteria: a one-page recommendation with measured numbers (batch latency, batch discount realized, caching savings + whether it stacks with batch, ITPM-relief observed) and a progress-UX proposal — enough to decide whether to wire batch and/or grow-the-prefix-caching into prod (likely a later epic).

Decisions Log

| Date | Decision | Rationale | Alternatives Considered |
| --- | --- | --- | --- |
| 2026-06-06 | No usage caps/tiers for beta; measure instead | Can't price what we can't see; beta = measuring stick | Premium/cheap tiers now (cost savings too small to justify the surface) |
| 2026-06-06 | Cost shown in the UI per doc, stage-broken-down | Direct ask: Dan wants to literally see upload cost; feeds pricing model | Log-only / dashboard-only |
| 2026-06-06 | Per-doc total + stage breakdown granularity | Enough for the UI badge + pricing analysis without per-unit write volume | Per-doc only (less insight); per-unit + per-doc (more schema/writes) |
| 2026-06-06 | Compute cost from tokens via a pricing module | Consistent across Max-CLI (dev) and SDK (prod); auto-updates with caching/batch | Trust the CLI/SDK envelope cost directly (drifts across clients) |
| 2026-06-06 | Keep ingest synchronous for beta; batch is a spike only | Immediate cost feedback in the inspector; latency UX unknown | Adopt Batch API now (adds latency, muddies the "see cost after upload" UX) |
| 2026-06-06 | Track query cost too | Beta must measure total COGS (ingest + query), not just ingest | Ingest-only tracking |
| 2026-06-06b | Query cost on the prod /api/chat path, Sonnet rates | That's the real product query path (streamText, already writes chat_queries); it runs on Sonnet, not Haiku | Instrument the dev route() playground (not real traffic; CLI envelope cost); both paths (second rate source, low value) |
| 2026-06-06b | Figure-vision tokens surfaced as a separate stage | Structured docs (the pilot) are almost entirely figure-vision cost; today it's discarded (recordedCostUsd: 0) so the breakdown would read ~$0 | Defer vision tracking (structured docs under-report); lump vision into the Haiku bucket (can't separate grouping from figure cost) |
| 2026-06-06b | Caching folded into the batch spike (S6), not a standalone "free savings" story | It's a confirmed no-op below the 4096 floor; whether to grow the prefix is a measured trade-off (savings + ITPM relief vs prompt-quality risk), and it interacts with batch's TTL | Keep S5 as a separate "turn caching on" story (phantom savings); keep cache_control as a documented no-op (no action) |
| 2026-06-06b | Verify the stage breakdown by running the SDK path locally | Dev Max-CLI returns only a lump cost with no per-token breakdown; the SDK path populates the real stage split | Accept lump cost in dev (breakdown unverified locally); back out synthetic tokens from CLI cost (approximate/hacky) |
| 2026-06-06b | pricing.ts is the single rate table; consolidate the inline sdk-client PRICE | Avoid two rate sets drifting (the exact risk this log already flags) | Add pricing.ts alongside the existing inline PRICE (two sources of truth) |
| 2026-06-06b | Migration 013 updates both raw SQL and the Drizzle schema | documents + chat_queries are Drizzle-defined; raw-SQL-only leaves typed inserts blind to the new columns | Raw-SQL migration only (typed inserts can't write the columns) |

Test Layers

| Layer | Applies? | Notes |
| --- | --- | --- |
| Unit tests | Yes | pricing.ts computeIngestCost + computeQueryCost: rate math (Haiku, Sonnet, cached/batch flags), zero/edge inputs. Pure → runs in the gate. |
| Integration (DB) | Manual | Ingest a doc locally via the SDK path → assert per-doc cost rows, per-stage token columns (incl. vision), and atomic accumulation across parallel units. Needs Postgres + ANTHROPIC_API_KEY (not gated). |
| Integration (UI) | Manual | Inspector shows the cost line + breakdown; verify via headless preview (the /hl:ship QA backend). |
| End-to-end | Manual | Ingest one prose doc (route llm, Haiku grouping stage) + one structured PDF (route code, vision stage carries it) via the SDK path; confirm the cost contrast renders correctly. |

Required Fixtures

| Fixture Name | What It Tests | Priority |
| --- | --- | --- |
| pricing.computeIngestCost cases | Rate math: standard, cached-read discount, batch discount, vision + grouping + embed mixed-stage totals | 🔴 High |
| pricing.computeQueryCost cases | Sonnet in/out rate math; zero/edge inputs | 🔴 High |
| prose-vs-structured ingest pair (SDK path) | Route-dependent cost source (Haiku grouping vs figure-vision-only) renders correctly | 🟡 Medium |

Verification Rules

  1. computeIngestCost and computeQueryCost must have unit tests covering standard, cached, batch, and the Sonnet query path — these are the numbers users see.
  2. Atomic-increment accumulation must be verified under parallel unit writes (no lost updates), including the separate vision stage.
  3. Displayed cost must be computed from tokens, validated against hand-computed cost for a known doc ingested via the SDK path.
  4. Cost-affecting changes (caching, batch, rate constants, model swaps) require a re-verification of a known doc's displayed cost.

Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Lost updates accumulating per-doc cost across parallel units | Medium | Wrong cost shown | Atomic += increments keyed by stage; never read-modify-write |
| Dev (Max CLI) shows no stage breakdown | High | Breakdown unverified locally | Verify via the SDK path locally; CLI path stores the lump cost fallback |
| Figure-vision cost stays at $0 if figure-pass isn't changed | High | Structured-doc breakdown reads ~$0 (misleading) | S2 explicitly changes detectFigures to return usage; vision gets its own stage/column |
| Query priced at the wrong (Haiku) rate | Medium | Wrong query COGS | pricing.ts carries the Sonnet rate; computeQueryCost is the only path |
| Batch latency breaks the progress-bar UX | Medium | Poor ingest UX if adopted | S6 spike measures it first; batch stays out of prod until resolved |
| Caching savings overstated / doesn't survive batch | Medium | Phantom savings in the model | S6 measures the 4096-floor savings AND the TTL-vs-batch-latency interaction before any claim |
| Query cost unbounded by any ingest quota | Medium | COGS surprise from chatty users | Track it in beta (this epic); set query limits later from data |
| Rate constants drift as Anthropic pricing changes | Low | Stale displayed cost | Centralized in pricing.ts; one edit updates everything |

Known Issues / Tech Debt

| Issue | Severity | Notes |
| --- | --- | --- |
| Deterministic-overflow routing not wired | Low | Route code exists and ties Haiku except on dialogue; wire as a quota fallback once caps exist (future epic) |
| Batch API not in prod | Low | Deferred by decision; S6 produces the go/no-go |
| Prompt caching inert at current prefix size | Low | cache_control is in place but sub-4096; S6 decides whether to grow the prefix (cost + ITPM relief vs prompt-quality risk) |
| Dev playground query cost (route()) not tracked | Low | Intentional: playground isn't real traffic; only /api/chat query cost is persisted |

S6 Results — Batch + Caching Spike (2026-06-06)

Verdict (REVISED after a 3-doc A/B): adopt batch for llm-routed docs — it's ~50% cheaper AND ~2× faster than the production sync path. Skip caching. Full writeup: docs/s6-batch-caching-spike.md (repo). Spend: ~$2.5 across the spike + A/B.

The 3-document batch-vs-sync A/B

Each doc ran through the real SDK pipeline at production concurrency (3) with every request captured, then those exact requests were replayed as one batch. Faithful replay, not reconstruction.

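| Doc (type) | LLM calls | Sync cost | Batch cost | Saved | Sync latency | Batch latency |
| --- | --- | --- | --- | --- | --- | --- |
| docx novel (prose → Haiku grouping) | 26 | $0.683 | $0.358 | ~48% | 201s | 91s |
| Genesis (PDF → Haiku grouping*) | 45 | $0.780 | $0.386 | ~50% | 192s | 61s |
| SRWF26 Technical (structured + figures) | 2 + 8 vision | $0.093 | $0.046 | ~51% | 23s | 45s |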

Findings

  1. Batch is faster for big docs, not slower (the assumption-flip). Batch requests run on a separate rate-limit bucket, sidestepping our Tier-1 ITPM ceiling (50K) that caps sync at ~3 concurrent. Cost win is unconditional; latency win holds while sync is ITPM-throttled.
  2. Savings concentrate by routing. Prose/verse → Haiku grouping (large batchable surface, real $); structured → mostly code (tiny surface, ~$0.05 even at 50%). Route prose+verse to batch; keep structured on sync.
  3. The only UX cost is the progress bar. request_counts is contractually pinned ({processing: all, succeeded: 0} until the batch ends — no incremental status, no webhooks; research-confirmed). Batched docs show an indeterminate "Processing…"; the bell/notification model already covers "done."
  4. Caching is the weaker lever — skip it. Static prefix is 3,014 tokens, below Haiku 4.5's 4,096 floor (cache_read=0, measured); growing past it saves ~$0.02/doc and doesn't stack with batch (cache TTL expires before async batch runs). Inert cache_control left in place as a documented no-op.
  5. Surprise (separate follow-up): the PDF Bible (Genesis) routed to Haiku grouping, not the deterministic verse path — 45 LLM calls, $0.78. A raw PDF doesn't recover authored verse boundaries → falls to the LLM. If PDF scripture should route deterministic, that's a bigger win than batch for that type.

Follow-ups

  • Open a "batch ingestion" epic: new batch sink for llm-routed units; indeterminate progress state; batch max-latency fallback (SLA <1h but not guaranteed — a stuck batch must not strand a doc).
  • Investigate PDF-scripture routing (Genesis) — deterministic verse path vs LLM grouping.
  • Future experiment (Dan): one-big-batch vs several-small-batches, to recover coarse progress granularity while keeping the discount.
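
To make the first follow-up concrete, here is one way the indeterminate progress state and the max-latency fallback could be decided from a polled batch status. This is a sketch for the future epic, not a design decision: `resolveBatchState`, the state names and the deadline parameter are all hypothetical. The status strings are the Message Batches processing_status values.

```typescript
// processing_status values reported for a message batch.
type BatchProcessingStatus = "in_progress" | "canceling" | "ended";

// Hypothetical ingest-side states:
//   processing       -> UI shows the indeterminate "Processing…" state
//   collect_results  -> batch ended, fetch per-unit results and continue to finalize
//   fallback_to_sync -> batch exceeded our deadline, re-run the units on the sync path
type IngestBatchState = "processing" | "collect_results" | "fallback_to_sync";

// Pure decision function, called on each poll.
function resolveBatchState(
  status: BatchProcessingStatus,
  submittedAtMs: number,
  nowMs: number,
  maxLatencyMs: number,
): IngestBatchState {
  // A finished batch is always collected, even if it finished late.
  if (status === "ended") return "collect_results";
  const elapsedMs = nowMs - submittedAtMs;
  // No incremental counts are available, so elapsed time is the only signal.
  return elapsedMs > maxLatencyMs ? "fallback_to_sync" : "processing";
}

// Example: a batch polled 10 minutes after submission with a 1-hour deadline.
const tenMinutesMs = 10 * 60 * 1000;
const oneHourMs = 60 * 60 * 1000;
console.log(resolveBatchState("in_progress", 0, tenMinutesMs, oneHourMs)); // "processing"
```

A caller that acts on fallback_to_sync would also want to cancel the stuck batch, so that a late completion does not bill the same units twice.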
