Foundry Foundry

Rough draft (2026-05-28) — captures the strategy for document versioning + incremental re-ingestion so it can be red/blue-teamed post-beta and implemented in v1.1. NOT in EPIC-4.5 scope. Builds on D15 (logical documents + supersession, already in the schema) and the chunk-level idempotency fix shipping with EPIC-4.5 (per-unit replace + chunks.unit_index).

Problem

When a user re-uploads a document — the same file, or an updated version of one already in a KB — the pipeline re-runs the full render → extract → embed path from scratch. Two costs:

  • Wasted spend + time. The LLM extract call (per unit) is the cost driver; re-extracting unchanged content is pure waste — and under the "meter ingestion, be generous on storage" pricing direction (D18), it's waste the user may be metered on.
  • Lost continuity. A new version supersedes the old (D15), but we don't carry forward what didn't change or surface what did.

Goal: detect that an upload is a duplicate or a new version of an existing doc, re-extract only what changed, and carry forward the rest.

Two separate problems (don't conflate)

  1. Chunk-level idempotency — infrastructure correctness under SQS at-least-once delivery. Solved in EPIC-4.5 via per-unit replace (delete a unit's chunks, re-insert) keyed on chunks.unit_index. Not this doc.
  2. Document versioning + incremental re-ingestion — product value. This doc.

Key insight: diff the deterministic layer, not chunks

Chunk boundaries are LLM-produced and non-deterministic — the same source text re-extracted yields slightly different chunks. So you cannot reliably diff or match chunks across versions. The diff + reuse must happen at the deterministic layer: the parsed paragraphs / structural units (pdftotext + the structure heuristics are byte-stable). Re-extract only the units whose source changed; carry forward the chunks of units whose source is identical.

Strategy sketch (to be red-teamed)

1. Document identity

  • Byte-identicaldocuments.source_hash (already in schema). Same hash = duplicate upload; skip ingestion, point at the existing doc.
  • New version of an existing doc → D15's version-detection heuristic (filename + title + structural overlap) yields a candidate "supersedes" link.

2. Unit-level diff

  • Compute a per-unit source hash = hash of the unit's input paragraphs (the deterministic bundle the extractor consumes), for both the superseded version and the new upload.
  • Match units across versions by content + anchor, not position — unit_index shifts when sections are added/removed, so positional matching is wrong.
  • Classify each new unit: unchanged (source-hash matches a prior unit), changed (anchor matches, source differs), new (no prior match). Prior units with no match are removed.

3. Reuse + re-extract

  • Unchanged → carry forward the prior version's chunks + embeddings (no LLM call, no embed call).
  • Changed / new → re-extract via the normal per-unit path.
  • Removed → drop, or leave on the superseded version per the supersession model.

Open questions (red-team targets)

  • Chunk ownership across versions — chunks reference document_id. Does carry-forward re-point chunks to the new version's row, copy them, or attach chunks to the logical_document? This is the central data-model decision.
  • Unit-matching robustness — anchor + source-hash handles re-numbered sections; what about reordered, merged, or split units? A false "unchanged" surfaces stale content as current.
  • Version-detection false positives — wrongly treating an unrelated upload as a version of X corrupts the lineage. Confidence threshold + human confirm (D15)?
  • Inspector UX — surface "carried forward" vs "re-extracted (changed)" vs "new" so the user can trust the diff. Ties to D15's per-chunk version envelope.
  • Cost metering — incremental re-ingestion meters only changed units; reconcile with the EPIC-5 pricing model.
  • Worth it at beta scale? Full re-extract is fine until re-uploads are common — validate demand before building.

Dependencies / prior art

  • D15 — logical documents + supersession + version-detection heuristic (schema + design exist; version UX deferred).
  • EPIC-4.5 — per-unit replace + chunks.unit_index (the chunk-identity primitive this builds on).
  • New artifact: a per-unit source hash (hash of the unit's input paragraphs) drives the reuse decision.

Next: /hl:red-team post-beta → /hl:blue-team into a requirements contract → implement in v1.1.

Review

🔒

Enter your access token to view annotations