Rough draft (2026-05-28) — captures the strategy for document versioning + incremental re-ingestion so it can be red/blue-teamed post-beta and implemented in v1.1. NOT in EPIC-4.5 scope. Builds on D15 (logical documents + supersession, already in the schema) and the chunk-level idempotency fix shipping with EPIC-4.5 (per-unit replace + chunks.unit_index).
Problem
When a user re-uploads a document — the same file, or an updated version of one already in a KB — the pipeline re-runs the full render → extract → embed path from scratch. Two costs:
- Wasted spend + time. The LLM extract call (per unit) is the cost driver; re-extracting unchanged content is pure waste — and under the "meter ingestion, be generous on storage" pricing direction (D18), it's waste the user may be metered on.
- Lost continuity. A new version supersedes the old (D15), but we don't carry forward what didn't change or surface what did.
Goal: detect that an upload is a duplicate or a new version of an existing doc, re-extract only what changed, and carry forward the rest.
Two separate problems (don't conflate)
- Chunk-level idempotency: infrastructure correctness under SQS at-least-once delivery. Solved in EPIC-4.5 via per-unit replace (delete a unit's chunks, re-insert) keyed on chunks.unit_index. Not this doc.
- Document versioning + incremental re-ingestion: product value. This doc.
Key insight: diff the deterministic layer, not chunks
Chunk boundaries are LLM-produced and non-deterministic — the same source text re-extracted yields slightly different chunks. So you cannot reliably diff or match chunks across versions. The diff + reuse must happen at the deterministic layer: the parsed paragraphs / structural units (pdftotext + the structure heuristics are byte-stable). Re-extract only the units whose source changed; carry forward the chunks of units whose source is identical.
Strategy sketch (to be red-teamed)
1. Document identity
- Byte-identical → documents.source_hash (already in schema). Same hash = duplicate upload; skip ingestion, point at the existing doc.
- New version of an existing doc → D15's version-detection heuristic (filename + title + structural overlap) yields a candidate "supersedes" link.
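A minimal sketch of the byte-identical check, under stated assumptions: the draft does not say which hash function backs documents.source_hash, so SHA-256 is assumed here, and `existing_by_hash` is a hypothetical in-memory stand-in for a lookup against the documents table.

```python
import hashlib
from typing import Optional


def source_hash(data: bytes) -> str:
    """Hash the raw upload bytes. SHA-256 is an assumption, not confirmed by the schema."""
    return hashlib.sha256(data).hexdigest()


def find_duplicate(data: bytes, existing_by_hash: dict[str, str]) -> Optional[str]:
    """Return the id of an existing document with identical bytes, or None.

    existing_by_hash is a hypothetical map of source_hash -> document id.
    """
    return existing_by_hash.get(source_hash(data))


# A hit means: skip ingestion and point at the existing doc.
existing = {source_hash(b"v1 of the handbook"): "doc-1"}
duplicate_id = find_duplicate(b"v1 of the handbook", existing)
```

A miss here says nothing about whether the upload is a new version; that decision still falls to the D15 heuristic.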
2. Unit-level diff
- Compute a per-unit source hash = hash of the unit's input paragraphs (the deterministic bundle the extractor consumes), for both the superseded version and the new upload.
- Match units across versions by content + anchor, not position: unit_index shifts when sections are added/removed, so positional matching is wrong.
- Classify each new unit: unchanged (source-hash matches a prior unit), changed (anchor matches, source differs), new (no prior match). Prior units with no match are removed.
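The classification above could be sketched roughly as follows. Everything here is illustrative: the `Unit` shape, the `anchor` field (imagined as something like a heading path), SHA-256, and the two-pass matching order are assumptions, not the decided design. Hash matches are resolved first so that an unchanged unit that merely moved is not mislabelled as changed.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    anchor: str                  # hypothetical stable label, e.g. a heading path
    paragraphs: tuple[str, ...]  # the deterministic bundle the extractor consumes


def unit_source_hash(unit: Unit) -> str:
    h = hashlib.sha256()
    for p in unit.paragraphs:
        data = p.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # length prefix keeps paragraph boundaries unambiguous
        h.update(data)
    return h.hexdigest()


def classify_units(prior: list[Unit], new: list[Unit]):
    """Return (labels, removed): labels[j] = (status, prior index or None) for new[j]."""
    by_hash: dict[str, list[int]] = {}
    for i, u in enumerate(prior):
        by_hash.setdefault(unit_source_hash(u), []).append(i)

    matched: set[int] = set()
    labels: list[Optional[tuple[str, Optional[int]]]] = [None] * len(new)

    # Pass 1: identical source hash -> unchanged, regardless of position.
    for j, u in enumerate(new):
        free = [i for i in by_hash.get(unit_source_hash(u), []) if i not in matched]
        if free:
            matched.add(free[0])
            labels[j] = ("unchanged", free[0])

    # Pass 2: same anchor, different source -> changed; otherwise new.
    by_anchor: dict[str, int] = {}
    for i, u in enumerate(prior):
        if i not in matched:
            by_anchor.setdefault(u.anchor, i)
    for j, u in enumerate(new):
        if labels[j] is not None:
            continue
        i = by_anchor.get(u.anchor)
        if i is not None and i not in matched:
            matched.add(i)
            labels[j] = ("changed", i)
        else:
            labels[j] = ("new", None)

    removed = [i for i in range(len(prior)) if i not in matched]
    return labels, removed


prior = [Unit("intro", ("a",)), Unit("scope", ("b",)), Unit("risks", ("c",))]
new = [Unit("intro", ("a",)), Unit("background", ("x",)), Unit("scope", ("b2",))]
labels, removed = classify_units(prior, new)
```

In this toy run the inserted "background" unit shifts "scope" from index 1 to index 2, yet "scope" still matches its prior unit by anchor. The sketch does not attempt merged or split units.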
3. Reuse + re-extract
- Unchanged → carry forward the prior version's chunks + embeddings (no LLM call, no embed call).
- Changed / new → re-extract via the normal per-unit path.
- Removed → drop, or leave on the superseded version per the supersession model.
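As a rough illustration of the reuse step, the sketch below assumes each new unit already carries a label of the form (status, prior unit index or None), with status one of "unchanged", "changed", or "new". `assemble_chunks`, `prior_chunks`, and the `extract` callable are hypothetical names; `extract` stands in for the normal per-unit LLM extract + embed path. Chunk ownership is sidestepped by copying, which is only one of the options still open.

```python
from typing import Callable, Optional

Label = tuple[str, Optional[int]]  # (status, index of the matched prior unit)


def assemble_chunks(
    labels: list[Label],
    new_units: list[str],
    prior_chunks: dict[int, list[str]],
    extract: Callable[[str], list[str]],
) -> dict[int, list[str]]:
    """Build chunks for the new version, calling extract only where the source changed."""
    chunks_by_unit: dict[int, list[str]] = {}
    for j, (status, prior_idx) in enumerate(labels):
        if status == "unchanged" and prior_idx is not None:
            # Carry forward: no LLM call, no embed call. Copying is an assumption here.
            chunks_by_unit[j] = list(prior_chunks[prior_idx])
        else:
            # Changed or new: go through the normal per-unit path.
            chunks_by_unit[j] = extract(new_units[j])
    return chunks_by_unit


calls: list[str] = []


def fake_extract(text: str) -> list[str]:
    calls.append(text)  # records which units would have cost an LLM call
    return [text.upper()]


labels = [("unchanged", 0), ("new", None), ("changed", 1)]
result = assemble_chunks(labels, ["a", "x", "b2"], {0: ["A-chunk"], 1: ["B-chunk"]}, fake_extract)
```

Here `calls` ends up holding only the two units that needed extraction, which is also the quantity an incremental metering scheme would count.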
Open questions (red-team targets)
- Chunk ownership across versions: chunks reference document_id. Does carry-forward re-point chunks to the new version's row, copy them, or attach chunks to the logical_document? This is the central data-model decision.
- Unit-matching robustness: anchor + source-hash handles re-numbered sections; what about reordered, merged, or split units? A false "unchanged" surfaces stale content as current.
- Version-detection false positives — wrongly treating an unrelated upload as a version of X corrupts the lineage. Confidence threshold + human confirm (D15)?
- Inspector UX — surface "carried forward" vs "re-extracted (changed)" vs "new" so the user can trust the diff. Ties to D15's per-chunk version envelope.
- Cost metering — incremental re-ingestion meters only changed units; reconcile with the EPIC-5 pricing model.
- Worth it at beta scale? Full re-extract is fine until re-uploads are common — validate demand before building.
Dependencies / prior art
- D15 — logical documents + supersession + version-detection heuristic (schema + design exist; version UX deferred).
- EPIC-4.5: per-unit replace + chunks.unit_index (the chunk-identity primitive this builds on).
- New artifact: a per-unit source hash (hash of the unit's input paragraphs) drives the reuse decision.
Next: /hl:red-team post-beta → /hl:blue-team into a requirements contract → implement in v1.1.