projects/autri/archive/sub-systems/incremental-re-ingestion

Rough draft (2026-05-28) — captures the strategy for document versioning + incremental re-ingestion so it can be red/blue-teamed post-beta and implemented in v1.1. NOT in EPIC-4.5 scope. Builds on D15 (logical documents + supersession, already in the schema) and the chunk-level idempotency fix shipping with EPIC-4.5 (per-unit replace + chunks.unit_index).

Problem

When a user re-uploads a document — the same file, or an updated version of one already in a KB — the pipeline re-runs the full render → extract → embed path from scratch. Two costs:

Wasted spend + time. The per-unit LLM extract call is the cost driver, so re-extracting unchanged content is pure waste. Under the "meter ingestion, be generous on storage" pricing direction (D18), it is also waste the user may be metered for.
Lost continuity. A new version supersedes the old (D15), but we don't carry forward what didn't change or surface what did.

Goal: detect that an upload is a duplicate or a new version of an existing doc, re-extract only what changed, and carry forward the rest.

Two separate problems (don't conflate)

Chunk-level idempotency — infrastructure correctness under SQS at-least-once delivery. Solved in EPIC-4.5 via per-unit replace (delete a unit's chunks, re-insert) keyed on chunks.unit_index. Not this doc.
Document versioning + incremental re-ingestion — product value. This doc.
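For contrast, the per-unit replace behind the first problem can be sketched as below. Illustration only: sqlite3 stands in for the real store, and the table layout (document_id, unit_index, text) and the function name replace_unit_chunks are assumptions, not the EPIC-4.5 implementation.

```python
import sqlite3


def replace_unit_chunks(conn, document_id, unit_index, texts):
    # Delete + re-insert inside one transaction, so a redelivered message
    # replaces the unit's chunks instead of appending duplicates.
    with conn:
        conn.execute(
            "DELETE FROM chunks WHERE document_id = ? AND unit_index = ?",
            (document_id, unit_index),
        )
        conn.executemany(
            "INSERT INTO chunks (document_id, unit_index, text) VALUES (?, ?, ?)",
            [(document_id, unit_index, t) for t in texts],
        )


def count_chunks(conn, document_id, unit_index):
    row = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE document_id = ? AND unit_index = ?",
        (document_id, unit_index),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (document_id INTEGER, unit_index INTEGER, text TEXT)")

replace_unit_chunks(conn, 1, 0, ["a", "b"])
# Simulated redelivery: the LLM produced different chunks the second time.
replace_unit_chunks(conn, 1, 0, ["a2", "b2", "c2"])
print(count_chunks(conn, 1, 0))  # 3
```

This makes redelivery safe, but it says nothing about which units changed between versions. That is the second problem.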

Key insight: diff the deterministic layer, not chunks

Chunk boundaries are LLM-produced and non-deterministic: the same source text re-extracted yields slightly different chunks, so chunks cannot be reliably diffed or matched across versions. The diff + reuse must happen at the deterministic layer, i.e. the parsed paragraphs / structural units (pdftotext + the structure heuristics give byte-stable output for the same input). Re-extract only the units whose source changed; carry forward the chunks of units whose source is identical.
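A minimal sketch of hashing at the deterministic layer, assuming a unit's extractor input is an ordered list of paragraph strings. SHA-256 and the name unit_source_hash are assumptions; the doc does not fix a hash function.

```python
import hashlib


def unit_source_hash(paragraphs: list[str]) -> str:
    """Hash the ordered paragraphs that make up one unit's extractor input."""
    h = hashlib.sha256()
    for p in paragraphs:
        data = p.encode("utf-8")
        # Length-prefix each paragraph so ["ab", "c"] and ["a", "bc"]
        # do not collapse to the same byte stream.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


v1 = ["Refunds are issued within 30 days.", "Contact support to start a claim."]
v2_same = list(v1)
v2_edited = ["Refunds are issued within 14 days.", "Contact support to start a claim."]

print(unit_source_hash(v1) == unit_source_hash(v2_same))    # True
print(unit_source_hash(v1) == unit_source_hash(v2_edited))  # False
```

Any normalization (whitespace, hyphenation) would have to run before hashing and be just as deterministic, otherwise identical source reads as changed.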

Strategy sketch (to be red-teamed)

1. Document identity

Byte-identical → documents.source_hash (already in schema). Same hash = duplicate upload; skip ingestion, point at the existing doc.
New version of an existing doc → D15's version-detection heuristic (filename + title + structural overlap) yields a candidate "supersedes" link.
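The byte-identical branch can be sketched as below, assuming documents.source_hash is a SHA-256 of the uploaded bytes (the doc does not name the algorithm). The Document class, the in-memory lookup, and resolve_upload are hypothetical stand-ins; the version-detection branch is only signalled, since D15's heuristic is not specified here.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    id: int
    filename: str
    source_hash: str


def source_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def resolve_upload(
    data: bytes, existing: dict[str, Document]
) -> tuple[str, Optional[Document]]:
    # existing maps source_hash -> Document, standing in for a lookup
    # on documents.source_hash within one KB.
    doc = existing.get(source_hash(data))
    if doc is not None:
        # Same bytes: skip ingestion and point at the existing doc.
        return ("duplicate", doc)
    # Different bytes: hand off to version detection (D15), not shown.
    return ("needs_version_check", None)


original = b"%PDF-1.7 handbook v1"
existing = {source_hash(original): Document(1, "handbook.pdf", source_hash(original))}

print(resolve_upload(original, existing)[0])                 # duplicate
print(resolve_upload(b"%PDF-1.7 handbook v2", existing)[0])  # needs_version_check
```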

2. Unit-level diff

Compute a per-unit source hash = hash of the unit's input paragraphs (the deterministic bundle the extractor consumes), for both the superseded version and the new upload.
Match units across versions by content + anchor, not position — unit_index shifts when sections are added/removed, so positional matching is wrong.
Classify each new unit: unchanged (source hash matches a prior unit), changed (anchor matches, source hash differs), new (no prior match). Prior units that no new unit matches are classified as removed.
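The matching + classification above can be sketched as a two-pass diff. Assumptions: each unit already carries its source hash, and "anchor" is some stable label such as a heading path (the doc does not define it). Unit and diff_units are hypothetical names, and the hash strings are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    unit_index: int
    anchor: str       # hypothetical stable label, e.g. a heading path
    source_hash: str  # hash of the unit's input paragraphs


def diff_units(prior: list[Unit], new: list[Unit]):
    claimed: set[int] = set()  # prior unit_index values already matched
    status: dict[int, tuple[str, Optional[Unit]]] = {}

    def take(pred) -> Optional[Unit]:
        # Claim the first unmatched prior unit satisfying pred.
        for p in prior:
            if p.unit_index not in claimed and pred(p):
                claimed.add(p.unit_index)
                return p
        return None

    # Pass 1: content matches win first, so an anchor match cannot steal
    # a prior unit that is identical to some other new unit.
    for u in new:
        p = take(lambda q: q.source_hash == u.source_hash)
        if p is not None:
            status[u.unit_index] = ("unchanged", p)

    # Pass 2: anchor matches among what is left, else the unit is new.
    for u in new:
        if u.unit_index in status:
            continue
        p = take(lambda q: q.anchor == u.anchor)
        status[u.unit_index] = ("changed", p) if p is not None else ("new", None)

    removed = [p for p in prior if p.unit_index not in claimed]
    return status, removed


prior = [
    Unit(0, "intro", "h-intro"),
    Unit(1, "pricing", "h-pricing-v1"),
    Unit(2, "legacy", "h-legacy"),
]
new = [
    Unit(0, "intro", "h-intro"),
    Unit(1, "faq", "h-faq"),
    Unit(2, "pricing", "h-pricing-v2"),
]

status, removed = diff_units(prior, new)
for idx in sorted(status):
    print(idx, status[idx][0])      # 0 unchanged, 1 new, 2 changed
print([u.anchor for u in removed])  # ['legacy']
```

In the example, "pricing" moves from unit_index 1 to 2 because "faq" was inserted; a positional match would have compared "faq" against the old "pricing". The sketch does not attempt merged or split units.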

3. Reuse + re-extract

Unchanged → carry forward the prior version's chunks + embeddings (no LLM call, no embed call).
Changed / new → re-extract via the normal per-unit path.
Removed → drop, or leave on the superseded version per the supersession model.
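Given a classification, the reuse step reduces to a plan: which chunks come along and which units go to the extractor. A sketch under stated assumptions: Chunk, plan_reingestion, and the status map shape are hypothetical, and the sketch copies carried chunks under the new unit_index only for simplicity. Whether to copy, re-point, or attach to the logical document is still open.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    unit_index: int
    text: str
    embedding: tuple[float, ...]


def plan_reingestion(
    statuses: dict[int, tuple[str, Optional[int]]],
    prior_chunks: dict[int, list[Chunk]],
):
    """statuses maps new unit_index -> (status, prior unit_index or None)."""
    carried: list[Chunk] = []
    to_extract: list[int] = []
    for new_index in sorted(statuses):
        status, prior_index = statuses[new_index]
        if status == "unchanged" and prior_index is not None:
            # Reuse text + embedding as-is: no LLM call, no embed call.
            for c in prior_chunks.get(prior_index, []):
                carried.append(replace(c, unit_index=new_index))
        else:
            # changed / new: normal per-unit extract path.
            to_extract.append(new_index)
    return carried, to_extract


statuses = {0: ("unchanged", 0), 1: ("new", None), 2: ("changed", 1), 3: ("unchanged", 2)}
prior_chunks = {
    0: [Chunk(0, "a", (0.1,)), Chunk(0, "b", (0.2,))],
    1: [Chunk(1, "c", (0.3,))],
    2: [Chunk(2, "d", (0.4,))],
}

carried, to_extract = plan_reingestion(statuses, prior_chunks)
print([(c.unit_index, c.text) for c in carried])  # [(0, 'a'), (0, 'b'), (3, 'd')]
print(to_extract)                                 # [1, 2]
```

Here two of four units reach the extractor, so the LLM call count for this upload drops from 4 to 2.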

Open questions (red-team targets)

Chunk ownership across versions — chunks reference document_id. Does carry-forward re-point chunks to the new version's row, copy them, or attach chunks to the logical_document? This is the central data-model decision.
Unit-matching robustness — anchor + source-hash handles re-numbered sections; what about reordered, merged, or split units? A false "unchanged" surfaces stale content as current.
Version-detection false positives — wrongly treating an unrelated upload as a version of X corrupts the lineage. Confidence threshold + human confirm (D15)?
Inspector UX — surface "carried forward" vs "re-extracted (changed)" vs "new" so the user can trust the diff. Ties to D15's per-chunk version envelope.
Cost metering — incremental re-ingestion meters only changed units; reconcile with the EPIC-5 pricing model.
Worth it at beta scale? Full re-extract is fine until re-uploads are common — validate demand before building.

Dependencies / prior art

D15 — logical documents + supersession + version-detection heuristic (schema + design exist; version UX deferred).
EPIC-4.5 — per-unit replace + chunks.unit_index (the chunk-identity primitive this builds on).
New artifact: a per-unit source hash (hash of the unit's input paragraphs) drives the reuse decision.

Next: /hl:red-team post-beta → /hl:blue-team into a requirements contract → implement in v1.1.
