Regression QA Pipeline
Operational design for Crucible's regression sweep — when it runs, how it handles drift, what evidence it produces, and how baselines are maintained.
Status: Drafting · Parent: Crucible Design Doc · Created: 2026-04-16
Overview
The regression QA agent sweeps all stored baselines for a project to catch unintended visual drift. It runs alongside (not instead of) the feature QA agent. Where feature QA verifies "does the new thing work?", regression QA verifies "did the new thing break anything else?"
The baseline store IS the regression suite — list_baselines returns everything the agent needs to check. No manual test list to maintain.
Trigger Policy
Decision: Sequential — feature QA first, then regression.
The pipeline runs in this order:
- Feature QA agent runs against the PR branch test-env
- Feature QA passes → orchestrator reviews the report and approves updated baselines
- Regression QA agent runs against the same test-env, now with fresh baselines
- Both reports feed the orchestrator's merge/fix decision
This eliminates the "intentional vs. unintentional drift" problem entirely. By the time regression runs, all baselines are current. Any drift regression finds is a real regression — no ambiguity, no classification step.
The trade-off is wall time: regression waits for feature QA + baseline approval (~2-5 min). Worth it — a few minutes of testing saves hours of rework.
Scaling Strategy
As the baseline count grows (15-20+), a single regression agent becomes slow and risks context window bloat from accumulated screenshots. The solution is sharding:
- The regression prompt accepts an optional `baselines` list parameter
- If omitted, the agent discovers all baselines via `list_baselines`
- If provided, it only checks those baselines
- The orchestrator shards by splitting the full baseline list across N agents
- Target: ~10-15 baselines per agent (tuned by context window pressure, not time)
Defer building the sharding orchestration until we hit the pain point. Design the template to accept the parameter now.
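The split itself is trivial. A minimal sketch, assuming a round-robin policy and these function names (neither is the orchestrator's actual code):

```javascript
// Split the full baseline list into the smallest number of shards that
// keeps every shard at or under the cap, then deal baselines round-robin
// so no single agent gets a long tail.
function shardBaselines(baselines, maxPerShard = 15) {
  if (baselines.length === 0) return [];
  const shardCount = Math.ceil(baselines.length / maxPerShard);
  const shards = Array.from({ length: shardCount }, () => []);
  baselines.forEach((b, i) => shards[i % shardCount].push(b));
  return shards;
}
```

With 23 baselines and a cap of 15, this yields two shards of 12 and 11 rather than one agent with 15 and another with 8.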
Manual Triggers
Two skills for ad-hoc use:
- `/qa-regression` — boots test-env if needed, runs regression sweep against current state, reports back. Good for post-deploy verification or confidence checks.
- `/qa-feature` — takes a PR number or branch name, boots the branch test-env, runs feature QA, reports back. Good for re-running QA after fixes.
Both Foundry-specific initially (hardcoded project="foundry"), generalizable when the adapter format lands. Build after validating the pipeline end-to-end.
Drift Semantics
Decision: No drift classification needed — baselines are always current when regression runs.
The sequential pipeline (feature QA → approve baselines → regression) means the regression agent never encounters intentional drift. By the time it runs, the orchestrator has already approved any baseline updates that the feature change required.
This makes the regression agent's logic dead simple: does everything match? Yes or no. No nuance, no "known-affected baselines" hint list, no classification step. Any diff that exceeds tolerance is a real regression.
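The "dead simple" rule reduces to a one-line filter. A sketch under assumed field names (`spec`, `score`) and the 1% tolerance used elsewhere in this doc:

```javascript
// Regression verdict: every baseline either matches within tolerance or
// it is a real regression. No classification step, no hint lists.
function regressionVerdict(results, tolerance = 0.01) {
  const drifted = results.filter(r => 1 - r.score > tolerance);
  return drifted.length === 0
    ? { verdict: "PASS", drifted: [] }
    : { verdict: "ISSUES_FOUND", drifted: drifted.map(r => r.spec) };
}
```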
Why This Works
- Feature QA catches whether the new thing works correctly
- The orchestrator reviews feature QA's report and approves baseline updates for pages that intentionally changed
- Regression QA then verifies that nothing else broke — with a clean set of baselines that already reflect the approved changes
Edge Case: Cascading Visual Changes
A feature change might affect pages that weren't in the feature QA's scope. For example, a global CSS change to font size would affect every baselined page. In this case:
- Feature QA passes (the targeted change looks correct)
- Orchestrator approves baselines for the pages feature QA checked
- Regression finds drift on OTHER pages that weren't in feature QA's scope
- Regression reports ISSUES_FOUND — orchestrator triages: is this the expected cascade, or an actual bug?
- If expected cascade: orchestrator approves those baselines too, re-runs regression
- If bug: orchestrator fires a fix agent
This is the one scenario where regression might need a second pass. Acceptable — it only happens with broad visual changes.
Baseline Hygiene
Baselines are maintained as part of the pipeline flow, not as a separate maintenance task.
When Baselines Get Updated
- During feature QA: The feature QA agent identifies pages that changed. The orchestrator approves updated baselines before regression runs.
- During regression (cascade): If regression finds expected drift on pages outside feature QA's scope (e.g., global CSS change), the orchestrator approves those too and re-runs.
- On demand: `/qa-regression` can be run manually to verify baseline freshness at any time.
Staleness Detection
A baseline is stale when it no longer matches the current state of the app on main. The regression agent detects this automatically — any baseline that fails comparison on a clean main build is stale by definition.
Recovery: Run regression against main, identify which baselines fail, re-baseline them. This is the first thing we need to do before our first regression run (the review-panel-with-drafts baseline is known stale from PR #132).
Retention
- Baselines for pages that still exist: keep indefinitely
- Baselines for pages that were removed: prune when detected (regression agent can't navigate to the URL → report as "unreachable" → orchestrator deletes)
- No automatic expiry — baselines are cheap (PNGs on disk)
QA Evidence Folder Growth
qa-evidence/pr-*/ folders accumulate in the repo. Retention policy TBD — likely prune after PR merges, or keep only the last N. Not blocking for now.
Evidence and Reporting
On PASS
Lightweight PR comment — no screenshot gallery. Feature QA already provides the detailed visual evidence. Regression's PASS is a confidence stamp:
## Regression QA — PASS
- **Baselines checked:** 6/6
- **All match within tolerance**
- **No unexpected drift detected**
Posted as a PR comment alongside the feature QA evidence. Short, scannable, sufficient.
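Rendering that comment is a string template over two counts from the report. A sketch (the function name and argument shape are assumptions):

```javascript
// Render the lightweight PASS comment shown above from the report counts.
function renderPassComment(checked, total) {
  return [
    "## Regression QA — PASS",
    `- **Baselines checked:** ${checked}/${total}`,
    "- **All match within tolerance**",
    "- **No unexpected drift detected**",
  ].join("\n");
}
```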
On ISSUES_FOUND
Report to orchestrator only — no PR comment (same policy as feature QA). Evidence includes:
- Which baseline(s) drifted
- Diff score for each
- Screenshot of current state (file path)
- The baseline it was compared against
The orchestrator triages: real bug → fire fix agent, or expected cascade → approve baselines and re-run.
On NEEDS_HUMAN
When the agent can't determine if drift is intentional or a regression. Escalation path: orchestrator presents the regression report (with screenshots) to the human for classification.
QA Report Format
## Verdict: PASS | ISSUES_FOUND | NEEDS_HUMAN
## Baseline Results
| Spec | URL | Match Score | Verdict | Notes |
|------|-----|-------------|---------|-------|
| homepage | http://localhost:3001/ | 1.000000 | pass | |
| sample-doc | http://localhost:3001/docs/... | 0.999998 | pass | |
## Findings (if any)
### [Finding — severity: high/medium/low]
- **What:** [description of the regression]
- **Where:** [page/URL]
- **Evidence:** [screenshot path, diff score]
- **Baseline spec:** [which baseline drifted]
## Coverage
- **Baselines checked:** [N of M]
- **Baselines skipped:** [any unreachable and why]
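The Baseline Results table can be generated mechanically from the comparison results. A sketch, assuming these field names rather than any fixed schema:

```javascript
// Render the "Baseline Results" markdown table from per-baseline results.
// Scores are fixed to six decimals to match the report format above.
function renderBaselineRows(results) {
  const header =
    "| Spec | URL | Match Score | Verdict | Notes |\n" +
    "|------|-----|-------------|---------|-------|";
  const rows = results.map(r =>
    `| ${r.spec} | ${r.url} | ${r.score.toFixed(6)} | ${r.verdict} | ${r.notes || ""} |`
  );
  return [header, ...rows].join("\n");
}
```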
Test Environment
Decision: Always boot a fresh test-env. Never reuse long-running instances.
The first regression run (2026-04-16) ran against a 3-day-old Docker container. This meant the code under test was stale — any commits since boot were invisible to the QA agent. Fresh builds guarantee current code + clean seed data.
For PR-based Regression (after feature QA)
- Orchestrator boots test-env via `qa.sh <branch>` → fresh container on a random port
- Feature QA runs against that URL
- Baselines approved
- Regression QA runs against the same URL — no additional boot needed
- After regression completes, orchestrator tears down via `qa-cleanup.sh <branch>`
For Regression on Main
Same script, just targeting main:
test-env/scripts/qa.sh main # boot fresh from current main
test-env/scripts/qa-cleanup.sh main # tear down after
No long-running smoke instance. Every regression run is against a freshly built container.
Seed Data Dependency
Baselines assume deterministic seed data. The test-env boots with the same fixtures every time — this was validated in session 1 (0 diff pixels across consecutive boots). As long as seed fixtures in test-env/seed/ don't change, baselines remain valid.
If seed data changes (new fixtures, modified content), ALL baselines will drift and need re-approval. This is expected and correct — the regression agent will flag it, the orchestrator approves the batch.
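That "everything drifts at once" signature is distinguishable from isolated regressions. A hedged heuristic sketch the orchestrator could use for triage (function name and fields are assumptions):

```javascript
// Classify a drift pattern: if every baseline drifted simultaneously,
// the likely cause is a global change (seed data, site-wide CSS) and a
// batch re-approval may apply; partial drift means individual triage.
function classifyDriftPattern(results, tolerance = 0.01) {
  const drifted = results.filter(r => 1 - r.score > tolerance);
  if (drifted.length === 0) return "none";
  if (drifted.length === results.length) return "global";
  return "partial";
}
```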
Dynamic Content Masking
Decision: Mask dynamic text content via run_script before screenshotting. Never mask structural elements.
The first regression run (2026-04-16) revealed that relative timestamps in the review panel ("3 days ago" vs "6 hours ago") cause ~3-4% pixel drift on every authenticated baseline. This is a systemic false positive — the seed data has fixed created_at dates, but the UI renders relative timestamps that change with wall-clock time.
Masking Rules
- Only mask text content — replace the text string, never hide/remove/resize elements
- Apply after page load, before screenshot — via `run_script`
- Document every mask in the report — the "Masks Applied" section ensures the orchestrator knows exactly what was masked
- If unsure, don't mask — let it fail and report it. False negatives (missed regressions) are worse than false positives (noisy reports)
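The text replacement itself can be a small pure function. A minimal sketch covering the relative formats seen in the first run ("6 hours ago", "3 days ago"); the selector in the comment is hypothetical, not the app's actual markup:

```javascript
// Replace "N <unit>(s) ago" with a fixed token so wall-clock drift never
// changes the rendered pixels. Only the text changes; the element itself
// is untouched (no hide/remove/resize).
function maskRelativeTime(text) {
  return text.replace(
    /\b\d+\s+(second|minute|hour|day|week|month|year)s?\s+ago\b/g,
    "MASKED ago"
  );
}

// Inside run_script (browser context), apply it to the timestamp elements,
// e.g. (selector is an assumption — use whatever the app renders):
//   document.querySelectorAll('[data-testid="timestamp"]').forEach(el => {
//     el.textContent = maskRelativeTime(el.textContent);
//   });
```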
What Masking Preserves
Masking timestamps still catches:
- Element missing or mispositioned (layout shift = pixel diff)
- Wrong font, size, or color (style properties aren't masked)
- Container breaking layout (structural change)
- Element added or removed (structural change)
The only thing hidden is "did the exact time text change" — which is the one thing that's expected to change.
Alternative Considered: Freezing Date.now()
Could freeze the JS clock so relative timestamps compute to a fixed value. Cleaner than text replacement since it works regardless of timestamp format. But risks breaking other time-dependent UI behavior (animations, debouncing, polling). DOM text masking is more surgical and less risky.
Alternative Considered: Regenerating Timestamps to Current Time
Could re-seed with created_at = now() at boot. But timestamps drift within a single run (annotation created "1 minute ago" becomes "3 minutes ago" by the time the agent reaches the 3rd baseline). Less deterministic than masking.
First Regression Run — 2026-04-16
First end-to-end regression sweep against Foundry's baseline suite on main.
Results
| Spec | Match Score | Verdict | Notes |
|---|---|---|---|
| homepage | 0.9944 | fail at 0.1% tolerance, pass at 1% | Anti-aliasing noise, no visible regression |
| sample-doc | 1.000000 | pass | Perfect match |
| sample-doc-full | 1.000000 | pass | Perfect match |
| review-panel-thread-expanded | 0.9694 | fail | Relative timestamps changed |
| review-panel-reply-buttons-right-aligned | 0.9687 | fail | Relative timestamps changed |
| review-panel-with-drafts | 0.9574 | fail | Timestamps + recently re-baselined |
Overall verdict: NEEDS_HUMAN — no real regressions found, but the timestamp false positives made it impossible to auto-PASS.
What We Learned
- Unauthenticated baselines are pixel-perfect — 0 diff pixels on `sample-doc` and `sample-doc-full`. The test-env's deterministic seed claim holds.
- Authenticated baselines drift ~3-4% from relative timestamps — systematic false positive across all review-panel baselines. Solved by DOM text masking.
- Homepage has ~0.5% anti-aliasing noise — passes at 1% tolerance. Possibly from font rendering variance between headless Chromium sessions.
- `navigate` resets page context — auth state (localStorage) is lost on each navigation. Must re-authenticate after navigating to a new URL. Efficient approach: group baselines by auth state.
- Baseline setup complexity varies widely — some need just a URL, others need auth + modal + theme + draft creation. The baseline catalog in the Foundry template captures this.
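The auth-grouping optimization is a simple reordering. A sketch, where `authRequired` is an assumed property of the baseline spec:

```javascript
// Order baselines so one login covers the whole authenticated block:
// all unauthenticated baselines first, then all authenticated ones,
// preserving relative order within each group.
function orderByAuthState(baselines) {
  const unauth = baselines.filter(b => !b.authRequired);
  const auth = baselines.filter(b => b.authRequired);
  return [...unauth, ...auth];
}
```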
New Baselines Added
- `settings-modal` — Settings modal (Light/Dark/System, TTS toggle, auth status)
- `search-modal` — Search modal (empty state)
- `sample-doc-dark-mode` — Sample doc page in dark theme (representative dark-mode baseline)
Suite grew from 6 to 9 baselines.
Integration with Feature QA
Decision: Sequential, not parallel.
Pipeline Sequence
Orchestrator receives feature request
→ Implementation agent writes code, opens PR
→ Feature QA agent verifies the specific change
→ On PASS: orchestrator approves updated baselines
→ Regression QA agent sweeps all baselines
→ On PASS: orchestrator merges
→ On ISSUES_FOUND: orchestrator fires fix agent → loop
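The sequence above can be sketched as a pure decision function. The verdict strings match the report format; the stage and action names are assumptions for illustration:

```javascript
// Given the stage that just reported and its verdict, return the
// orchestrator's next action in the sequential pipeline.
function nextStep(stage, verdict) {
  if (stage === "feature" && verdict === "PASS") return "approve-baselines-then-regression";
  if (stage === "regression" && verdict === "PASS") return "merge";
  if (verdict === "ISSUES_FOUND") return "fire-fix-agent"; // then loop
  if (verdict === "NEEDS_HUMAN") return "escalate-to-human";
  return "halt";
}
```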
Why Sequential
- Eliminates drift classification — baselines are current when regression runs
- No MCP tool isolation issues — only one agent active at a time
- Simpler orchestrator logic — no need to reconcile conflicting reports
- The ~2-5 min overhead for baseline approval is worth the simplicity
Future: Parallel Regression Agents
When baseline counts grow, the orchestrator can shard regression across multiple agents (see Trigger Policy > Scaling Strategy). Each agent gets a subset of baselines. All run in parallel against the same test-env (read-only, no conflict).
Constraint for true parallelism (regression + feature QA simultaneously): Foundry MCP tools are hardcoded to one backend. Crucible tools are fine (URL is per-call). If we ever need feature QA and regression running at the same time, we'd need per-agent MCP config or a routing proxy. The sequential model sidesteps this entirely. Noted as a future constraint.
Scope
In Scope
- Sweep all baselines for a project
- Report drift with evidence
- Recommend baseline updates for intentional drift
- Run against branch-based or main test-env
Out of Scope
- Approving baselines (orchestrator responsibility)
- Fixing regressions (fix agent's job)
- Running unit or integration tests (vitest/jest own that layer)
- Multi-project regression (one project per run for v1)