Foundry

Regression QA Pipeline

Operational design for Crucible's regression sweep — when it runs, how it handles drift, what evidence it produces, and how baselines are maintained.

Status: Drafting · Parent: Crucible Design Doc · Created: 2026-04-16


Overview

The regression QA agent sweeps all stored baselines for a project to catch unintended visual drift. It runs alongside (not instead of) the feature QA agent. Where feature QA verifies "does the new thing work?", regression QA verifies "did the new thing break anything else?"

The baseline store IS the regression suite — list_baselines returns everything the agent needs to check. No manual test list to maintain.

Trigger Policy

Decision: Sequential — feature QA first, then regression.

The pipeline runs in this order:

  1. Feature QA agent runs against the PR branch test-env
  2. Feature QA passes → orchestrator reviews the report and approves updated baselines
  3. Regression QA agent runs against the same test-env, now with fresh baselines
  4. Both reports feed the orchestrator's merge/fix decision

This eliminates the "intentional vs. unintentional drift" problem entirely. By the time regression runs, all baselines are current. Any drift regression finds is a real regression — no ambiguity, no classification step.

The trade-off is wall time: regression waits for feature QA + baseline approval (~2-5 min). Worth it — a few minutes of testing saves hours of rework.

Scaling Strategy

As the baseline count grows (15-20+), a single regression agent becomes slow and risks context window bloat from accumulated screenshots. The solution is sharding:

  • The regression prompt accepts an optional baselines list parameter
  • If omitted, the agent discovers all baselines via list_baselines
  • If provided, it only checks those baselines
  • The orchestrator shards by splitting the full baseline list across N agents
  • Target: ~10-15 baselines per agent (tuned by context window pressure, not time)

Defer building the sharding orchestration until we hit the pain point. Design the template to accept the parameter now.
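
As a rough illustration, the sharding step itself is just list chunking. The 12-per-agent default below is a placeholder inside the 10-15 target range, and the function name is hypothetical:

```python
def shard_baselines(baselines: list[str], max_per_agent: int = 12) -> list[list[str]]:
    """Split the full baseline list into chunks of at most max_per_agent specs,
    one chunk per regression agent (passed via the optional baselines parameter)."""
    return [baselines[i:i + max_per_agent]
            for i in range(0, len(baselines), max_per_agent)]

# Example: 23 baselines -> two agents, one with 12 specs and one with 11.
specs = [f"baseline-{n}" for n in range(23)]
for shard in shard_baselines(specs):
    print(len(shard), shard[:2])
```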

Manual Triggers

Two skills for ad-hoc use:

  • /qa-regression — boots test-env if needed, runs regression sweep against current state, reports back. Good for post-deploy verification or confidence checks.
  • /qa-feature — takes a PR number or branch name, boots the branch test-env, runs feature QA, reports back. Good for re-running QA after fixes.

Both are Foundry-specific initially (hardcoded project="foundry") and generalizable once the adapter format lands. Build them after validating the pipeline end-to-end.

Drift Semantics

Decision: No drift classification needed — baselines are always current when regression runs.

The sequential pipeline (feature QA → approve baselines → regression) means the regression agent never encounters intentional drift. By the time it runs, the orchestrator has already approved any baseline updates that the feature change required.

This makes the regression agent's logic dead simple: does everything match? Yes or no. No nuance, no "known-affected baselines" hint list, no classification step. Any diff that exceeds tolerance is a real regression.
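
A minimal sketch of that pass/fail check, assuming the match score is the 0-1 similarity value used in the report tables and tolerance is expressed as a fraction (the 1% default mirrors the tolerance mentioned for the homepage baseline later in this doc):

```python
def baseline_verdict(match_score: float, tolerance: float = 0.01) -> str:
    """Pass iff the score is within tolerance of a perfect match; anything
    below the threshold is reported as a real regression, no classification."""
    return "pass" if match_score >= 1.0 - tolerance else "fail"

print(baseline_verdict(0.999998))  # pass
print(baseline_verdict(0.9574))    # fail -> ISSUES_FOUND
```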

Why This Works

  • Feature QA catches whether the new thing works correctly
  • The orchestrator reviews feature QA's report and approves baseline updates for pages that intentionally changed
  • Regression QA then verifies that nothing else broke — with a clean set of baselines that already reflect the approved changes

Edge Case: Cascading Visual Changes

A feature change might affect pages that weren't in the feature QA's scope. For example, a global CSS change to font size would affect every baselined page. In this case:

  • Feature QA passes (the targeted change looks correct)
  • Orchestrator approves baselines for the pages feature QA checked
  • Regression finds drift on OTHER pages that weren't in feature QA's scope
  • Regression reports ISSUES_FOUND — orchestrator triages: is this the expected cascade, or an actual bug?
  • If expected cascade: orchestrator approves those baselines too, re-runs regression
  • If bug: orchestrator fires a fix agent

This is the one scenario where regression might need a second pass. Acceptable — it only happens with broad visual changes.
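
A sketch of that triage decision, assuming the orchestrator has already judged each drifted baseline as expected cascade or not (the finding shape is illustrative):

```python
# Drifted baselines from the regression report; `expected` marks drift the
# orchestrator attributes to the intended cascade (e.g. a global CSS change).
findings = [
    {"spec": "homepage", "expected": True},
    {"spec": "settings-modal", "expected": True},
]

bugs = [f["spec"] for f in findings if not f["expected"]]
cascade = [f["spec"] for f in findings if f["expected"]]

if bugs:
    print("fire fix agent for:", bugs)
elif cascade:
    print("approve baselines for:", cascade, "then re-run regression")
else:
    print("clean pass -> merge")
```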

Baseline Hygiene

Baselines are maintained as part of the pipeline flow, not as a separate maintenance task.

When Baselines Get Updated

  • During feature QA: The feature QA agent identifies pages that changed. The orchestrator approves updated baselines before regression runs.
  • During regression (cascade): If regression finds expected drift on pages outside feature QA's scope (e.g., global CSS change), the orchestrator approves those too and re-runs.
  • On demand: /qa-regression can be run manually to verify baseline freshness at any time.

Staleness Detection

A baseline is stale when it no longer matches the current state of the app on main. The regression agent detects this automatically — any baseline that fails comparison on a clean main build is stale by definition.

Recovery: Run regression against main, identify which baselines fail, re-baseline them. This is the first thing we need to do before our first regression run (the review-panel-with-drafts baseline is known stale from PR #132).

Retention

  • Baselines for pages that still exist: keep indefinitely
  • Baselines for pages that were removed: prune when detected (regression agent can't navigate to the URL → report as "unreachable" → orchestrator deletes)
  • No automatic expiry — baselines are cheap (PNGs on disk)
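
The pruning rule reduces to a filter over the regression results; the result shape here is assumed for illustration, not a real report schema:

```python
results = [
    {"spec": "homepage", "status": "pass"},
    {"spec": "old-landing-page", "status": "unreachable"},  # page no longer exists
]

to_prune = [r["spec"] for r in results if r["status"] == "unreachable"]
print("orchestrator deletes:", to_prune)  # all other baselines are kept indefinitely
```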

QA Evidence Folder Growth

qa-evidence/pr-*/ folders accumulate in the repo. Retention policy TBD — likely prune after PR merges, or keep only the last N. Not blocking for now.

Evidence and Reporting

On PASS

Lightweight PR comment — no screenshot gallery. Feature QA already provides the detailed visual evidence. Regression's PASS is a confidence stamp:

## Regression QA — PASS
- **Baselines checked:** 6/6
- **All match within tolerance**
- **No unexpected drift detected**

Posted as a PR comment alongside the feature QA evidence. Short, scannable, sufficient.

On ISSUES_FOUND

Report to orchestrator only — no PR comment (same policy as feature QA). Evidence includes:

  • Which baseline(s) drifted
  • Diff score for each
  • Screenshot of current state (file path)
  • The baseline it was compared against

The orchestrator triages: real bug → fire fix agent, or expected cascade → approve baselines and re-run.
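
A sketch of the per-finding evidence record, assuming a flat structure; the field names and paths are illustrative, since the doc fixes only what the evidence contains, not a schema:

```python
from dataclasses import dataclass

@dataclass
class DriftFinding:
    baseline_spec: str     # which baseline drifted
    match_score: float     # diff score against the stored baseline
    screenshot_path: str   # current state captured during the sweep
    baseline_path: str     # the baseline image it was compared against

# Hypothetical example; paths are placeholders, not real repo locations.
finding = DriftFinding(
    baseline_spec="review-panel-with-drafts",
    match_score=0.9574,
    screenshot_path="qa-evidence/example/review-panel-with-drafts.png",
    baseline_path="baselines/review-panel-with-drafts.png",
)
print(finding)
```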

On NEEDS_HUMAN

When the agent can't determine if drift is intentional or a regression. Escalation path: orchestrator presents the regression report (with screenshots) to the human for classification.

QA Report Format

## Verdict: PASS | ISSUES_FOUND | NEEDS_HUMAN

## Baseline Results
| Spec | URL | Match Score | Verdict | Notes |
|------|-----|-------------|---------|-------|
| homepage | http://localhost:3001/ | 1.000000 | pass | |
| sample-doc | http://localhost:3001/docs/... | 0.999998 | pass | |

## Findings (if any)
### [Finding — severity: high/medium/low]
- **What:** [description of the regression]
- **Where:** [page/URL]
- **Evidence:** [screenshot path, diff score]
- **Baseline spec:** [which baseline drifted]

## Coverage
- **Baselines checked:** [N of M]
- **Baselines skipped:** [any unreachable and why]

Test Environment

Decision: Always boot a fresh test-env. Never reuse long-running instances.

The first regression run (2026-04-16) ran against a 3-day-old Docker container. This meant the code under test was stale — any commits since boot were invisible to the QA agent. Fresh builds guarantee current code + clean seed data.

For PR-based Regression (after feature QA)

  1. Orchestrator boots test-env via qa.sh <branch> → fresh container on a random port
  2. Feature QA runs against that URL
  3. Baselines approved
  4. Regression QA runs against the same URL — no additional boot needed
  5. After regression completes, orchestrator tears down via qa-cleanup.sh <branch>
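
A minimal sketch of the boot/run/teardown bracket around those scripts; only qa.sh and qa-cleanup.sh come from the test-env, and how the agents are launched in between is deliberately elided here:

```python
import subprocess

def run_pr_qa(branch: str) -> None:
    # Fresh container on a random port for this branch.
    subprocess.run(["test-env/scripts/qa.sh", branch], check=True)
    try:
        ...  # feature QA -> orchestrator approves baselines -> regression QA, same URL
    finally:
        # Always tear down, even on failure, so containers never linger between runs.
        subprocess.run(["test-env/scripts/qa-cleanup.sh", branch], check=True)
```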

For Regression on Main

Same script, just targeting main:

test-env/scripts/qa.sh main          # boot fresh from current main
test-env/scripts/qa-cleanup.sh main  # tear down after

No long-running smoke instance. Every regression run is against a freshly built container.

Seed Data Dependency

Baselines assume deterministic seed data. The test-env boots with the same fixtures every time — this was validated in session 1 (0 diff pixels across consecutive boots). As long as seed fixtures in test-env/seed/ don't change, baselines remain valid.

If seed data changes (new fixtures, modified content), ALL baselines will drift and need re-approval. This is expected and correct — the regression agent will flag it, the orchestrator approves the batch.

Dynamic Content Masking

Decision: Mask dynamic text content via run_script before screenshotting. Never mask structural elements.

The first regression run (2026-04-16) revealed that relative timestamps in the review panel ("3 days ago" vs "6 hours ago") cause ~3-4% pixel drift on every authenticated baseline. This is a systemic false positive — the seed data has fixed created_at dates, but the UI renders relative timestamps that change with wall-clock time.

Masking Rules

  1. Only mask text content — replace the text string, never hide/remove/resize elements
  2. Apply after page load, before screenshot — via run_script
  3. Document every mask in the report — the "Masks Applied" section ensures the orchestrator knows exactly what was hidden
  4. If unsure, don't mask — let it fail and report it. False negatives (missed regressions) are worse than false positives (noisy reports)
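
For illustration, the mask is just a text swap applied in-page before the screenshot. The selectors below are assumptions about the markup, not confirmed; the script string would be handed to run_script as-is:

```python
# JS payload the QA agent could pass to run_script after page load.
MASK_TIMESTAMPS_JS = """
document.querySelectorAll('time, [data-relative-time]').forEach(el => {
  // Replace only the text; never hide, remove, or resize the element, so
  // layout and style regressions still show up in the pixel diff.
  el.textContent = 'TIMESTAMP';
});
"""

# Every mask gets surfaced in the report's "Masks Applied" section.
masks_applied = ["relative timestamps -> 'TIMESTAMP'"]
```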

What Masking Preserves

Masking timestamps still catches:

  • Element missing or mispositioned (layout shift = pixel diff)
  • Wrong font, size, or color (style properties aren't masked)
  • Container breaking layout (structural change)
  • Element added or removed (structural change)

The only thing hidden is "did the exact time text change" — which is the one thing that's expected to change.

Alternative Considered: Freezing Date.now()

Could freeze the JS clock so relative timestamps compute to a fixed value. Cleaner than text replacement since it works regardless of timestamp format. But risks breaking other time-dependent UI behavior (animations, debouncing, polling). DOM text masking is more surgical and less risky.

Alternative Considered: Regenerating Timestamps to Current Time

Could re-seed with created_at = now() at boot. But timestamps drift within a single run (annotation created "1 minute ago" becomes "3 minutes ago" by the time the agent reaches the 3rd baseline). Less deterministic than masking.

First Regression Run — 2026-04-16

First end-to-end regression sweep against Foundry's baseline suite on main.

Results

| Spec | Match Score | Verdict | Notes |
|------|-------------|---------|-------|
| homepage | 0.9944 | fail (0.1%), pass (1%) | Anti-aliasing noise, no visible regression |
| sample-doc | 1.000000 | pass | Perfect match |
| sample-doc-full | 1.000000 | pass | Perfect match |
| review-panel-thread-expanded | 0.9694 | fail | Relative timestamps changed |
| review-panel-reply-buttons-right-aligned | 0.9687 | fail | Relative timestamps changed |
| review-panel-with-drafts | 0.9574 | fail | Timestamps + recently re-baselined |

Overall verdict: NEEDS_HUMAN — no real regressions found, but the timestamp false positives made it impossible to auto-PASS.

What We Learned

  1. Unauthenticated baselines are pixel-perfect — 0 diff pixels on sample-doc and sample-doc-full. The test-env's deterministic seed claim holds.
  2. Authenticated baselines drift ~3-4% from relative timestamps — systematic false positive across all review-panel baselines. Solved by DOM text masking.
  3. Homepage has ~0.5% anti-aliasing noise — passes at 1% tolerance. Possibly from font rendering variance between headless Chromium sessions.
  4. navigate resets page context — auth state (localStorage) is lost on each navigation. Must re-authenticate after navigating to a new URL. Efficient approach: group baselines by auth state (see the sketch after this list).
  5. Baseline setup complexity varies widely — some need just a URL, others need auth + modal + theme + draft creation. The baseline catalog in the Foundry template captures this.
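
The grouping mentioned in point 4 reduces to a simple partition by auth requirement; the flags below are illustrative, since the real per-baseline setup lives in the Foundry template's baseline catalog:

```python
baselines = [
    {"spec": "homepage", "auth": False},
    {"spec": "sample-doc", "auth": False},
    {"spec": "review-panel-thread-expanded", "auth": True},
    {"spec": "review-panel-with-drafts", "auth": True},
]

# Sweep the unauthenticated pages first (no auth setup at all), then handle the
# authenticated group, re-authenticating after each navigation within it.
unauthenticated = [b["spec"] for b in baselines if not b["auth"]]
authenticated = [b["spec"] for b in baselines if b["auth"]]
print(unauthenticated, authenticated)
```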

New Baselines Added

  • settings-modal — Settings modal (Light/Dark/System, TTS toggle, auth status)
  • search-modal — Search modal (empty state)
  • sample-doc-dark-mode — Sample doc page in dark theme (representative dark-mode baseline)

Suite grew from 6 to 9 baselines.

Integration with Feature QA

Decision: Sequential, not parallel.

Pipeline Sequence

Orchestrator receives feature request
  → Implementation agent writes code, opens PR
  → Feature QA agent verifies the specific change
  → On PASS: orchestrator approves updated baselines
  → Regression QA agent sweeps all baselines
  → On PASS: orchestrator merges
  → On ISSUES_FOUND: orchestrator fires fix agent → loop

Why Sequential

  • Eliminates drift classification — baselines are current when regression runs
  • No MCP tool isolation issues — only one agent active at a time
  • Simpler orchestrator logic — no need to reconcile conflicting reports
  • The ~2-5 min overhead for baseline approval is worth the simplicity

Future: Parallel Regression Agents

When baseline counts grow, the orchestrator can shard regression across multiple agents (see Trigger Policy > Scaling Strategy). Each agent gets a subset of baselines. All run in parallel against the same test-env (read-only, no conflict).

Constraint for true parallelism (regression + feature QA simultaneously): Foundry MCP tools are hardcoded to one backend. Crucible tools are fine (URL is per-call). If we ever need feature QA and regression running at the same time, we'd need per-agent MCP config or a routing proxy. The sequential model sidesteps this entirely. Noted as a future constraint.

Scope

In Scope

  • Sweep all baselines for a project
  • Report drift with evidence
  • Recommend baseline updates for intentional drift
  • Run against branch-based or main test-env

Out of Scope

  • Approving baselines (orchestrator responsibility)
  • Fixing regressions (fix agent's job)
  • Running unit or integration tests (vitest/jest own that layer)
  • Multi-project regression (one project per run for v1)
