Regression QA Pipeline
Operational design for Crucible's regression sweep — when it runs, how it handles drift, what evidence it produces, and how baselines are maintained.
Status: Drafting · Parent: Crucible Design Doc · Created: 2026-04-16
Overview
The regression QA agent sweeps all stored baselines for a project to catch unintended visual drift. It runs alongside (not instead of) the feature QA agent. Where feature QA verifies "does the new thing work?", regression QA verifies "did the new thing break anything else?"
The baseline store IS the regression suite — list_baselines returns everything the agent needs to check. No manual test list to maintain.
Trigger Policy
Decision: Sequential — feature QA first, then regression.
The pipeline runs in this order:
- Feature QA agent runs against the PR branch test-env
- Feature QA passes → orchestrator reviews the report and approves updated baselines
- Regression QA agent runs against the same test-env, now with fresh baselines
- Both reports feed the orchestrator's merge/fix decision
This eliminates the "intentional vs. unintentional drift" problem entirely. By the time regression runs, all baselines are current. Any drift regression finds is a real regression — no ambiguity, no classification step.
The trade-off is wall time: regression waits for feature QA + baseline approval (~2-5 min). Worth it — a few minutes of testing saves hours of rework.
Scaling Strategy
As the baseline count grows (15-20+), a single regression agent becomes slow and risks context window bloat from accumulated screenshots. The solution is sharding:
- The regression prompt accepts an optional `baselines` list parameter
- If omitted, the agent discovers all baselines via `list_baselines`
- If provided, it only checks those baselines
- The orchestrator shards by splitting the full baseline list across N agents
- Target: ~10-15 baselines per agent (tuned by context window pressure, not time)
Defer building the sharding orchestration until we hit the pain point. Design the template to accept the parameter now.
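The split itself is trivial. A minimal sketch, assuming a round-robin policy and these function names (neither is the orchestrator's actual code):

```javascript
// Split the full baseline list into the smallest number of shards that
// keeps every shard at or under the cap, then deal baselines round-robin
// so no single agent gets a long tail.
function shardBaselines(baselines, maxPerShard = 15) {
  if (baselines.length === 0) return [];
  const shardCount = Math.ceil(baselines.length / maxPerShard);
  const shards = Array.from({ length: shardCount }, () => []);
  baselines.forEach((b, i) => shards[i % shardCount].push(b));
  return shards;
}
```

With 23 baselines and a cap of 15, this yields two shards of 12 and 11 rather than one agent with 15 and another with 8.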
Manual Triggers
Two skills for ad-hoc use:
- `/qa-regression` — boots test-env if needed, runs regression sweep against current state, reports back. Good for post-deploy verification or confidence checks.
- `/qa-feature` — takes a PR number or branch name, boots the branch test-env, runs feature QA, reports back. Good for re-running QA after fixes.
Both Foundry-specific initially (hardcoded project="foundry"), generalizable when the adapter format lands. Build after validating the pipeline end-to-end.
Drift Semantics
Decision: No drift classification needed — baselines are always current when regression runs.
The sequential pipeline (feature QA → approve baselines → regression) means the regression agent never encounters intentional drift. By the time it runs, the orchestrator has already approved any baseline updates that the feature change required.
This makes the regression agent's logic dead simple: does everything match? Yes or no. No nuance, no "known-affected baselines" hint list, no classification step. Any diff that exceeds tolerance is a real regression.
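The "dead simple" rule reduces to a one-line filter. A sketch under assumed field names (`spec`, `score`) and the 1% tolerance used elsewhere in this doc:

```javascript
// Regression verdict: every baseline either matches within tolerance or
// it is a real regression. No classification step, no hint lists.
function regressionVerdict(results, tolerance = 0.01) {
  const drifted = results.filter(r => 1 - r.score > tolerance);
  return drifted.length === 0
    ? { verdict: "PASS", drifted: [] }
    : { verdict: "ISSUES_FOUND", drifted: drifted.map(r => r.spec) };
}
```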
Why This Works
- Feature QA catches whether the new thing works correctly
- The orchestrator reviews feature QA's report and approves baseline updates for pages that intentionally changed
- Regression QA then verifies that nothing else broke — with a clean set of baselines that already reflect the approved changes
Edge Case: Cascading Visual Changes
A feature change might affect pages that weren't in the feature QA's scope. For example, a global CSS change to font size would affect every baselined page. In this case:
- Feature QA passes (the targeted change looks correct)
- Orchestrator approves baselines for the pages feature QA checked
- Regression finds drift on OTHER pages that weren't in feature QA's scope
- Regression reports ISSUES_FOUND — orchestrator triages: is this the expected cascade, or an actual bug?
- If expected cascade: orchestrator approves those baselines too, re-runs regression
- If bug: orchestrator fires a fix agent
This is the one scenario where regression might need a second pass. Acceptable — it only happens with broad visual changes.
Baseline Hygiene
Baselines are maintained as part of the pipeline flow, not as a separate maintenance task.
When Baselines Get Updated
- During feature QA: The feature QA agent identifies pages that changed. The orchestrator approves updated baselines before regression runs.
- During regression (cascade): If regression finds expected drift on pages outside feature QA's scope (e.g., global CSS change), the orchestrator approves those too and re-runs.
- On demand: `/qa-regression` can be run manually to verify baseline freshness at any time.
Staleness Detection
A baseline is stale when it no longer matches the current state of the app on main. The regression agent detects this automatically — any baseline that fails comparison on a clean main build is stale by definition.
Recovery: Run regression against main, identify which baselines fail, re-baseline them. This is the first thing we need to do before our first regression run (the review-panel-with-drafts baseline is known stale from PR #132).
Retention
- Baselines for pages that still exist: keep indefinitely
- Baselines for pages that were removed: prune when detected (regression agent can't navigate to the URL → report as "unreachable" → orchestrator deletes)
- No automatic expiry — baselines are cheap (PNGs on disk)
QA Evidence Folder Growth
qa-evidence/pr-*/ folders accumulate in the repo. Retention policy TBD — likely prune after PR merges, or keep only the last N. Not blocking for now.
Evidence and Reporting
On PASS
Lightweight PR comment — no screenshot gallery. Feature QA already provides the detailed visual evidence. Regression's PASS is a confidence stamp:
## Regression QA — PASS
- **Baselines checked:** 6/6
- **All match within tolerance**
- **No unexpected drift detected**
Posted as a PR comment alongside the feature QA evidence. Short, scannable, sufficient.
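Rendering that comment is a string template over two counts from the report. A sketch (the function name and argument shape are assumptions):

```javascript
// Render the lightweight PASS comment shown above from the report counts.
function renderPassComment(checked, total) {
  return [
    "## Regression QA — PASS",
    `- **Baselines checked:** ${checked}/${total}`,
    "- **All match within tolerance**",
    "- **No unexpected drift detected**",
  ].join("\n");
}
```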
On ISSUES_FOUND
Report to orchestrator only — no PR comment (same policy as feature QA). Evidence includes:
- Which baseline(s) drifted
- Diff score for each
- Screenshot of current state (file path)
- The baseline it was compared against
The orchestrator triages: real bug → fire fix agent, or expected cascade → approve baselines and re-run.
On NEEDS_HUMAN
When the agent can't determine if drift is intentional or a regression. Escalation path: orchestrator presents the regression report (with screenshots) to the human for classification.
QA Report Format
## Verdict: PASS | ISSUES_FOUND | NEEDS_HUMAN
## Baseline Results
| Spec | URL | Match Score | Verdict | Notes |
|------|-----|-------------|---------|-------|
| homepage | http://localhost:3001/ | 1.000000 | pass | |
| sample-doc | http://localhost:3001/docs/... | 0.999998 | pass | |
## Findings (if any)
### [Finding — severity: high/medium/low]
- **What:** [description of the regression]
- **Where:** [page/URL]
- **Evidence:** [screenshot path, diff score]
- **Baseline spec:** [which baseline drifted]
## Coverage
- **Baselines checked:** [N of M]
- **Baselines skipped:** [any unreachable and why]
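The Baseline Results table can be generated mechanically from the comparison results. A sketch, assuming these field names rather than any fixed schema:

```javascript
// Render the "Baseline Results" markdown table from per-baseline results.
// Scores are fixed to six decimals to match the report format above.
function renderBaselineRows(results) {
  const header =
    "| Spec | URL | Match Score | Verdict | Notes |\n" +
    "|------|-----|-------------|---------|-------|";
  const rows = results.map(r =>
    `| ${r.spec} | ${r.url} | ${r.score.toFixed(6)} | ${r.verdict} | ${r.notes || ""} |`
  );
  return [header, ...rows].join("\n");
}
```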
Test Environment
Decision: Always boot a fresh test-env. Never reuse long-running instances.
The first regression run (2026-04-16) ran against a 3-day-old Docker container. This meant the code under test was stale — any commits since boot were invisible to the QA agent. Fresh builds guarantee current code + clean seed data.
For PR-based Regression (after feature QA)
- Orchestrator boots test-env via `qa.sh <branch>` → fresh container on a random port
- Feature QA runs against that URL
- Baselines approved
- Regression QA runs against the same URL — no additional boot needed
- After regression completes, orchestrator tears down via `qa-cleanup.sh <branch>`
For Regression on Main
Same script, just targeting main:
test-env/scripts/qa.sh main # boot fresh from current main
test-env/scripts/qa-cleanup.sh main # tear down after
No long-running smoke instance. Every regression run is against a freshly built container.
Seed Data Dependency
Baselines assume deterministic seed data. The test-env boots with the same fixtures every time — this was validated in session 1 (0 diff pixels across consecutive boots). As long as seed fixtures in test-env/seed/ don't change, baselines remain valid.
If seed data changes (new fixtures, modified content), ALL baselines will drift and need re-approval. This is expected and correct — the regression agent will flag it, the orchestrator approves the batch.
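That "everything drifts at once" signature is distinguishable from isolated regressions. A hedged heuristic sketch the orchestrator could use for triage (function name and fields are assumptions):

```javascript
// Classify a drift pattern: if every baseline drifted simultaneously,
// the likely cause is a global change (seed data, site-wide CSS) and a
// batch re-approval may apply; partial drift means individual triage.
function classifyDriftPattern(results, tolerance = 0.01) {
  const drifted = results.filter(r => 1 - r.score > tolerance);
  if (drifted.length === 0) return "none";
  if (drifted.length === results.length) return "global";
  return "partial";
}
```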
Dynamic Content Masking
Decision: Mask dynamic text content via run_script before screenshotting. Never mask structural elements.
The first regression run (2026-04-16) revealed that relative timestamps in the review panel ("3 days ago" vs "6 hours ago") cause ~3-4% pixel drift on every authenticated baseline. This is a systemic false positive — the seed data has fixed created_at dates, but the UI renders relative timestamps that change with wall-clock time.
Masking Rules
- Only mask text content — replace the text string, never hide/remove/resize elements
- Apply after page load, before screenshot — via `run_script`
- Document every mask in the report — the "Masks Applied" section ensures the orchestrator knows exactly what was masked
- If unsure, don't mask — let it fail and report it. False negatives (missed regressions) are worse than false positives (noisy reports)
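The text replacement itself can be a small pure function. A minimal sketch covering the relative formats seen in the first run ("6 hours ago", "3 days ago"); the selector in the comment is hypothetical, not the app's actual markup:

```javascript
// Replace "N <unit>(s) ago" with a fixed token so wall-clock drift never
// changes the rendered pixels. Only the text changes; the element itself
// is untouched (no hide/remove/resize).
function maskRelativeTime(text) {
  return text.replace(
    /\b\d+\s+(second|minute|hour|day|week|month|year)s?\s+ago\b/g,
    "MASKED ago"
  );
}

// Inside run_script (browser context), apply it to the timestamp elements,
// e.g. (selector is an assumption — use whatever the app renders):
//   document.querySelectorAll('[data-testid="timestamp"]').forEach(el => {
//     el.textContent = maskRelativeTime(el.textContent);
//   });
```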
What Masking Preserves
Masking timestamps still catches:
- Element missing or mispositioned (layout shift = pixel diff)
- Wrong font, size, or color (style properties aren't masked)
- Container breaking layout (structural change)
- Element added or removed (structural change)
The only thing hidden is "did the exact time text change" — which is the one thing that's expected to change.
Alternative Considered: Freezing Date.now()
Could freeze the JS clock so relative timestamps compute to a fixed value. Cleaner than text replacement since it works regardless of timestamp format. But risks breaking other time-dependent UI behavior (animations, debouncing, polling). DOM text masking is more surgical and less risky.
Alternative Considered: Regenerating Timestamps to Current Time
Could re-seed with created_at = now() at boot. But timestamps drift within a single run (annotation created "1 minute ago" becomes "3 minutes ago" by the time the agent reaches the 3rd baseline). Less deterministic than masking.
First Regression Run — 2026-04-16
First end-to-end regression sweep against Foundry's baseline suite on main.
Results
| Spec | Match Score | Verdict | Notes |
|---|---|---|---|
| homepage | 0.9944 | fail at 0.1% tolerance, pass at 1% | Anti-aliasing noise, no visible regression |
| sample-doc | 1.000000 | pass | Perfect match |
| sample-doc-full | 1.000000 | pass | Perfect match |
| review-panel-thread-expanded | 0.9694 | fail | Relative timestamps changed |
| review-panel-reply-buttons-right-aligned | 0.9687 | fail | Relative timestamps changed |
| review-panel-with-drafts | 0.9574 | fail | Timestamps + recently re-baselined |
Overall verdict: NEEDS_HUMAN — no real regressions found, but the timestamp false positives made it impossible to auto-PASS.
What We Learned
- Unauthenticated baselines are pixel-perfect — 0 diff pixels on `sample-doc` and `sample-doc-full`. The test-env's deterministic seed claim holds.
- Authenticated baselines drift ~3-4% from relative timestamps — systematic false positive across all review-panel baselines. Solved by DOM text masking.
- Homepage has ~0.5% anti-aliasing noise — passes at 1% tolerance. Possibly from font rendering variance between headless Chromium sessions.
- `navigate` resets page context — auth state (localStorage) is lost on each navigation. Must re-authenticate after navigating to a new URL. Efficient approach: group baselines by auth state.
- Baseline setup complexity varies widely — some need just a URL, others need auth + modal + theme + draft creation. The baseline catalog in the Foundry template captures this.
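The auth-grouping optimization is a simple reordering. A sketch, where `authRequired` is an assumed property of the baseline spec:

```javascript
// Order baselines so one login covers the whole authenticated block:
// all unauthenticated baselines first, then all authenticated ones,
// preserving relative order within each group.
function orderByAuthState(baselines) {
  const unauth = baselines.filter(b => !b.authRequired);
  const auth = baselines.filter(b => b.authRequired);
  return [...unauth, ...auth];
}
```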
New Baselines Added
- `settings-modal` — Settings modal (Light/Dark/System, TTS toggle, auth status)
- `search-modal` — Search modal (empty state)
- `sample-doc-dark-mode` — Sample doc page in dark theme (representative dark-mode baseline)
Suite grew from 6 to 9 baselines.
Integration with Feature QA
Decision: Sequential, not parallel.
Pipeline Sequence
Orchestrator receives feature request
→ Implementation agent writes code, opens PR
→ Feature QA agent verifies the specific change
→ On PASS: orchestrator approves updated baselines
→ Regression QA agent sweeps all baselines
→ On PASS: orchestrator merges
→ On ISSUES_FOUND: orchestrator fires fix agent → loop
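The sequence above can be sketched as a pure decision function. The verdict strings match the report format; the stage and action names are assumptions for illustration:

```javascript
// Given the stage that just reported and its verdict, return the
// orchestrator's next action in the sequential pipeline.
function nextStep(stage, verdict) {
  if (stage === "feature" && verdict === "PASS") return "approve-baselines-then-regression";
  if (stage === "regression" && verdict === "PASS") return "merge";
  if (verdict === "ISSUES_FOUND") return "fire-fix-agent"; // then loop
  if (verdict === "NEEDS_HUMAN") return "escalate-to-human";
  return "halt";
}
```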
Why Sequential
- Eliminates drift classification — baselines are current when regression runs
- No MCP tool isolation issues — only one agent active at a time
- Simpler orchestrator logic — no need to reconcile conflicting reports
- The ~2-5 min overhead for baseline approval is worth the simplicity
Future: Parallel Regression Agents
When baseline counts grow, the orchestrator can shard regression across multiple agents (see Trigger Policy > Scaling Strategy). Each agent gets a subset of baselines. All run in parallel against the same test-env (read-only, no conflict).
Constraint for true parallelism (regression + feature QA simultaneously): Foundry MCP tools are hardcoded to one backend. Crucible tools are fine (URL is per-call). If we ever need feature QA and regression running at the same time, we'd need per-agent MCP config or a routing proxy. The sequential model sidesteps this entirely. Noted as a future constraint.
Scope
In Scope
- Sweep all baselines for a project
- Report drift with evidence
- Recommend baseline updates for intentional drift
- Run against branch-based or main test-env
Out of Scope
- Approving baselines (orchestrator responsibility)
- Fixing regressions (fix agent's job)
- Running unit or integration tests (vitest/jest own that layer)
- Multi-project regression (one project per run for v1)