Crucible — Project-Agnostic Test Harness & Visual Regression MCP
One-line pitch: An agentic QA framework that gives AI agents eyes. Crucible provides visual verification, baseline management, and repeatable test environments — while agents use your app's own MCP tools (or generic browser tools as a fallback) for interaction. The agent is the test runner.
Status: Refining — v0.1 "Eyes Only" validated end-to-end, reframing vision based on agentic QA insight.
Created: 2026-04-11
Updated: 2026-04-13 — major reframe from "Playwright wrapper + spec runner" to "visual verification layer for agentic QA"
Third package in the @claymore-dev suite, alongside Foundry (doc review) and Anvil (semantic search).
Scope & Roadmap
v0.1 — "Eyes Only" ✅
- 4 MCP tools: `navigate`, `screenshot_page`, `compare_screenshots`, `approve_baseline`
- Hardcoded for Foundry test-env, no adapter format
- Validated 2026-04-13: Full eyes flow working end-to-end via live MCP. Pixel-deterministic — 0 diff pixels across consecutive screenshots of the same page. Two baselines established (homepage, sample-doc-full).
v0.2 — "Harness + Fallback Browser"
- E2 (Harness MVP): `boot_project`, `teardown_project`, `seed`, `healthcheck`
- E4 (Adapter format): YAML project adapter schema (a sketch of a possible shape follows this list)
- Generic browser fallback tools: `click`, `type`, `scroll`, `wait_for`
- Success criterion: `boot_project("foundry")` → harnessed instance → agent uses Foundry MCP tools to interact + Crucible eyes to verify → `teardown_project`
v0.3 — "Agentic QA"
- E6 (Parallel isolation): multiple agents can run Crucible simultaneously
- E5 (QA Prompt/Skill): a reusable prompt template or Claude Code skill that orchestrates the full QA flow — "verify this branch against these baselines using available MCP tools"
- E8 (Foundry dogfood): Foundry's review panel, annotation threading, and markdown rendering are covered by agentic QA
- Success criterion: agent receives a QA prompt, autonomously explores Foundry using Foundry MCP tools + Crucible eyes, and produces a structured pass/fail report with visual evidence
v1.0 — "Second Consumer"
- Onboard a second project (likely Routr or GMPPU race strategy)
- Freeze the adapter format
- Publish `@claymore-dev/crucible` on npm
- Success criterion: a non-Foundry project's first agentic QA run produces useful results with no changes to Crucible core
Non-Goals (v1)
- Not a unit test framework — vitest/jest/etc. own that layer
- Not a Playwright replacement — Crucible exposes browser tools as a fallback, not as a primary interface. Apps should build their own MCP tools for the best experience.
- Not a script runner — no deterministic step-by-step spec execution. The agent uses judgment, not a script.
- Not a CI service — it's invoked locally by agents; CI integration is an optional extension
- Not a visual design tool — no Figma-style compare, just pixel diffs
- Not cross-device testing — desktop browsers only for v1 (mobile emulation is a fast-follow)
Competitive Landscape
Researched 2026-04-13. The agentic QA space is crowded with "AI writes scripts faster" but almost empty on "AI autonomously explores and finds issues."
| Tool / Category | What It Does | Agentic? | Visual? | MCP-Aware? |
|---|---|---|---|---|
| QA Wolf, Momentic, Carbonate | AI translates natural language → Playwright scripts | No — AI is the author, not the tester | No | No |
| Applitools, Percy, Chromatic | Visual regression diffing | No — verification layer only, no interaction | Yes | No |
| Octomind | AI discovers and generates e2e tests by crawling | Partial — outputs scripted tests | No | No |
| LaVague, BrowserUse | LLM-driven browser agents (see page, decide action) | Yes — but generic browser-level, no domain tools | Screenshot-based | No |
| Playwright MCP (Microsoft) | MCP server wrapping Playwright browser control | Infrastructure only | No | Yes |
| Crucible | Visual verification + baseline management + harness, designed for agents using app-specific MCP tools | Yes — agent-as-test-runner with domain-tool interaction | Yes — core | Yes — native |
The gap Crucible fills: No existing tool combines app-specific MCP tools for semantic interaction + visual verification + agent judgment in an autonomous QA loop. The pieces exist separately; nobody has composed them.
Open Questions
Most original open questions were resolved in the 2026-04-13 design review. Remaining:
- QA prompt template validation: The markdown template needs to be tested with a real Foundry QA run. Does the agent get enough guidance from the template, or does it need more structure? First test will tell us.
- Report format portability: The QA report is markdown now. Should it also have a structured JSON representation for programmatic consumption by orchestrators? Or is markdown parsing good enough?
- Baseline approval UX for humans: When the QA agent proposes baseline updates, how does the human review them? Just look at the PNGs on disk? A diff viewer? This matters once baselines start accumulating.
- Multi-page QA sessions: How does the agent decide which pages to visit during exploratory QA? Does the prompt template need to list known routes, or can the agent discover them from navigation/sitemap?
Risks & Constraints
Security Model
AI Interface Architecture
This section is the whole point of the project.
The Interaction Model — Three Layers
Agentic QA involves three layers of Crucible tooling, plus a fourth layer owned by the app itself. Crucible's design is built around where each layer's boundary falls:
| Layer | What It Does | Who Owns It | Examples |
|---|---|---|---|
| Eyes (visual verification) | Screenshot, diff, baseline management, verdicts | Crucible (core) | screenshot_page, compare_screenshots, approve_baseline |
| Harness (environment) | Boot containerized env, seed, teardown, healthcheck | Crucible (core) | boot_project, teardown_project, seed |
| Generic browser interaction | Navigate, click, type, scroll, wait | Crucible (fallback) | navigate, click, type, scroll |
| Domain-specific interaction | Semantically meaningful app actions | The app's own MCP tools | Foundry's create_annotation, submit_review |
The principle: Crucible is most powerful when agents interact through domain-specific MCP tools — they operate at the semantic level of the application, not the pixel level. An agent that calls create_annotation("This needs review") understands what it's doing in a way that click(452, 318) never can.
But not every app has MCP tools. Crucible's generic browser interaction layer (navigate, click, type, scroll) exists as a fallback — training wheels that make Crucible useful on day one for any web app with a URL. As apps mature into their own MCP tools, the agent naturally shifts to those.
App Maturity Spectrum
| App Maturity | Interaction Method | QA Fidelity | Setup Required |
|---|---|---|---|
| Has rich MCP tools (e.g. Foundry) | Agent uses app's own MCP tools | Highest — semantic interaction, the agent understands what it's doing | App MCP server + Crucible |
| Has some MCP tools | Mix of app tools + Crucible browser fallback | High — semantic where possible, pixel-level where not | Partial app MCP + Crucible |
| No MCP tools, just a URL | Crucible's generic browser tools (click, type, scroll) | Moderate — works but fragile, selector-dependent | Crucible only |
MCP Tool Surface
Eyes (core — always available):
| Tool | Purpose |
|---|---|
| `navigate` | Go to a URL. Launches browser on first call. Returns final URL + status. |
| `screenshot_page` | Full-page or viewport PNG screenshot. Cached in-session for subsequent compare. |
| `screenshot_element` | Element-scoped screenshot by selector. |
| `compare_screenshots` | Diff a screenshot against a stored baseline. Returns match score + verdict. |
| `approve_baseline` | Write current screenshot as the baseline for a project/spec. |
| `list_baselines` | List baselines for a project/spec. |
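For orientation, here is a minimal sketch of how `compare_screenshots` could compute its match score with pixelmatch, the diff library used in E1. The tolerance value, return shape, and diff-output path are illustrative, not the shipped API:

```typescript
import fs from "node:fs";
import { PNG } from "pngjs";
import pixelmatch from "pixelmatch";

// Sketch of a baseline comparison, assuming pixelmatch + pngjs as in E1.
export function compareAgainstBaseline(screenshotPath: string, baselinePath: string, tolerance = 0.001) {
  const current = PNG.sync.read(fs.readFileSync(screenshotPath));
  const baseline = PNG.sync.read(fs.readFileSync(baselinePath));
  const { width, height } = baseline;
  const diff = new PNG({ width, height });

  // Count pixels that differ beyond pixelmatch's per-pixel threshold.
  const diffPixels = pixelmatch(current.data, baseline.data, diff.data, width, height, { threshold: 0.1 });
  const diffRatio = diffPixels / (width * height);

  // Persist the diff image so the agent (or a human) can look at the evidence.
  const diffPngPath = `./.crucible/diffs/${Date.now()}.png`;
  fs.mkdirSync("./.crucible/diffs", { recursive: true });
  fs.writeFileSync(diffPngPath, PNG.sync.write(diff));

  return {
    diffPixels,
    matchScore: 1 - diffRatio,
    verdict: diffRatio <= tolerance ? "match" : "drift",
    diffPngPath,
  };
}
```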
Harness (core — available when project adapters are configured):
| Tool | Purpose |
|---|---|
| `boot_project` | Boot a project by adapter name. Returns handle + entry URL. |
| `teardown_project` | Clean up containers, networks, volumes for a handle. |
| `seed` | Run the project's seed command inside the harnessed container. |
| `healthcheck` | Probe the harnessed app until it responds or times out. |
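A rough sketch of what `boot_project` and `teardown_project` might do under the hood, assuming the Docker-compose harness from the decisions log. The handle format and adapter fields are illustrative:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// Sketch only: assumes the adapter shape from the E4 sketch above.
export async function bootProject(adapter: { name: string; compose: string; entryUrl: string }) {
  // A unique compose project name gives each agent its own containers and networks (E6).
  const handle = `crucible-${adapter.name}-${Date.now()}`;
  await exec("docker", ["compose", "-p", handle, "-f", adapter.compose, "up", "-d", "--wait"]);
  return { handle, entryUrl: adapter.entryUrl };
}

export async function teardownProject(handle: string, composeFile: string) {
  // Remove containers, networks, and volumes created for this handle.
  await exec("docker", ["compose", "-p", handle, "-f", composeFile, "down", "-v"]);
}
```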
Generic browser interaction (fallback — for apps without their own MCP tools):
| Tool | Purpose |
|---|---|
| `click` | Click an element by selector. |
| `type` | Type text into an input by selector. |
| `scroll` | Scroll the page or an element. |
| `wait_for` | Wait for a selector, network idle, or timeout. |
| `get_dom` | Return current DOM snapshot (for agent reasoning about page state). |
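These fallback tools are thin wrappers over a headless browser. A minimal sketch, assuming Playwright (already used for screenshots and cookie injection); session handling is deliberately simplified here:

```typescript
import { chromium, type Page } from "playwright";

// Sketch of the fallback layer: selector-based, one shared page per session.
let page: Page | undefined;

export async function navigate(url: string) {
  if (!page) {
    const browser = await chromium.launch();
    page = await (await browser.newContext()).newPage();
  }
  const response = await page.goto(url, { waitUntil: "networkidle" });
  return { finalUrl: page.url(), status: response?.status() ?? 0 };
}

export async function click(selector: string) {
  await page!.click(selector);
}

export async function type(selector: string, text: string) {
  await page!.fill(selector, text);
}
```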
Not in Crucible — belongs to the app:
Domain-specific tools like create_annotation, submit_review, add_to_cart, create_user — these are semantically meaningful actions that only the app knows about. Crucible can't and shouldn't try to own these. Apps that want the best agentic QA experience should expose their own MCP tools for the actions agents need to perform during testing.
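For apps that take this route, a hedged sketch of what exposing one domain tool could look like with the MCP TypeScript SDK. The tool name mirrors Foundry's create_annotation; `createAnnotation` stands in for the app's own logic and is purely hypothetical:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// Stand-in for the app's real annotation logic; not part of Crucible.
declare function createAnnotation(documentId: string, body: string): Promise<{ id: string }>;

const server = new McpServer({ name: "foundry", version: "0.1.0" });

server.tool(
  "create_annotation",
  "Attach a review annotation to a document",
  { documentId: z.string(), body: z.string() },
  async ({ documentId, body }) => {
    const annotation = await createAnnotation(documentId, body);
    return { content: [{ type: "text", text: `Created annotation ${annotation.id}` }] };
  },
);
```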
Exposure Strategy
| Environment | Available? | How? |
|---|---|---|
| Development | Yes | Always-on when Crucible MCP is configured |
| Staging | N/A | Crucible is a local tool, no hosted staging |
| Production | N/A | Same — no hosted production |
Why This Matters
Traditional visual regression is expensive: someone writes Playwright scripts, maintains them, babysits CI. AI-assisted testing (Momentic, Carbonate) just makes script authoring faster — the model is still scripted, still brittle.
Agentic QA with Crucible is fundamentally different: the implementation agent commits its work, then a separate QA agent verifies it — independent verification, not self-review. The QA agent uses domain-specific tools to interact with the app at a semantic level, uses Crucible to see the result, and applies judgment to decide whether it looks right. Meanwhile, a parallel regression agent sweeps all existing baselines for drift. No scripts to maintain. No selectors to update. The agents adapt.
Agentic QA Pipeline
This is the end-to-end flow that Crucible enables. A separate QA agent verifies the implementation agent's work — independent verification, not self-review.
The Pipeline
- Orchestrator agent receives a feature request
- Fires implementation sub-agent → writes code, runs unit tests, commits
- Fires two QA sub-agents in parallel:
- Feature QA agent — focused on the specific change. Gets a prompt template with success criteria, what changed, and targeted pages. Verifies the feature works as intended.
- Regression QA agent — focused on everything else. Discovers all existing baselines via `list_baselines` and sweeps them for drift. No manual test list — the baseline store IS the regression suite.
- Both QA agents use app MCP tools + Crucible eyes to verify
- Both return structured markdown reports → orchestrator decides: merge, or fire a fix agent
- If issues found: fix agent makes changes → QA agents run again → loop until clean
- Escalation threshold: if 3 iterations produce no progress, escalate to human (`NEEDS_HUMAN` verdict); a sketch of the loop follows this list
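A minimal sketch of this loop from the orchestrator's side. `runAgent`, `parseVerdict`, and the prompt helpers are hypothetical placeholders for however sub-agents are actually dispatched:

```typescript
// Hypothetical placeholders for dispatching sub-agents and reading their reports.
declare function runAgent(role: string, prompt: string): Promise<string>;
declare function parseVerdict(report: string): "PASS" | "ISSUES_FOUND" | "NEEDS_HUMAN";
declare function featureQaPrompt(featureRequest: string): string;
declare function regressionQaPrompt(): string;

async function qaLoop(featureRequest: string): Promise<"PASS" | "NEEDS_HUMAN"> {
  await runAgent("implementation", featureRequest);          // writes code, runs unit tests, commits

  for (let attempt = 1; attempt <= 3; attempt++) {
    // Feature QA and regression QA run in parallel; each returns a markdown report.
    const reports = await Promise.all([
      runAgent("feature-qa", featureQaPrompt(featureRequest)),
      runAgent("regression-qa", regressionQaPrompt()),
    ]);
    const verdicts = reports.map(parseVerdict);

    if (verdicts.includes("NEEDS_HUMAN")) return "NEEDS_HUMAN";
    if (verdicts.every((v) => v === "PASS")) return "PASS";   // orchestrator merges

    // ISSUES_FOUND: fire a fix agent with the findings as context, then re-run QA.
    await runAgent("fix", reports.join("\n\n"));
  }
  return "NEEDS_HUMAN";                                       // circuit breaker after 3 iterations
}
```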
Feature QA Prompt Template
The orchestrator fills this in based on what the implementation agent changed. The QA agent treats success criteria as must-pass and the "also check" section as exploratory. No scripted steps — the agent uses judgment.
## QA Objective
Verify [feature description]
## What Changed
[Orchestrator fills this from the implementation agent's output —
files changed, components affected, summary of the change]
## Where to Look
- Primary pages: [URLs where the feature change is visible]
- Entry point: [how to navigate to the feature]
- Related pages: [pages that might be affected by the change]
## Success Criteria
- [ ] [specific visual/functional check]
- [ ] [specific visual/functional check]
- [ ] [specific visual/functional check]
## Also Check
- Regression in [related areas]
- [Known fragile areas from baselines]
## Available Tools
- App MCP tools: [list of domain-specific tools available]
- Crucible: navigate, screenshot_page, compare_screenshots, approve_baseline
- Crucible baselines: [list of existing baselines for this project]
Regression QA Prompt Template
The regression agent discovers its test suite from the baseline store — no manual maintenance needed. New baselines appear automatically as features are approved and merged.
## Regression Suite — [Project Name]
Run `list_baselines(project: "[project]")` to get the full baseline list.
For each baseline:
1. Navigate to the baseline's URL
2. Screenshot the page
3. Compare against the stored baseline
4. Flag any diffs that exceed tolerance
## Known Fragile Areas
- [Manually curated warnings about areas prone to false positives]
## Verdict Rules
- All baselines match: PASS
- Any baseline drifted unexpectedly: ISSUES_FOUND (include evidence)
- Uncertain whether drift is intentional: NEEDS_HUMAN
This starts as a markdown template, not a skill. Validate the pattern with Foundry as the first consumer. Codify into a reusable skill once the pattern proves out.
QA Report Structure
Both QA agents return this format to the orchestrator. Markdown — LLMs read/write it natively, humans can scan it directly, no JSON parsing needed. Revisit only if a programmatic consumer appears.
## Verdict: PASS | ISSUES_FOUND | NEEDS_HUMAN
## Findings
### [Finding 1 — severity: high/medium/low]
- **What:** [description of the issue]
- **Where:** [page/component/URL]
- **Evidence:** [screenshot path, baseline path, diff score]
- **Suggested fix:** [if the agent has an opinion]
### [Finding 2...]
## Baselines
- **Updated:** [baselines intentionally updated — agent judged the change as correct]
- **New:** [new baselines proposed for new pages/components]
- **Failed:** [baselines where the diff exceeded tolerance]
## Coverage
- **Pages visited:** [list of URLs]
- **MCP tools used:** [app tools + Crucible tools invoked]
- **Areas not checked:** [anything the agent couldn't reach or chose to skip]
Verdicts:
- PASS — all success criteria met, no regressions detected, baselines match or intentionally updated
- ISSUES_FOUND — specific problems identified with evidence. Orchestrator fires fix agents per finding.
- NEEDS_HUMAN — agent isn't confident enough to pass or fail. Better to escalate than silently pass something wrong.
Retry loop: orchestrator reads findings → fires fix agent with finding details as context → fix agent commits → QA agents run again. Max 3 iterations before escalating to human.
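Because the report is markdown, the orchestrator only needs to pull out the verdict line to decide what to do next. A sketch, assuming the report begins with the `## Verdict:` header shown above:

```typescript
// Sketch: extract the verdict from a QA report that starts with "## Verdict: ...".
function parseVerdict(report: string): "PASS" | "ISSUES_FOUND" | "NEEDS_HUMAN" {
  const match = report.match(/^## Verdict:\s*(PASS|ISSUES_FOUND|NEEDS_HUMAN)/m);
  // If the verdict line is missing or malformed, escalate rather than silently pass.
  return (match?.[1] as "PASS" | "ISSUES_FOUND" | "NEEDS_HUMAN" | undefined) ?? "NEEDS_HUMAN";
}
```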
Data Model
System Architecture
Overview
What Is This?
Crucible is an agentic QA framework — it gives AI agents the ability to see what they changed and verify it visually. It combines three concerns:
- Eyes — Headless browser screenshots, pixel-level diffs against stored baselines, baseline approval workflow. The agent can see the UI and compare it against known-good states.
- Harness — Docker-compose orchestration, deterministic seed layer, parallel isolation. The agent can boot a known-good environment every time, the same way.
- Generic browser interaction (fallback) — Navigate, click, type, scroll. Available for apps that don't have their own MCP tools, but positioned as training wheels, not the primary interaction model.
The key insight: Crucible does NOT try to be a Playwright wrapper. The best agentic QA happens when agents interact with apps through domain-specific MCP tools — Foundry's create_annotation, not click at (x, y). Crucible provides the visual verification layer; your app's own tools provide the semantic interaction. For apps without MCP tools, Crucible's generic browser tools are a workable fallback.
This is a fundamentally different approach from existing AI testing tools, which focus on "AI writes Playwright scripts faster." In Crucible's model, the agent is the test runner — it uses judgment, explores, and decides what to verify. No scripted steps, no brittle selectors, no predetermined assertions.
Why It Exists
Agents can write code and run tests, but they're blind to UI. Every feature pipeline ends at the same handoff: "…and now the human visually QAs it and merges." That's the bottleneck.
Existing AI testing tools don't solve this — they help humans write scripts faster (Momentic, Carbonate, QA Wolf) or do visual diffing without agent integration (Applitools, Percy). Nobody has built the combination of:
- App-specific MCP tools for semantically meaningful interaction (not blind pixel clicking)
- Visual verification with baseline management (screenshot + diff + approval)
- Agent judgment for exploration, prioritization, and pass/fail decisions
The market is full of "AI helps you write tests faster." It's empty on "AI autonomously QAs using domain-specific tools + visual verification." That's the gap Crucible fills.
Crucible closes the loop: implementation agent opens a PR → Crucible boots a clean environment → the QA agent uses the app's own MCP tools to interact and Crucible's eyes to verify → returns pass/fail with visual evidence → orchestrator merges or fires a fix agent.
Who Is It For?
- AI coding agents running feature pipelines that need to verify UI changes — the primary audience
- Developers who want agentic QA for their web projects without writing Playwright scripts
- Apps with MCP tools — Crucible is most powerful here. The agent uses your app's semantic tools for interaction and Crucible for visual verification
- Apps without MCP tools — Crucible's generic browser fallback tools still work. Lower fidelity, but zero setup beyond a URL
Target Use Cases
| Project | How Crucible + App MCP Tools Work Together |
|---|---|
| Foundry | Agent uses Foundry MCP tools (create_annotation, submit_review, etc.) to interact, Crucible to screenshot and verify review panel rendering, annotation threading, markdown display |
| Routr / CNC app | Agent uses app tools to create toolpaths and configure jobs, Crucible to verify 2D/3D canvas rendering matches baselines |
| Any web app with MCP tools | Agent uses domain tools for semantic interaction + Crucible for visual verification. Best experience. |
| Any web app without MCP tools | Agent uses Crucible's generic browser tools (click, type) + Crucible's eyes. Works out of the box with just a URL. |
Business Model
Internal-first, open-source-second. Crucible is built to unblock Claymore's agent pipelines. Once stable, it's a strong candidate for @claymore-dev publication alongside Foundry and Anvil — agentic QA with first-class visual verification is genuinely differentiated and unoccupied territory in the MCP ecosystem.
No direct monetization planned for v1. Long-term, hosted baseline storage + team-shared approval workflow is a plausible paid tier.
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-04-11 | Renamed from Lookout to Crucible | On-brand with Foundry/Anvil/Claymore metalworking lineup; expanded scope needed a bigger name ("crucible" = severe test) | Caliper, Dial, Jig, Plumb |
| 2026-04-11 | Merged "orchestration harness" (originally foundry E13) with visual regression into one project | Two halves of the same problem — eyes and a repeatable environment. Separate projects would duplicate Docker/MCP/session code | Keep harness as a separate sister project |
| 2026-04-11 | MCP-native from day one, not CLI-first | Agents are the primary users. CLI is a fallback for humans. | CLI-first with MCP wrapper later |
| 2026-04-11 | Docker-based harness, not host-native | Reproducibility and parallel isolation matter more than 30-60s cold-start cost | Host-native scripts with clean-state hooks |
| 2026-04-11 | Baselines stored on local filesystem, not cloud | Simplicity for v1; local-first matches the Claymore stack | S3, Git LFS, cloud-hosted baseline service |
| 2026-04-13 | Reframed from "test harness + visual regression" to "agentic QA framework" | Market research showed the space is full of "AI writes scripts faster" but empty on "AI autonomously QAs using domain tools + visual verification." The agent-as-test-runner model is genuinely novel and unoccupied. | Keep the scripted spec runner approach |
| 2026-04-13 | Three-layer interaction model: eyes (core) + harness (core) + generic browser (fallback) | Crucible is most powerful when apps have their own MCP tools. But requiring app MCP tools kills adoption. Generic browser tools serve as training wheels. | (A) Eyes-only, require Playwright MCP alongside; (B) Full browser tools as primary; (C) Tiered ← chosen |
| 2026-04-13 | Removed scripted spec runner from core architecture | The agent receives a QA prompt + success criteria and uses judgment. No deterministic step-by-step execution. | Keep spec runner as an alternative mode |
| 2026-04-13 | Two-agent QA model: feature QA + regression QA in parallel | Separating concerns keeps scope manageable. Feature agent checks the specific change; regression agent sweeps all baselines. Run in parallel for speed. Neither agent is overwhelmed. | Single agent does both feature + regression checking |
| 2026-04-13 | Regression suite derived from baseline store, not manually maintained | list_baselines returns all baselines — that IS the regression suite. No manual curation, no drift. New baselines appear automatically via approve_baseline. | Manually maintained regression prompt (prone to drift) |
| 2026-04-13 | Separate QA agent verifies implementation agent's work | Independent verification is more trustworthy than self-review. The agent that wrote the code should not be the agent that judges it. | Same agent implements and verifies |
| 2026-04-13 | QA prompt template (markdown, not a skill) | Start with a simple markdown template. Validate with Foundry first. Codify into a skill only after the pattern proves out. | Claude Code skill from day one, structured YAML |
| 2026-04-13 | Structured QA report in markdown with PASS/ISSUES_FOUND/NEEDS_HUMAN verdicts | Report must be actionable for the orchestrator. Markdown because LLMs read/write it natively and humans can scan it. JSON only if a programmatic consumer appears. | JSON report, simple pass/fail boolean |
| 2026-04-13 | 3-iteration retry limit before human escalation | Autonomous fix loops need a circuit breaker. 3 fix→QA cycles with no progress → escalate. | No limit (risky), 1 attempt (too conservative) |
| 2026-04-13 | Baseline drift: QA agent can propose updates | Agent judges "changed but correct" → proposes baseline update in report. Orchestrator/human approves batch. | Only humans can update baselines |
| 2026-04-13 | Screenshot output: save to file, return path | Agents need to see to judge correctness. Save PNG to file, return path + metadata. No base64 in MCP response. | Return base64 inline (overflows), metadata-only (can't judge) |
| 2026-04-13 | Auth: no-auth + cookie injection for v1 | Most test envs don't need auth. Cookie injection via Playwright's context.addCookies() for those that do. storageState export as OAuth workaround. | Build OAuth automation for v1 |
| 2026-04-13 | Dynamic content masking deferred to v2 | Agent judgment + networkidle + prompt conventions handle 80%. Build masking when real usage shows what breaks. | Build masking infrastructure for v1 |
| 2026-04-13 | Computer-use integration deferred | compare_screenshots already accepts pngBase64 from any source. Formal integration when a native-app QA use case arises. | Build integration for v1 |
Epic Index
| Epic | Status | Summary |
|---|---|---|
| E1: Eyes MVP | Done (v0.1) | Screenshot + pixelmatch diff + baseline approval. 4 MCP tools, 26 tests, validated end-to-end. |
| E2: Harness MVP | Idea | Docker compose up/down + seed + healthcheck via boot_project/teardown_project |
| E3: MCP Tool Surface | Partially done | Eyes tools shipped in v0.1. Harness + fallback browser tools in v0.2. |
| E4: Project Adapter Format | Idea | YAML adapter schema + zod validation + loader |
| E5: QA Prompt/Skill | Reframed | Originally a deterministic spec executor. Now: a reusable prompt template or Claude Code skill that gives an agent a QA objective + success criteria. The agent uses judgment, not a script. |
| E6: Parallel Isolation | Idea | COMPOSE_PROJECT_NAME + port offset + per-agent state dirs |
| E7: Generic Browser Fallback | Idea | click, type, scroll, wait_for, get_dom — for apps without their own MCP tools |
| E8: Foundry Dogfood | Idea | First real consumer — agent uses Foundry MCP tools + Crucible eyes to QA Foundry's review panel, annotations, markdown rendering |