Crucible — Project-Agnostic Test Harness & Visual Regression MCP

One-line pitch: An agentic QA framework that gives AI agents eyes. Crucible provides visual verification, baseline management, and repeatable test environments — while agents use your app's own MCP tools (or generic browser tools as a fallback) for interaction. The agent is the test runner.

Status: Refining — v0.1 "Eyes Only" validated end-to-end; vision reframed around the agentic QA insight.
Created: 2026-04-11
Updated: 2026-04-13 — major reframe from "Playwright wrapper + spec runner" to "visual verification layer for agentic QA"
Third package in the @claymore-dev suite, alongside Foundry (doc review) and Anvil (semantic search).


Scope & Roadmap

v0.1 — "Eyes Only" ✅

  • 4 MCP tools: navigate, screenshot_page, compare_screenshots, approve_baseline
  • Hardcoded for Foundry test-env, no adapter format
  • Validated 2026-04-13: Full eyes flow working end-to-end via live MCP. Pixel-deterministic — 0 diff pixels across consecutive screenshots of the same page. Two baselines established (homepage, sample-doc-full).

v0.2 — "Harness + Fallback Browser"

  • E2 (Harness MVP): boot_project, teardown_project, seed, healthcheck
  • E4 (Adapter format): YAML project adapter schema
  • Generic browser fallback tools: click, type, scroll, wait_for
  • Success criterion: boot_project("foundry") → harnessed instance → agent uses Foundry MCP tools to interact + Crucible eyes to verify → teardown_project
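The adapter format (E4) has not been designed yet. Purely as a sketch, a YAML project adapter might declare something like the following — every key name here is an assumption, not the frozen schema:

```yaml
# Hypothetical project adapter for Foundry — the real E4 schema is not yet frozen.
name: foundry
compose_file: ./docker-compose.test.yml   # assumed key; what boot_project would run
entry_url: http://localhost:3000          # returned to the agent after boot
seed_command: pnpm run seed:test          # invoked by the seed tool
healthcheck:
  path: /api/health
  timeout_ms: 30000
baselines_dir: .crucible/baselines/foundry
```

The v1.0 success criterion ("a non-Foundry project's first run with no changes to Crucible core") is what ultimately decides which of these keys survive.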

v0.3 — "Agentic QA"

  • E6 (Parallel isolation): multiple agents can run Crucible simultaneously
  • E5 (QA Prompt/Skill): a reusable prompt template or Claude Code skill that orchestrates the full QA flow — "verify this branch against these baselines using available MCP tools"
  • E8 (Foundry dogfood): Foundry's review panel, annotation threading, and markdown rendering are covered by agentic QA
  • Success criterion: agent receives a QA prompt, autonomously explores Foundry using Foundry MCP tools + Crucible eyes, and produces a structured pass/fail report with visual evidence
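The parallel-isolation scheme (E6) is still at the idea stage; per the Epic Index it combines COMPOSE_PROJECT_NAME, a port offset, and per-agent state dirs. A hypothetical per-agent env file (illustrative names only):

```
# .env.agent-1 — hypothetical; E6 is not yet designed
COMPOSE_PROJECT_NAME=crucible-foundry-agent1   # isolates containers/networks/volumes
CRUCIBLE_PORT_OFFSET=100                       # app served on 3100 instead of 3000
CRUCIBLE_STATE_DIR=.crucible/state/agent1      # per-agent baselines and screenshots
```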

v1.0 — "Second Consumer"

  • Onboard a second project (likely Routr or GMPPU race strategy)
  • Freeze the adapter format
  • Publish @claymore-dev/crucible on npm
  • Success criterion: a non-Foundry project's first agentic QA run produces useful results with no changes to Crucible core

Non-Goals (v1)

  • Not a unit test framework — vitest/jest/etc. own that layer
  • Not a Playwright replacement — Crucible exposes browser tools as a fallback, not as a primary interface. Apps should build their own MCP tools for the best experience.
  • Not a script runner — no deterministic step-by-step spec execution. The agent uses judgment, not a script.
  • Not a CI service — it's invoked locally by agents; CI integration is an optional extension
  • Not a visual design tool — no Figma-style compare, just pixel diffs
  • Not cross-device testing — desktop browsers only for v1 (mobile emulation is a fast-follow)

Competitive Landscape

Researched 2026-04-13. The agentic QA space is crowded with "AI writes scripts faster" but almost empty on "AI autonomously explores and finds issues."

| Tool / Category | What It Does | Agentic? | Visual? | MCP-Aware? |
| --- | --- | --- | --- | --- |
| QA Wolf, Momentic, Carbonate | AI translates natural language → Playwright scripts | No — AI is the author, not the tester | No | No |
| Applitools, Percy, Chromatic | Visual regression diffing | No — verification layer only, no interaction | Yes | No |
| Octomind | AI discovers and generates e2e tests by crawling | Partial — outputs scripted tests | No | No |
| LaVague, BrowserUse | LLM-driven browser agents (see page, decide action) | Yes — but generic browser-level, no domain tools | Screenshot-based | No |
| Playwright MCP (Microsoft) | MCP server wrapping Playwright browser control | Infrastructure only | No | Yes |
| Crucible | Visual verification + baseline management + harness, designed for agents using app-specific MCP tools | Yes — agent-as-test-runner with domain-tool interaction | Yes — core | Yes — native |

The gap Crucible fills: No existing tool combines app-specific MCP tools for semantic interaction + visual verification + agent judgment in an autonomous QA loop. The pieces exist separately; nobody has composed them.


Open Questions

Most of the original open questions were resolved in the 2026-04-13 design review. Remaining:

  • QA prompt template validation: The markdown template needs to be tested with a real Foundry QA run. Does the agent get enough guidance from the template, or does it need more structure? First test will tell us.
  • Report format portability: The QA report is markdown now. Should it also have a structured JSON representation for programmatic consumption by orchestrators? Or is markdown parsing good enough?
  • Baseline approval UX for humans: When the QA agent proposes baseline updates, how does the human review them? Just look at the PNGs on disk? A diff viewer? This matters once baselines start accumulating.
  • Multi-page QA sessions: How does the agent decide which pages to visit during exploratory QA? Does the prompt template need to list known routes, or can the agent discover them from navigation/sitemap?

Risks & Constraints

Security Model

AI Interface Architecture

This section is the whole point of the project.

The Interaction Model — Three Layers

Agentic QA involves three layers of tooling. Crucible's design is built around where each layer's boundary falls:

| Layer | What It Does | Who Owns It | Examples |
| --- | --- | --- | --- |
| Eyes (visual verification) | Screenshot, diff, baseline management, verdicts | Crucible (core) | screenshot_page, compare_screenshots, approve_baseline |
| Harness (environment) | Boot containerized env, seed, teardown, healthcheck | Crucible (core) | boot_project, teardown_project, seed |
| Generic browser interaction | Navigate, click, type, scroll, wait | Crucible (fallback) | navigate, click, type, scroll |
| Domain-specific interaction | Semantically meaningful app actions | The app's own MCP tools | Foundry's create_annotation, submit_review |

The principle: Crucible is most powerful when agents interact through domain-specific MCP tools — they operate at the semantic level of the application, not the pixel level. An agent that calls create_annotation("This needs review") understands what it's doing in a way that click(452, 318) never can.

But not every app has MCP tools. Crucible's generic browser interaction layer (navigate, click, type, scroll) exists as a fallback — training wheels that make Crucible useful on day one for any web app with a URL. As apps mature into their own MCP tools, the agent naturally shifts to those.

App Maturity Spectrum

| App Maturity | Interaction Method | QA Fidelity | Setup Required |
| --- | --- | --- | --- |
| Has rich MCP tools (e.g. Foundry) | Agent uses app's own MCP tools | Highest — semantic interaction, the agent understands what it's doing | App MCP server + Crucible |
| Has some MCP tools | Mix of app tools + Crucible browser fallback | High — semantic where possible, pixel-level where not | Partial app MCP + Crucible |
| No MCP tools, just a URL | Crucible's generic browser tools (click, type, scroll) | Moderate — works but fragile, selector-dependent | Crucible only |

MCP Tool Surface

Eyes (core — always available):

| Tool | Purpose |
| --- | --- |
| navigate | Go to a URL. Launches browser on first call. Returns final URL + status. |
| screenshot_page | Full-page or viewport PNG screenshot. Cached in-session for subsequent compare. |
| screenshot_element | Element-scoped screenshot by selector. |
| compare_screenshots | Diff a screenshot against a stored baseline. Returns match score + verdict. |
| approve_baseline | Write current screenshot as the baseline for a project/spec. |
| list_baselines | List baselines for a project/spec. |
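v0.1 uses pixelmatch for the diff; the surrounding score-and-verdict logic can be sketched in plain TypeScript. The thresholds, type names, and channel-tolerance handling below are illustrative, not the shipped implementation:

```typescript
// Illustrative diff → verdict logic; the real tool wraps pixelmatch over PNG buffers.
type DiffVerdict = "match" | "drift";

interface DiffResult {
  diffPixels: number;
  totalPixels: number;
  matchScore: number; // 1.0 = pixel-identical
  verdict: DiffVerdict;
}

// Compare two same-sized RGBA buffers; a pixel "differs" when any channel
// deviates by more than `channelTolerance` (slack for anti-aliasing noise).
function compareBuffers(
  a: Uint8Array,
  b: Uint8Array,
  width: number,
  height: number,
  channelTolerance = 0,
  driftThreshold = 0.001, // more than 0.1% differing pixels → drift
): DiffResult {
  const totalPixels = width * height;
  let diffPixels = 0;
  for (let p = 0; p < totalPixels; p++) {
    const i = p * 4;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[i + c] - b[i + c]) > channelTolerance) {
        diffPixels++;
        break; // count each pixel at most once
      }
    }
  }
  const matchScore = 1 - diffPixels / totalPixels;
  return {
    diffPixels,
    totalPixels,
    matchScore,
    verdict: diffPixels / totalPixels > driftThreshold ? "drift" : "match",
  };
}
```

The "0 diff pixels across consecutive screenshots" result from the v0.1 validation is what makes a tight default threshold viable.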

Harness (core — available when project adapters are configured):

| Tool | Purpose |
| --- | --- |
| boot_project | Boot a project by adapter name. Returns handle + entry URL. |
| teardown_project | Clean up containers, networks, volumes for a handle. |
| seed | Run the project's seed command inside the harnessed container. |
| healthcheck | Probe the harnessed app until it responds or times out. |
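healthcheck is essentially a poll-until-ready loop. A sketch with an injectable probe — the function name, option names, and defaults here are assumptions, not Crucible's actual API:

```typescript
// Poll `probe` until it reports ready or the deadline passes.
// The probe is injected so the same loop works for HTTP, TCP, or container checks.
async function waitForHealthy(
  probe: () => Promise<boolean>,
  { timeoutMs = 30_000, intervalMs = 500 } = {},
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      if (await probe()) return true;
    } catch {
      // app not up yet — treat connection errors as "not ready"
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false;
}
```

Swallowing probe errors matters: during Docker cold start the first several probes typically fail with connection refused, which is expected rather than fatal.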

Generic browser interaction (fallback — for apps without their own MCP tools):

| Tool | Purpose |
| --- | --- |
| click | Click an element by selector. |
| type | Type text into an input by selector. |
| scroll | Scroll the page or an element. |
| wait_for | Wait for a selector, network idle, or timeout. |
| get_dom | Return current DOM snapshot (for agent reasoning about page state). |

Not in Crucible — belongs to the app:

Domain-specific tools like create_annotation, submit_review, add_to_cart, create_user — these are semantically meaningful actions that only the app knows about. Crucible can't and shouldn't try to own these. Apps that want the best agentic QA experience should expose their own MCP tools for the actions agents need to perform during testing.

Exposure Strategy

| Environment | Available? | How? |
| --- | --- | --- |
| Development | Yes | Always-on when Crucible MCP is configured |
| Staging | N/A | Crucible is a local tool, no hosted staging |
| Production | N/A | Same — no hosted production |

Why This Matters

Traditional visual regression is expensive: someone writes Playwright scripts, maintains them, babysits CI. AI-assisted testing (Momentic, Carbonate) just makes script authoring faster — the model is still scripted, still brittle.

Agentic QA with Crucible is fundamentally different: the implementation agent commits its work, then a separate QA agent verifies it — independent verification, not self-review. The QA agent uses domain-specific tools to interact with the app at a semantic level, uses Crucible to see the result, and applies judgment to decide whether it looks right. Meanwhile, a parallel regression agent sweeps all existing baselines for drift. No scripts to maintain. No selectors to update. The agents adapt.


Agentic QA Pipeline

This is the end-to-end flow that Crucible enables. A separate QA agent verifies the implementation agent's work — independent verification, not self-review.

The Pipeline

  1. Orchestrator agent receives a feature request
  2. Fires implementation sub-agent → writes code, runs unit tests, commits
  3. Fires two QA sub-agents in parallel:
    • Feature QA agent — focused on the specific change. Gets a prompt template with success criteria, what changed, and targeted pages. Verifies the feature works as intended.
    • Regression QA agent — focused on everything else. Discovers all existing baselines via list_baselines and sweeps them for drift. No manual test list — the baseline store IS the regression suite.
  4. Both QA agents use app MCP tools + Crucible eyes to verify
  5. Both return structured markdown reports → orchestrator decides: merge, or fire a fix agent
  6. If issues found: fix agent makes changes → QA agents run again → loop until clean
  7. Escalation threshold: if 3 iterations produce no progress, escalate to human (NEEDS_HUMAN verdict)
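Steps 5–7 amount to a retry loop with a circuit breaker. A minimal sketch — `runQa` and `runFix` stand in for firing sub-agents, and the types are illustrative:

```typescript
// Sketch of the fix → QA loop with the 3-iteration circuit breaker.
type Verdict = "PASS" | "ISSUES_FOUND" | "NEEDS_HUMAN";
interface QaReport { verdict: Verdict; findings: string[]; }

async function qaLoop(
  runQa: () => Promise<QaReport>,
  runFix: (findings: string[]) => Promise<void>,
  maxIterations = 3,
): Promise<Verdict> {
  for (let i = 0; i < maxIterations; i++) {
    const report = await runQa();
    if (report.verdict !== "ISSUES_FOUND") return report.verdict; // PASS or NEEDS_HUMAN
    await runFix(report.findings); // fix agent commits, then QA runs again
  }
  return "NEEDS_HUMAN"; // no clean run after maxIterations → escalate
}
```

Returning NEEDS_HUMAN on exhaustion (rather than failing silently) matches the escalation-over-silent-pass stance in the Verdict rules.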

Feature QA Prompt Template

The orchestrator fills this in based on what the implementation agent changed. The QA agent treats success criteria as must-pass and the "also check" section as exploratory. No scripted steps — the agent uses judgment.

## QA Objective
Verify [feature description]

## What Changed
[Orchestrator fills this from the implementation agent's output —
files changed, components affected, summary of the change]

## Where to Look
- Primary pages: [URLs where the feature change is visible]
- Entry point: [how to navigate to the feature]
- Related pages: [pages that might be affected by the change]

## Success Criteria
- [ ] [specific visual/functional check]
- [ ] [specific visual/functional check]
- [ ] [specific visual/functional check]

## Also Check
- Regression in [related areas]
- [Known fragile areas from baselines]

## Available Tools
- App MCP tools: [list of domain-specific tools available]
- Crucible: navigate, screenshot_page, compare_screenshots, approve_baseline
- Crucible baselines: [list of existing baselines for this project]

Regression QA Prompt Template

The regression agent discovers its test suite from the baseline store — no manual maintenance needed. New baselines appear automatically as features are approved and merged.

## Regression Suite — [Project Name]
Run `list_baselines(project: "[project]")` to get the full baseline list.
For each baseline:
1. Navigate to the baseline's URL
2. Screenshot the page
3. Compare against the stored baseline
4. Flag any diffs that exceed tolerance

## Known Fragile Areas
- [Manually curated warnings about areas prone to false positives]

## Verdict Rules
- All baselines match: PASS
- Any baseline drifted unexpectedly: ISSUES_FOUND (include evidence)
- Uncertain whether drift is intentional: NEEDS_HUMAN

This starts as a markdown template, not a skill. Validate the pattern with Foundry as the first consumer. Codify into a reusable skill once the pattern proves out.
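The per-baseline sweep in the template reduces to a loop over the baseline store. In the sketch below, `listBaselines`, `screenshotAt`, and `compare` are injected stand-ins for the MCP calls — names and shapes are assumptions, not Crucible's client API:

```typescript
// Regression sweep: the baseline store IS the suite.
interface Baseline { name: string; url: string; }
interface SweepTools {
  listBaselines: (project: string) => Promise<Baseline[]>;
  screenshotAt: (url: string) => Promise<Uint8Array>;
  compare: (baselineName: string, png: Uint8Array) => Promise<number>; // match score
}

async function regressionSweep(
  project: string,
  tools: SweepTools,
  tolerance = 0.999, // scores below this are flagged as drift
): Promise<{ passed: string[]; drifted: string[] }> {
  const passed: string[] = [];
  const drifted: string[] = [];
  for (const b of await tools.listBaselines(project)) {
    const png = await tools.screenshotAt(b.url);
    const score = await tools.compare(b.name, png);
    (score >= tolerance ? passed : drifted).push(b.name);
  }
  return { passed, drifted };
}
```

The point of the sketch: nothing here names individual pages. New baselines approved via approve_baseline join the sweep automatically on the next run.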

QA Report Structure

Both QA agents return this format to the orchestrator. Markdown — LLMs read/write it natively, humans can scan it directly, no JSON parsing needed. Revisit only if a programmatic consumer appears.

## Verdict: PASS | ISSUES_FOUND | NEEDS_HUMAN

## Findings
### [Finding 1 — severity: high/medium/low]
- **What:** [description of the issue]
- **Where:** [page/component/URL]
- **Evidence:** [screenshot path, baseline path, diff score]
- **Suggested fix:** [if the agent has an opinion]

### [Finding 2...]

## Baselines
- **Updated:** [baselines intentionally updated — agent judged the change as correct]
- **New:** [new baselines proposed for new pages/components]
- **Failed:** [baselines where the diff exceeded tolerance]

## Coverage
- **Pages visited:** [list of URLs]
- **MCP tools used:** [app tools + Crucible tools invoked]
- **Areas not checked:** [anything the agent couldn't reach or chose to skip]

Verdicts:

  • PASS — all success criteria met, no regressions detected, baselines match or intentionally updated
  • ISSUES_FOUND — specific problems identified with evidence. Orchestrator fires fix agents per finding.
  • NEEDS_HUMAN — agent isn't confident enough to pass or fail. Better to escalate than silently pass something wrong.

Retry loop: orchestrator reads findings → fires fix agent with finding details as context → fix agent commits → QA agents run again. Max 3 iterations before escalating to human.


Data Model

System Architecture

Overview

What Is This?

Crucible is an agentic QA framework — it gives AI agents the ability to see what they changed and verify it visually. It combines three concerns:

  1. Eyes — Headless browser screenshots, pixel-level diffs against stored baselines, baseline approval workflow. The agent can see the UI and compare it against known-good states.
  2. Harness — Docker-compose orchestration, deterministic seed layer, parallel isolation. The agent can boot a known-good environment every time, the same way.
  3. Generic browser interaction (fallback) — Navigate, click, type, scroll. Available for apps that don't have their own MCP tools, but positioned as training wheels, not the primary interaction model.

The key insight: Crucible does NOT try to be a Playwright wrapper. The best agentic QA happens when agents interact with apps through domain-specific MCP tools — Foundry's create_annotation, not click at (x, y). Crucible provides the visual verification layer; your app's own tools provide the semantic interaction. For apps without MCP tools, Crucible's generic browser tools are a workable fallback.

This is a fundamentally different approach from existing AI testing tools, which focus on "AI writes Playwright scripts faster." In Crucible's model, the agent is the test runner — it uses judgment, explores, and decides what to verify. No scripted steps, no brittle selectors, no predetermined assertions.

Why It Exists

Agents can write code and run tests, but they're blind to UI. Every feature pipeline ends at the same handoff: "…and now the human visually QAs it and merges." That's the bottleneck.

Existing AI testing tools don't solve this — they help humans write scripts faster (Momentic, Carbonate, QA Wolf) or do visual diffing without agent integration (Applitools, Percy). Nobody has built the combination of:

  • App-specific MCP tools for semantically meaningful interaction (not blind pixel clicking)
  • Visual verification with baseline management (screenshot + diff + approval)
  • Agent judgment for exploration, prioritization, and pass/fail decisions

The market is full of "AI helps you write tests faster." It's empty on "AI autonomously QAs using domain-specific tools + visual verification." That's the gap Crucible fills.

Crucible closes the loop: implementation agent opens a PR → Crucible boots a clean environment → the QA agent uses the app's own MCP tools to interact and Crucible's eyes to verify → returns pass/fail with visual evidence → orchestrator merges or fires a fix agent.

Who Is It For?

  • AI coding agents running feature pipelines that need to verify UI changes — the primary audience
  • Developers who want agentic QA for their web projects without writing Playwright scripts
  • Apps with MCP tools — Crucible is most powerful here. The agent uses your app's semantic tools for interaction and Crucible for visual verification
  • Apps without MCP tools — Crucible's generic browser fallback tools still work. Lower fidelity, but zero setup beyond a URL

Target Use Cases

| Project | How Crucible + App MCP Tools Work Together |
| --- | --- |
| Foundry | Agent uses Foundry MCP tools (create_annotation, submit_review, etc.) to interact, Crucible to screenshot and verify review panel rendering, annotation threading, markdown display |
| Routr / CNC app | Agent uses app tools to create toolpaths and configure jobs, Crucible to verify 2D/3D canvas rendering matches baselines |
| Any web app with MCP tools | Agent uses domain tools for semantic interaction + Crucible for visual verification. Best experience. |
| Any web app without MCP tools | Agent uses Crucible's generic browser tools (click, type) + Crucible's eyes. Works out of the box with just a URL. |

Business Model

Internal-first, open-source-second. Crucible is built to unblock Claymore's agent pipelines. Once stable, it's a strong candidate for @claymore-dev publication alongside Foundry and Anvil — agentic QA with first-class visual verification is genuinely differentiated and unoccupied territory in the MCP ecosystem.

No direct monetization planned for v1. Long-term, hosted baseline storage + team-shared approval workflow is a plausible paid tier.


Decisions Log

| Date | Decision | Rationale | Alternatives Considered |
| --- | --- | --- | --- |
| 2026-04-11 | Renamed from Lookout to Crucible | On-brand with the Foundry/Anvil/Claymore metalworking lineup; expanded scope needed a bigger name ("crucible" = severe test) | Caliper, Dial, Jig, Plumb |
| 2026-04-11 | Merged "orchestration harness" (originally foundry E13) with visual regression into one project | Two halves of the same problem — eyes and a repeatable environment. Separate projects would duplicate Docker/MCP/session code | Keep harness as a separate sister project |
| 2026-04-11 | MCP-native from day one, not CLI-first | Agents are the primary users. CLI is a fallback for humans. | CLI-first with MCP wrapper later |
| 2026-04-11 | Docker-based harness, not host-native | Reproducibility and parallel isolation matter more than the 30-60s cold-start cost | Host-native scripts with clean-state hooks |
| 2026-04-11 | Baselines stored on local filesystem, not cloud | Simplicity for v1; local-first matches the Claymore stack | S3, Git LFS, cloud-hosted baseline service |
| 2026-04-13 | Reframed from "test harness + visual regression" to "agentic QA framework" | Market research showed the space is full of "AI writes scripts faster" but empty on "AI autonomously QAs using domain tools + visual verification." The agent-as-test-runner model is genuinely novel and unoccupied. | Keep the scripted spec runner approach |
| 2026-04-13 | Three-layer interaction model: eyes (core) + harness (core) + generic browser (fallback) | Crucible is most powerful when apps have their own MCP tools. But requiring app MCP tools kills adoption. Generic browser tools serve as training wheels. | (A) Eyes-only, require Playwright MCP alongside; (B) Full browser tools as primary; (C) Tiered ← chosen |
| 2026-04-13 | Removed scripted spec runner from core architecture | The agent receives a QA prompt + success criteria and uses judgment. No deterministic step-by-step execution. | Keep spec runner as an alternative mode |
| 2026-04-13 | Two-agent QA model: feature QA + regression QA in parallel | Separating concerns keeps scope manageable. Feature agent checks the specific change; regression agent sweeps all baselines. Run in parallel for speed. Neither agent is overwhelmed. | Single agent does both feature + regression checking |
| 2026-04-13 | Regression suite derived from baseline store, not manually maintained | list_baselines returns all baselines — that IS the regression suite. No manual curation, no drift. New baselines appear automatically via approve_baseline. | Manually maintained regression prompt (prone to drift) |
| 2026-04-13 | Separate QA agent verifies implementation agent's work | Independent verification is more trustworthy than self-review. The agent that wrote the code should not be the agent that judges it. | Same agent implements and verifies |
| 2026-04-13 | QA prompt template (markdown, not a skill) | Start with a simple markdown template. Validate with Foundry first. Codify into a skill only after the pattern proves out. | Claude Code skill from day one, structured YAML |
| 2026-04-13 | Structured QA report in markdown with PASS/ISSUES_FOUND/NEEDS_HUMAN verdicts | Report must be actionable for the orchestrator. Markdown because LLMs read/write it natively and humans can scan it. JSON only if a programmatic consumer appears. | JSON report, simple pass/fail boolean |
| 2026-04-13 | 3-iteration retry limit before human escalation | Autonomous fix loops need a circuit breaker. 3 fix→QA cycles with no progress → escalate. | No limit (risky), 1 attempt (too conservative) |
| 2026-04-13 | Baseline drift: QA agent can propose updates | Agent judges "changed but correct" → proposes baseline update in report. Orchestrator/human approves batch. | Only humans can update baselines |
| 2026-04-13 | Screenshot output: save to file, return path | Agents need to see to judge correctness. Save PNG to file, return path + metadata. No base64 in MCP response. | Return base64 inline (overflows), metadata-only (can't judge) |
| 2026-04-13 | Auth: no-auth + cookie injection for v1 | Most test envs don't need auth. Cookie injection via Playwright's context.addCookies() for those that do. storageState export as OAuth workaround. | Build OAuth automation for v1 |
| 2026-04-13 | Dynamic content masking deferred to v2 | Agent judgment + networkidle + prompt conventions handle 80%. Build masking when real usage shows what breaks. | Build masking infrastructure for v1 |
| 2026-04-13 | Computer-use integration deferred | compare_screenshots already accepts pngBase64 from any source. Formal integration when a native-app QA use case arises. | Build integration for v1 |

Epic Index

| Epic | Status | Summary |
| --- | --- | --- |
| E1: Eyes MVP | Done (v0.1) | Screenshot + pixelmatch diff + baseline approval. 4 MCP tools, 26 tests, validated end-to-end. |
| E2: Harness MVP | Idea | Docker compose up/down + seed + healthcheck via boot_project/teardown_project |
| E3: MCP Tool Surface | Partially done | Eyes tools shipped in v0.1. Harness + fallback browser tools in v0.2. |
| E4: Project Adapter Format | Idea | YAML adapter schema + zod validation + loader |
| E5: Spec Runner → QA Prompt/Skill | Reframed | Originally a deterministic spec executor. Now: a reusable prompt template or Claude Code skill that gives an agent a QA objective + success criteria. The agent uses judgment, not a script. |
| E6: Parallel Isolation | Idea | COMPOSE_PROJECT_NAME + port offset + per-agent state dirs |
| E7: Generic Browser Fallback | Idea | click, type, scroll, wait_for, get_dom — for apps without their own MCP tools |
| E8: Foundry Dogfood | Idea | First real consumer — agent uses Foundry MCP tools + Crucible eyes to QA Foundry's review panel, annotations, markdown rendering |
