E2: MCP Servers — Epic Design Doc
Status: 🔄 In Refinement (Step 0) · Authors: Dan Hannah & Clay · Created: 2026-04-18 · Parent: QuoteAI Project Design Doc
Overview
Goals & Non-Goals
Goals:
- Build two MCP servers: `@quoteai/equipment` and `@quoteai/quotes`
- Expose tools that let Claude Code (demo) or a future Anthropic-API-backed app (Full MVP) retrieve the exact context needed to assemble a draft quote
- Wire the servers into the Claude Code config so the demo experience works end-to-end
- Ship a `templates/brehob-quote.md` reference artifact extracted from the 4M Industries doc, so CC has the target template
Non-Goals:
- No pricing tools (pricing is the salesperson's job, per John)
- No write tools (no creating/updating quotes via MCP — CC or the app does that directly)
- No inventory/CRM tools (post-MVP)
- No auth on MCP servers (stdio transport only for Demo)
- No SSE/hosted transport (post-MVP)
Problem Statement
The MCP servers are the bridge between the ingested data (E1) and the generation layer (Claude Code in Demo, app in Full MVP). Without them, CC has no way to query the description library and will hallucinate equipment language instead of using John's proven phrasing.
Every MCP tool either makes the demo better (better retrieval = better draft) or adds friction. Ship the minimum set that serves the user flow in the main design doc; nothing else.
What Is This Epic?
Two independent MCP servers (stdio transport, TypeScript, @modelcontextprotocol/sdk) exposing the tools needed for quote assembly:
- `@quoteai/equipment` — product catalog search + lookup
- `@quoteai/quotes` — past-quote semantic search + description retrieval
Both servers share a Postgres connection to the pgvector DB populated by E1. CC is configured to load both during the demo.
Context
Dependents
- Claude Code demo experience — CC calls these tools every time it assembles a draft
- E3 (UI) — the UI triggers a flow that ends with CC assembling a draft via these tools
- Future Full MVP app — the in-app Anthropic API call uses the same tools
Dependencies
- E0 (Foundation) — DB connection layer, env conventions
- E1 (Ingestion + Vector DB) — hard dependency; MCP tools are useless without data
Current State
No MCP servers exist. The design doc sketches the tool surface; E2 builds it. Anvil's MCP server design (projects/anvil/epics/mcp-tools.md) is a useful reference pattern, but tools are QuoteAI-specific (not reused).
Affected Systems
| System / Layer | How It's Affected |
|---|---|
| `mcp-servers/equipment/` | Fully built — TypeScript package with MCP SDK |
| `mcp-servers/quotes/` | Fully built — TypeScript package with MCP SDK |
| Claude Code config | Updated to load both servers during demo |
| Postgres | Read-only queries from both servers |
| `templates/brehob-quote.md` | New reference artifact — the 4M template codified |
Design
Tool Surface
@quoteai/equipment
| Tool | Params | Returns | Purpose |
|---|---|---|---|
| `search_equipment` | query: string, cfm_min?: number, cfm_max?: number, psi_min?: number, psi_max?: number, hp_min?: number, hp_max?: number, top_k?: number | Array of { product, score, snippet } | Semantic + structured-filter search over product catalog. All numeric filters are min/max pairs — for exact-HP lookup CC passes hp_min=100, hp_max=100. |
| `get_product` | model: string | Full product row (all spec fields) | Exact model lookup |
| `get_specs` | models: string[] | Array of product rows, one per model | Side-by-side comparison for multi-option quotes |
@quoteai/quotes
| Tool | Params | Returns | Purpose |
|---|---|---|---|
| `search_past_quotes` | query: string, top_k?: number | Array of { quote, score, summary } | Find quotes similar to an overall project description |
| `get_quote` | quote_id: string | Full past_quotes row + line items | Full context for a specific past quote |
| `search_line_items` | query: string, top_k?: number | Array of { line_item, score, source_quote } | The description library query — finds proven language at line-item granularity. Vector-only for MVP (see note). |
Note on line-item filters. quote_line_items has no structured spec columns today — just description, quantity, prices, product_id FK, embedding, markdown (see db/migrations/001_init.sql). Haiku extracts HP/CFM/PSI as part of the description text, not as structured fields, and product_id isn't populated by the loader, so neither direct nor JOIN-based filtering works without schema changes + re-ingest. MVP accepts vector-only retrieval here — the verbatim description already contains literals like "100HP" / "oilless" / "food-grade" which the embedding picks up.
Post-demo follow-up — if vector-only underperforms on real queries, the cheapest structured filter to add is category?: string (e.g., "ROTARY SCREW AIR COMPRESSOR"). The extractor already produces this per line item via LineItemSchema.category; it just isn't stored. One ALTER + backfill (or re-ingest) adds it.
get_descriptions considered and dropped — it was a thin wrapper over search_line_items with a different input shape. CC can construct the equivalent query string directly; two tools doing similar things raises the odds of CC picking the wrong one.
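For concreteness, a minimal sketch of how the critical tool might be registered. The `McpServer` / `tool()` entry points follow the TypeScript SDK's documented shape, but the `searchLineItems` helper and the result formatting are assumptions, not the shipped implementation:

```typescript
// Illustrative only; helper names and result shaping are assumptions.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { searchLineItems } from "./search.js"; // hypothetical pgvector query helper

const server = new McpServer({ name: "quoteai-quotes", version: "0.1.0" });

server.tool(
  "search_line_items",
  "Find proven Brehob quote language at line-item granularity. Vector-only for MVP.",
  {
    query: z.string().describe("Free-text description of the line item needed"),
    top_k: z.number().int().positive().optional().describe("Max results (default 5)"),
  },
  async ({ query, top_k }) => {
    const hits = await searchLineItems(query, top_k ?? 5);
    // Empty results come back as an empty array, not an error (see Edge Cases).
    return { content: [{ type: "text" as const, text: JSON.stringify(hits, null, 2) }] };
  }
);

await server.connect(new StdioServerTransport());
```

The tool's description string is worth iterating on, since it is the main lever for nudging CC toward actually calling the tool (see Risks).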
Hybrid Search Logic (Critical)
Pure vector search misses filter-style constraints ("100HP", "food-grade"). Hybrid approach applies to search_equipment only — search_line_items is vector-only for MVP (see Tool Surface note).
When any structured filter is provided — filter-then-rank:
- Generate query embedding via OpenAI
- `SELECT ... WHERE <structured filters> ORDER BY embedding <=> $query_embedding LIMIT $top_k`
- Return top_k
Filter-first ensures hard constraints like "exactly 100HP food-grade" are enforced. The vector step ranks within the eligible set, so a qualifying row can't be missed because it fell outside the N nearest neighbors by embedding distance. This is the opposite of rank-then-filter — which was the earlier proposal but only works when filters are soft signals, not hard requirements.
When no structured filters — rank-only:
- Generate query embedding via OpenAI
- `SELECT ... ORDER BY embedding <=> $query_embedding LIMIT $top_k`
- Return top_k
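A minimal sketch of the filter-then-rank construction, assuming a products table with hp/cfm/psi columns and a pgvector embedding column; the column names and the `embed()` helper are assumptions, not the shipped code:

```typescript
import { Pool } from "pg";
import { embed } from "./embeddings.js"; // hypothetical wrapper around the OpenAI embeddings API

interface EquipmentFilters {
  hp_min?: number; hp_max?: number;
  cfm_min?: number; cfm_max?: number;
  psi_min?: number; psi_max?: number;
}

// Filter-then-rank: structured predicates prune first, vector distance ranks the survivors.
export async function searchEquipment(
  pool: Pool,
  query: string,
  filters: EquipmentFilters,
  topK = 5,
) {
  const queryEmbedding = await embed(query); // number[]
  const params: unknown[] = [JSON.stringify(queryEmbedding), topK];
  const where: string[] = [];

  const addFilter = (column: string, op: string, value?: number) => {
    if (value === undefined) return;
    params.push(value);
    where.push(`${column} ${op} $${params.length}`);
  };
  addFilter("hp", ">=", filters.hp_min);
  addFilter("hp", "<=", filters.hp_max);
  addFilter("cfm", ">=", filters.cfm_min);
  addFilter("cfm", "<=", filters.cfm_max);
  addFilter("psi", ">=", filters.psi_min);
  addFilter("psi", "<=", filters.psi_max);

  const sql = `
    SELECT *, 1 - (embedding <=> $1::vector) AS score
    FROM products
    ${where.length ? `WHERE ${where.join(" AND ")}` : ""}
    ORDER BY embedding <=> $1::vector
    LIMIT $2`;
  const { rows } = await pool.query(sql, params);
  return rows;
}
```

When no filters are provided the WHERE clause is simply omitted, which is the rank-only path above.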
If hybrid underperforms (golden test fails): add BM25 full-text search via Postgres tsvector and merge scores with RRF (reciprocal rank fusion). Deferred until we see where the gaps actually are.
Template Reference
The 4M Industries template is compressor-specific (CFM/PSI/cooling). Brehob also quotes vacuums, dryers, blowers — each has a different natural spec set. Rather than one monolithic template, split into an outer skeleton plus per-category partials.
Outer skeleton — templates/brehob-quote.md:
- Header (date, proposal number, company info, attention, salutation)
- Intro paragraph ("Brehob Corporation is pleased to have the opportunity…")
- Per-line-item placeholder — CC inserts one characteristics block per line item, selected by that line's extracted category
- Installation (conditional, when present)
- Exclusions, Totals, Terms (Net 30, 30-day validity, FOB Factory)
- Capability pitch, signature, footer (five Brehob offices)
Per-category characteristics partials — templates/characteristics/<category>.md:
- `compressor.md` — Manufacturer, Series/Model, Cooling, Pressure (PSI), Capacity (CFM), Electric Motor (HP/RPM/ODP/SF), Voltage, Drive System, Dimensions, Weight
- Additional categories (vacuum, dryer, blower, etc.) ship only when a category appears in the ingested subset. The `LineItemSchema.category` field in `ingestion/extractor/schemas.ts` drives the lookup — same taxonomy selects the partial AND (post-demo) the optional category filter on `search_line_items`.
Demo scope — compressor only unless the ingested subset already contains another category that we want in the demo flow. Spot-check the DB during S7 (SELECT DISTINCT category FROM quote_line_items … once the markdown is extracted, or eyeball the line-item markdown) to decide whether to ship a second partial (Henry Ford is noted as "vacuum" in the session handoff — confirm and either add vacuum.md or leave Henry Ford out of the demo golden path).
Compressor characteristics block (first cut — extract verbatim shape from 4M reference during S7):
[CATEGORY IN CAPS]
CHARACTERISTICS
Manufacturer: [value]
Series / Model: [value]
Cooling: [Air/Water]
Pressure: [PSI]
Capacity CFM: [value]
Electric Motor: [HP, RPM, ODP, SF]
Voltage: [value]
Drive System: [value]
Dimensions: [L x W x H]
Weight: [lbs]
[Warranty block — manufacturer-specific]
Model [X] as described above Net $[PRICE] Each
Delivery: [estimated timeframe]
CC reads the outer skeleton plus the matching characteristics partial(s) when assembling a draft. This is the "house template" contract.
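Purely as an illustration of the taxonomy-driven lookup (not part of either MCP server; CC itself reads these files), the partial selection amounts to a category-to-filename mapping. The slug mapping below is hypothetical:

```typescript
// Hypothetical sketch: map a LineItemSchema.category value to its characteristics partial.
import path from "node:path";

const TEMPLATE_ROOT = "templates";

function partialForCategory(category: string): string {
  // e.g. "ROTARY SCREW AIR COMPRESSOR" -> compressor.md (mapping is illustrative)
  const slug = /COMPRESSOR/i.test(category)
    ? "compressor"
    : category.toLowerCase().replace(/\s+/g, "-");
  return path.join(TEMPLATE_ROOT, "characteristics", `${slug}.md`);
}

// partialForCategory("ROTARY SCREW AIR COMPRESSOR") === "templates/characteristics/compressor.md"
```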
Data Model Changes
None — all queries are read-only against E1's data.
API / Interface Changes
Each MCP server:
- Exports via `stdio` transport (simplest; CC spawns as child process)
- Uses `@modelcontextprotocol/sdk` with standard tool registration
- Loads `DATABASE_URL` and `OPENAI_API_KEY` from `app/.env.local` at startup — same source of truth as the ingestion CLI and Next.js app. Avoids re-specifying secrets in the CC config and avoids depending on CC's shell env.
- Ships as an npm workspace package under `mcp-servers/*` (local dev only — not published)
Env-loading pattern matches the ingestion CLI (which already solved this): use dotenv preload with DOTENV_CONFIG_PATH pointing at app/.env.local. CC config references the compiled entrypoint plus the dotenv preloader; the path to .env.local is absolute (CC's mcp.json does not interpolate workspace variables).
CC config example (replace <repo-root> with the absolute path to ~/Documents/Code/quoteai):
{
"mcpServers": {
"quoteai-equipment": {
"command": "node",
"args": ["-r", "dotenv/config", "<repo-root>/mcp-servers/equipment/dist/index.js"],
"env": { "DOTENV_CONFIG_PATH": "<repo-root>/app/.env.local" }
},
"quoteai-quotes": {
"command": "node",
"args": ["-r", "dotenv/config", "<repo-root>/mcp-servers/quotes/dist/index.js"],
"env": { "DOTENV_CONFIG_PATH": "<repo-root>/app/.env.local" }
}
}
}
S8 ships a short README snippet walking through the absolute-path substitution so first-time demo setup isn't a scavenger hunt.
Edge Cases & Gotchas
| Scenario | Expected Behavior | Why It's Tricky |
|---|---|---|
| Query with no matches | Return empty array, not error | CC should be able to distinguish "no results" from "tool broken" |
| Query exceeds top_k available rows | Return whatever exists; don't pad | Small DB (curated subset) will hit this often |
| Concurrent queries from CC | Both servers must handle via connection pool | CC may fan out queries in parallel |
| Embedding API outage | search_* tools return error with clear message | CC should degrade gracefully — fall back to keyword match if possible (post-MVP) |
| Model field mismatch in get_product | Return 404-style error, not empty result | CC needs to know the model doesn't exist vs. server issue |
| Very long description blocks (> 2k tokens) | Return as-is; let CC decide how to truncate | Some 4M-style quotes have dense installation blocks |
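To make the "error vs. empty result" distinction concrete, a hedged sketch of how get_product could signal "model not found" as an explicit tool error rather than an empty payload. The isError flag is the standard MCP tool-result field; `getProductByModel` is a hypothetical helper:

```typescript
// Sketch only; getProductByModel is an assumed exact-match lookup helper.
import { getProductByModel } from "./db.js";

export async function getProductTool({ model }: { model: string }) {
  const product = await getProductByModel(model);
  if (!product) {
    // Distinguish "model doesn't exist" from "server broke": an explicit error result,
    // not an empty payload and not a thrown exception.
    return {
      isError: true,
      content: [{ type: "text" as const, text: `No product found for model "${model}"` }],
    };
  }
  return { content: [{ type: "text" as const, text: JSON.stringify(product, null, 2) }] };
}
```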
Testing Strategy
Test Layers
| Layer | Applies? | Notes |
|---|---|---|
| Unit tests | Yes | Query builders, filter composition, error handling |
| Integration tests | Yes | Spin up a test Postgres with fixture data; call each tool; verify shape + content |
| Golden retrieval test (from E1) | Yes | Same golden test — does search_line_items return expected items for the golden query? |
| E2E with CC | Yes | Configure CC to load both servers; run a demo quote flow manually; does it assemble correctly? |
Required Fixtures
| Fixture Name | What It Tests | Priority |
|---|---|---|
| `fixtures/mcp-golden-query.test.ts` | search_line_items("100HP oilless food grade") returns Groeb/4M/Powerex in top 5 | 🔴 High |
| `fixtures/mcp-get-product.test.ts` | get_product("QMB30") returns complete spec | 🔴 High |
| `fixtures/mcp-empty-results.test.ts` | Nonsense query returns [], not error | 🟡 Medium |
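A sketch of what the golden-query fixture might look like, assuming vitest and a Postgres seeded with the curated subset; the import path, `searchLineItems` signature, and hit shape are assumptions:

```typescript
import { describe, expect, it } from "vitest";
import { searchLineItems } from "../mcp-servers/quotes/src/search.js"; // hypothetical path

describe("golden retrieval", () => {
  it("surfaces the known-good line items for the golden query", async () => {
    // Assumes DATABASE_URL points at the seeded test database.
    const hits = await searchLineItems("100HP oilless food grade", 5);

    expect(hits.length).toBeGreaterThan(0);
    expect(hits.length).toBeLessThanOrEqual(5);

    // Golden expectation from the fixture table: Groeb, 4M, and Powerex appear in the top 5.
    const sources = hits.map((h) => h.source_quote.customer ?? "");
    for (const name of [/groeb/i, /4m/i, /powerex/i]) {
      expect(sources.some((s) => name.test(s))).toBe(true);
    }
  });
});
```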
Verification Rules
- Every tool has at least one integration test against seeded DB.
- Golden retrieval test runs green before demo.
- Manual E2E walkthrough before showing anyone — open CC, fill mock form, watch it call tools, inspect output against the 4M template.
Stories
Stories are split into two phases. Phase A is the vertical slice — the minimum set that answers "can CC assemble a good draft from these tools?" If Phase A fails, Phase B is noise; we'd re-scope. If Phase A passes, Phase B fills in breadth.
Phase A — Vertical slice (demo-path minimum):
| Story | Summary | Status | PR |
|---|---|---|---|
| S3 | @quoteai/quotes scaffold — MCP SDK, stdio transport, DB client, app/.env.local loader | — | — |
| S5 | @quoteai/quotes search_line_items — the critical one, vector-only | — | — |
| S7 | templates/brehob-quote.md outer skeleton + templates/characteristics/compressor.md partial (extracted from 4M reference) | — | — |
| S8 | CC config wiring (absolute paths + DOTENV_CONFIG_PATH) + manual E2E demo walkthrough against golden scenario ("100HP oilless compressor for food-grade plant") | — | — |
Phase A explicitly skips the equipment server. If the spike shows CC needs product-spec lookups to fill template fields cleanly, we promote S0 + S2 out of Phase B before finishing Phase B's other work.
Phase B — Breadth (fills in after Phase A demonstrates the loop):
| Story | Summary | Status | PR |
|---|---|---|---|
| S0 | @quoteai/equipment scaffold — MCP SDK, stdio transport, shared DB client / env loader pattern from S3 | — | — |
| S1 | @quoteai/equipment search_equipment implementation + filter-then-rank hybrid | — | — |
| S2 | @quoteai/equipment get_product + get_specs | — | — |
| S4 | @quoteai/quotes search_past_quotes | — | — |
| S6 | @quoteai/quotes get_quote | — | — |
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| No rate limiting on server-side | 🟢 Low | stdio = single client (CC) — not a risk |
| No caching on query embeddings | 🟡 Medium | Same query gets re-embedded every call. Add LRU cache in Full MVP. |
| No observability (metrics, tracing) | 🟡 Medium | Add once we're iterating on retrieval quality and want to see what CC is actually asking |
| Hybrid search ranking is naive | 🟡 Medium | May need RRF or learned-to-rank. Evaluate post-demo. |
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Retrieval quality insufficient for good drafts | Medium | 🔴 High | Golden test is the gate. Iterate on query construction + hybrid filters until it passes. |
| CC doesn't use the tools well (hallucinates instead) | Medium | 🔴 High | Test the demo flow against the golden scenario manually before showing John. Adjust tool descriptions (the description field in MCP tool registration) to nudge CC toward calling them. |
| MCP SDK version incompatibility with CC | Low | Medium | Pin @modelcontextprotocol/sdk version; test locally before demo |
| Postgres connection exhaustion under concurrent CC calls | Low | Low | Use pg-pool with a small cap; stdio = one client anyway |
| Embedding-based filtering is too loose (returns unrelated items) | Medium | 🟡 Medium | Structured filters + a similarity threshold (score > 0.7) to drop weak matches |
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-04-18 | Two independent servers (equipment + quotes) | Matches design doc; separation of concerns; smaller tool surfaces per server | One server with all tools (rejected: scope creep, messier tool list) |
| 2026-04-18 | stdio transport for Demo | Simplest; CC spawns as child process | SSE (rejected: premature for local demo) |
| 2026-04-18 | Hybrid search (vector + structured filter) as default | Pure vector misses hard constraints like HP range | Pure vector (rejected: tested mentally against golden scenario, would miss items) |
| 2026-04-18 | search_line_items is the primary retrieval tool | Matches E1's "line items are the atomic unit" decision | search_past_quotes as primary (rejected: too coarse for description language) |
| 2026-04-18 | 4M template extracted into templates/brehob-quote.md | CC needs an explicit target format | Inline in system prompt (rejected: hard to iterate, unversioned) |
| 2026-04-18 | No MCP write tools | Per design doc — MCP is retrieval only | Write tools for quote_log (rejected: Full MVP scope) |
| 2026-04-18 | Defer BM25/RRF hybrid to post-golden-test | YAGNI until the golden test shows the gap | Build BM25 upfront (rejected: premature optimization) |
| 2026-04-20 | get_descriptions scrapped | Thin wrapper over search_line_items with a different input shape; two similar tools raises the odds CC picks the wrong one | Keep as convenience (rejected: not pulling its weight) |
| 2026-04-20 | All structured numeric filters are min/max pairs on search_equipment | Consistency over cleverness — CC passes hp_min=100, hp_max=100 for exact matches, ranges when it has them | Mixed scalars/ranges (rejected: asymmetric surface); hp + internal tolerance (rejected: hides the control from CC) |
| 2026-04-20 | Filter-then-rank when any structured filter present | Hard constraints ("exactly 100HP food-grade") can't be lossy. Rank-then-filter risks dropping qualifying rows that fell outside the N nearest neighbors | Rank-then-filter with 3× buffer (rejected: only works for soft filters, which isn't the golden-scenario shape) |
| 2026-04-20 | search_line_items is vector-only for MVP | quote_line_items has no structured spec columns (confirmed in 001_init.sql); Haiku extracts specs into description text, not typed fields; product_id FK isn't populated by the loader so JOIN-based filter is blocked too | Extend schema + re-extract (rejected: too much E1 churn for MVP); add category filter (deferred as post-demo follow-up — cheap since LineItemSchema.category is already extracted, just not stored) |
| 2026-04-20 | Vertical-slice story ordering (Phase A = S3/S5/S7/S8, Phase B = the rest) | The biggest unknown is "can CC assemble a good draft?" — we answer that with the minimum viable loop before building breadth. If Phase A fails, everything else is wasted | Breadth-first (rejected: delays the feedback that most changes the design) |
| 2026-04-20 | Env loading via dotenv/config preload pointing at app/.env.local | Single source of truth with ingestion CLI + app; no secrets in ~/.claude/mcp.json; no dependence on CC's shell env | Inline env block in mcp.json (rejected: secret sprawl, painful to rotate); shell env (rejected: fragile across CC restarts) |
| 2026-04-20 | Per-category template partials (outer skeleton + characteristics/<category>.md) | Different equipment categories have genuinely different spec sets (compressor CFM/PSI vs vacuum inHg/ACFM vs dryer dewpoint); one monolithic template would either over-constrain or under-constrain. Reuses the LineItemSchema.category taxonomy | One monolithic template with CC adapting (rejected: CC ends up guessing field names across categories); Zod-per-category (rejected: overkill for MVP) |
| 2026-04-20 | Demo scope = compressor category only | Ingested subset is compressor-heavy; ship vacuum/dryer/blower partials only when those categories appear in the golden flow | Ship all categories upfront (rejected: premature for demo) |
| 2026-04-21 | dotenv imported inside src/index.ts, NOT via node -r dotenv/config preload (commit 6338ff0) | pnpm doesn't hoist dotenv to the repo root; -r resolves from cwd's node_modules so the preload phase fails when CC spawns the server from the project root. File-relative ESM resolution via in-module import reaches the package's own node_modules through pnpm's symlink tree | Hoist dotenv via shamefully-hoist (rejected: heavy-handed for one module); CWD manipulation in the launch args (rejected: fragile) |
| 2026-04-21 | Pool sets ivfflat.probes = 20 via Postgres startup -c option (commit 3c2ea17) | 001_init.sql built ivfflat with lists=20 for eventual thousands-of-rows scale, but MVP corpus is <100 rows per table. Default probes=1 can land the query vector in an empty list and return 0 candidates even when matching data exists (actually observed for search_line_items and search_past_quotes on semantically distant queries). -c at connection startup is race-free, unlike pool.on('connect') which lets the client be checked out before the async SET completes. Follow-up: drop or reduce this setting once ingestion grows past the lists count, otherwise probes=20 scans all lists and negates the index's pruning value | pool.on('connect') with SET (rejected: race condition, pg deprecation warning); per-query SET LOCAL in a txn (Agent B's first attempt; rejected: doesn't cover search_line_items + search_equipment); rebuild indexes with smaller lists (deferred: touches 001_init migration) |
| 2026-04-21 | Inline draft conventions formalized in outer template (commit 5b369c0) — emerged organically in Phase A CC walkthrough, promoted to house style | Three patterns CC invented without prompting turned out to be genuinely useful: (1) SALESPERSON REVIEW blockquote for retrieval gaps / scope notes needing scrubbing / mismatched analogs; (2) attribution notes above verbatim blocks with source customer + quote number + date + score; (3) end-of-draft retrieval summary table. Formalizing them in the template header makes the behavior robust across models (validated later in Sonnet run — same conventions appeared) | Let each model discover them (rejected: Sonnet might not; comparison showed model-variant deltas that templates resolve better than prompt engineering) |
| 2026-04-21 | Added lubrication enum to products + search_equipment filter (commit 7f4ae10) | Phase B walkthrough exposed that Q$ync 100 (oil-flooded) matched "100HP oilless food-grade" at the top of the catalog because the ingested text has no explicit lubrication signal. Structured filter enforces a hard constraint CC can't bypass. Values: oilless / oil-free / oil-flooded / null. Products with NULL lubrication are EXCLUDED from filtered results — "unknown" doesn't fake a match. Extractor prompt bumped to v2 to populate on future ingest; existing 3 rows backfilled via UPDATE (all confirmed oil-flooded by manual inspection of their spec sheets). | Re-rank penalty based on query terms (rejected: unreliable; "oilless" as a query keyword can't differentiate in-source vs not-in-source); ignore the gap (rejected: golden-scenario blocker) |
| 2026-04-21 | Confidence bands on search hits with fixed thresholds (commit 3c79efb) | A flat score distribution across bad retrievals is indistinguishable from a flat distribution with one good match. The band label lets CC read "all my top-K are probable_miss" as a retrieval failure signal without threshold-guessing. Thresholds: ≥0.7 likely_good, 0.4–0.7 analog, <0.4 probable_miss. Calibrated from observed scores on the current corpus. Follow-up: retune as ingestion grows and scores tighten; keep the 3-tier shape stable so downstream (template Status column, eventual UI) can depend on the vocabulary. | Top-level confidence only (rejected: per-hit is strictly more informative); dynamic thresholds from current corpus percentiles (rejected: thresholds must be stable across runs for the labels to mean the same thing) |
| 2026-04-21 | Sonnet validated for the Full MVP generation role (commit fca3b7c — see demo/phase-b-model-comparison.md) | Dual-model walkthrough post-improvements: Sonnet matched Opus on lubrication filter use, confidence consumption, verbatim preservation (incl. source typos), price discipline, template order, and retrieval gap handling with SALESPERSON REVIEW blockquotes. Deltas were cosmetic (Opening block spacing, silent omission of oil/water separator on oilless systems) and both were addressed by template polish in commit 4258713. The design doc's "Assembly (Full MVP): Anthropic Sonnet via API" is no longer an assumption. | Stick with Opus-only (rejected: Full MVP economics favor Sonnet at this quality level); defer validation to Full MVP build time (rejected: cheaper to know now, commit to the API shape earlier) |
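Two of the later decisions are easiest to see in code. A hedged sketch (helper and option names are assumptions) of the pooled ivfflat.probes startup option and the fixed confidence bands:

```typescript
import { Pool } from "pg";

// ivfflat.probes=20 passed as a Postgres startup option on every pooled connection,
// which is race-free, unlike issuing SET from a pool "connect" handler.
export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 5, // small cap; stdio means a single client (CC) anyway
  options: "-c ivfflat.probes=20",
});

// Fixed confidence bands on search hits; thresholds calibrated on the current corpus.
export type ConfidenceBand = "likely_good" | "analog" | "probable_miss";

export function confidenceBand(score: number): ConfidenceBand {
  if (score >= 0.7) return "likely_good";
  if (score >= 0.4) return "analog";
  return "probable_miss";
}
```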
E2 is the bridge. Shipping means CC can call these tools, get good results, and assemble a quote that looks like the 4M template. Nothing else.
Status (as of 2026-04-21): E2 is closed. All 9 stories shipped; both phase walkthroughs passed; two follow-ups (#1 lubrication, #6 confidence) landed from Phase B feedback; Sonnet validated for Full MVP. Deferred Phase B walkthrough findings (#3 category filter, #4 markdown rendering, #5 installation scope, #7 data gap, #8 get_product utility) are captured in next.md as post-demo pickups.