E8: Live Content Pipeline — Design Doc
Status: Deployed — F4-S4 (Deferred Anvil Init) in refinement, all others shipped ✅ Created: April 1, 2026 Authors: Clay (stub), Dan (refinement) Last updated: April 3, 2026
Overview
E8 transforms Foundry from a static-build platform into a live content platform. Currently, adding or editing docs requires: push to source repo → manually trigger Foundry rebuild → wait for Docker build + deploy. E8 makes it: push to source repo → Foundry detects the change → pulls fresh content → rebuilds nav + reindexes Anvil → serves updated pages. No manual intervention.
Why This Matters
- GMPPU deployment — they can't maintain a nav.yaml or trigger manual rebuilds. Content push must Just Work™.
- Our own workflow — every time we add a doc, we have to push content, update nav.yaml, push foundry, wait for deploy. Three steps that should be one — push is the trigger, everything else is automatic.
- Foundry as a product — if we ever offer this to others, "edit your nav.yaml and rebuild Docker" is a non-starter.
Key Architecture Decision
Astro SSR migration. Currently Astro generates static HTML at Docker build time. E8 moves to server-side rendering so pages are generated from markdown at request time. This means:
- No rebuild needed when content changes
- Nav generates from directory structure automatically
- Anvil reindexes on content update
- Single webhook endpoint handles everything
Scope (High Level — Needs Refinement)
Content Lifecycle
POST /api/webhooks/content-update— receives GitHub webhook on source repo push- Webhook must verify GitHub signature (
X-Hub-Signature-256) — see Security section - Foundry pulls fresh content via git (runtime, not build time)
- Auto-generates nav tree from directory structure + frontmatter metadata
- Triggers Anvil reindex of changed/new content
- Serves updated pages immediately (SSR, no rebuild)
Astro SSR Migration
- Switch from static generation to on-demand rendering
- Markdown parsed and rendered at request time
- Content cached aggressively (invalidated by webhook)
- Static assets (CSS, JS, images) still served statically
Auto-Nav Generation
Decision (from review): Auto-nav by default. No manual/auto toggle per source in v1.
- Directory structure → nav tree (folder names or frontmatter
titleoverride) - Sort: alphabetical default,
orderfrontmatter field for custom ordering - Optional
nav.yamloverride for when someone needs fully custom ordering (escape hatch, not primary mode) - No hybrid mode in v1 — keep it simple
Anvil Live Reindexing
- On content update: use GitHub webhook payload (commit diff) to identify changed files, reindex only those
- Anvil owns content intelligence — diffing what changed between versions is a natural extension of its chunking/indexing role
- Foundry owns orchestration — webhook receiver, git pull, triggering Anvil reindex, cache invalidation
- Full reindex available via API/CLI for recovery
- Index serves search results immediately after reindex completes
Error Handling & Rollback
- Last-known-good content — if a pull breaks rendering, serve the previous version while flagging the error
- Error surface — status banner in Foundry UI: "Content update failed at [timestamp] — check source repo"
- Status endpoint —
GET /api/content/statusfor programmatic health checks - Logged details — full error details for debugging, not just "something broke"
- Users should never see a broken page
Security & Configuration
Webhook Security
- Verify GitHub webhook signatures (
X-Hub-Signature-256) to prevent spoofing - Reject requests with invalid or missing signatures
Git Credentials in Container
- Use deploy keys (read-only SSH keys) scoped to the specific source repo
- Do NOT use PATs with broad repository access
- Deploy key configured at Docker level, not stored in application code
Content Rendering Safety
- Astro sanitizes HTML by default in markdown rendering — verify this covers all XSS vectors
- Consider explicit allowlist for HTML elements in markdown if needed
- Audit: embedded
<script>tags, event handlers,javascript:URLs
Configuration Layers
Docker / Infrastructure (set at deploy time):
- Deploy key path
- Webhook secret
- Source repo URL
- Branch to track
Foundry UI Settings (runtime, no redeploy needed):
- Future: Which branches to track, nav mode preferences, reindex triggers
- Enables non-technical users (e.g., GMPPU) to configure content sources without touching Docker
Design Decisions (Resolved)
| Decision | Resolution | Rationale |
|---|---|---|
| Steps to update content | One (push) | User pushes to source repo. Everything else is automatic. |
| Multi-source repos | Single source for E8 | Eliminates concurrency complexity. GMPPU = one repo, we = one repo. Multi-source is future. |
| Nav generation | Auto by default | Directory structure drives nav. Optional nav.yaml override as escape hatch. No hybrid mode in v1. |
| Change detection | Anvil owns diffing, Foundry owns orchestration | Clean package boundary. Anvil becomes more valuable standalone — any MCP consumer can use its diff capability. |
| Error recovery | Last-known-good + error banner | Never show broken pages. Surface errors in UI and logs. |
| Git credentials | Deploy keys (read-only) | Scoped to specific repo. No broad PATs in containers. |
| Settings management | Docker for infra, UI for runtime (future) | Infra concerns at deploy time, user preferences at runtime. |
Open Questions
- SSR performance — is on-demand rendering fast enough, or do we need incremental static regeneration (ISR)?
- Approach: Build with aggressive in-memory caching, benchmark. Astro SSR with cached markdown parsing should handle our scale (< 100 pages). Only pivot to ISR if we see > 500ms TTFB.
- Caching strategy — GitHub webhook payload tells us which files changed, so we can invalidate specific routes. In-memory cache with route-level invalidation? CDN-level cache-busting? Anvil as a change detection layer for smarter invalidation is a v2 idea.
- Approach: In-memory
Map<route, rendered>with per-route invalidation from webhook payload. CDN-level is premature for our traffic. Anvil smart invalidation deferred to v2.
- Approach: In-memory
Auth mechanism for private content gating (F6-S1)RESOLVED: Shared secret token (X-Foundry-Tokenheader, checked against env var). Simple binary gate: have the token = see private content, don't = public only. No per-user identity in v1. Upgrade to session-based auth in v2 when multi-user features (user-scoped annotations, edit history) require it.
Story Breakdown
Feature 1: Astro SSR Migration (F1)
The foundation — switches Foundry from static build to server-rendered pages. Everything else builds on this.
F1-S1: SSR Adapter + Dynamic Content Loader
What: Install @astrojs/node adapter, switch output mode from static to server, replace Astro content collection with a runtime markdown loader that reads from disk.
Why: Currently [...slug].astro uses getCollection('docs') + getStaticPaths() which only works at build time. SSR needs a dynamic page that reads markdown from a content directory at request time.
Technical details:
- Add
@astrojs/nodetopackages/site - Change
astro.config.mjs:output: 'server', add node adapter - Replace
content.config.tscollection-based loading with a filesystem-based loader [...slug].astrobecomes a dynamic route: read markdown file fromcontent/dir, parse frontmatter, render with remark/rehype pipeline- Keep existing remark plugins (admonitions, mermaid) and rehype plugins (slug, autolink)
- Astro's
<Content />component won't work for runtime markdown — need to useunifiedpipeline directly (remark → rehype → HTML string) - Static assets (CSS, JS, component islands) still pre-built
Acceptance criteria:
npm run devserves pages via SSR (nogetStaticPaths)- Adding a new
.mdfile tocontent/makes it immediately accessible (no rebuild) - Existing pages render identically (remark/rehype plugins, code highlighting, mermaid)
- React islands (annotation panel, search) still hydrate correctly
Risk: Remark/rehype plugin compatibility with runtime rendering. Shiki (syntax highlighting) is async and may need special handling outside Astro's built-in pipeline. Test with a complex page (code blocks + mermaid + admonitions).
F1-S2: In-Memory Page Cache
What: Cache rendered HTML in memory, keyed by route path. Serve from cache on subsequent requests. Cache entries invalidated by content update pipeline (F3).
Technical details:
Map<string, { html: string; frontmatter: object; mtime: number }>in API or Astro middleware- On request: check cache → hit = serve immediately, miss = read file + render + cache
- Cache warm-up on startup: pre-render all pages (optional, can defer to lazy)
- Memory budget: ~100 pages × ~50KB rendered = ~5MB — trivial
- Expose
invalidateRoute(path)andinvalidateAll()for F3 webhook integration
Acceptance criteria:
- Second request to same page is < 10ms (cache hit)
- Cache miss renders page and stores result
invalidateRoute('/docs/foo')forces next request to re-render from diskGET /api/content/statusincludes cache stats (size, hit rate)
Feature 2: Auto-Nav Generation (F2)
Eliminates nav.yaml as a manual step. Directory structure becomes the nav tree.
F2-S1: Nav Tree Generator
What: Build nav tree from directory structure + frontmatter metadata. Replace nav.yaml parsing with filesystem scan.
Technical details:
- Scan
content/directory recursively - For each
.mdfile: read frontmatter fortitle,order,nav_title(optional short title) - Directory names become section labels (title-cased, hyphen → space) unless an
index.mdin the dir provides atitlein frontmatter - Sort:
orderfrontmatter field first (ascending), then alphabetical by title - Output: same data structure the sidebar component currently expects (nested
{ label, path, children }) - Cache the nav tree in memory — invalidate on content update
- Escape hatch: if
nav.yamlexists in content root, use it instead (full override, not merge) - Handle edge cases: empty directories (skip), files without frontmatter title (use filename), deeply nested (recursive, not limited to 4 levels)
Acceptance criteria:
- Nav tree matches directory structure without any
nav.yaml orderfrontmatter field controls sort order within a section- Adding/removing a file regenerates nav (after cache invalidation)
- Existing nav items render correctly in sidebar (no visual regression)
nav.yamloverride still works if present
F2-S2: Sidebar Component Refactor
What: Make sidebar component recursive (currently hardcoded to 4 levels — tech debt item). Consume nav tree from F2-S1.
Technical details:
- Current sidebar renders 4 nested levels explicitly — refactor to recursive
NavItemcomponent - Accept nav tree data as prop (from API endpoint or SSR page data)
- Expand/collapse state per section (persist in localStorage)
- Active page highlighting based on current route
- This also resolves the tech debt: "Sidebar component is 4-level hardcoded — should be recursive"
Acceptance criteria:
- Sidebar renders arbitrary depth nav trees (tested with 5+ levels)
- Visual appearance unchanged for existing 4-level nav
- Expand/collapse persists across page navigation
- Active page highlighted correctly
Feature 3: Content Pipeline + Webhook (F3)
The runtime content update mechanism — receives webhook, pulls content, orchestrates cache + nav + search invalidation.
F3-S1: Runtime Content Fetcher
What: Pull content from source repo at container startup and on-demand via internal trigger. Replaces the build-time scripts/build.sh for runtime use.
Technical details:
- On startup: clone/pull source repo to
content/directory using deploy key ContentFetcherclass with methods:pull(),getChangedFiles(oldRef, newRef),getCurrentRef()- Uses
child_process.execFile('git', ...)— not a git library (keep deps small) - Reads source repo URL + branch from env vars (
CONTENT_REPO,CONTENT_BRANCH) - Deploy key path from env var (
DEPLOY_KEY_PATH) pull()returns{ ref: string, changedFiles: string[] }— the ref is the new HEAD SHA, changed files come fromgit diff --name-only oldRef newRef- Mutex/queue: if a pull is already in progress, skip (drop duplicate webhook triggers)
Acceptance criteria:
- Container starts → clones source repo → content directory populated
pull()returns list of changed files- Concurrent
pull()calls don't race (second call skipped) - Works with deploy key SSH auth
- Falls back gracefully if repo is unreachable (logs error, serves last-known content)
F3-S2: Webhook Endpoint
What: POST /api/webhooks/content-update — receives GitHub push events, verifies signature, triggers content pull + downstream updates.
Technical details:
- Verify
X-Hub-Signature-256header usingWEBHOOK_SECRETenv var - Parse GitHub push event payload: extract
commits[].added,commits[].modified,commits[].removed - On valid webhook: trigger
ContentFetcher.pull()→ get changed files → invalidate page cache (F1-S2) → regenerate nav tree (F2-S1) → trigger Anvil reindex (F4) - Return
200 OKimmediately, run pipeline async (webhook timeout is 10s) - Reject: missing signature (401), invalid signature (403), non-push events (200 + ignored)
- Log: timestamp, commit SHA, files changed, pipeline duration
Acceptance criteria:
- Valid GitHub webhook triggers content pull + cache invalidation
- Invalid/missing signature rejected with appropriate status code
- Pipeline runs async (webhook responds within 1s)
- Changed pages serve updated content within 5s of webhook
GET /api/content/statusshows last webhook timestamp + result
F3-S3: Error Handling + Status
What: Last-known-good fallback, error banner in UI, and status endpoint for monitoring.
Technical details:
- Last-known-good: Before pulling new content, snapshot current
content/directory ref. If new content breaks rendering (any page throws), revert to snapshot ref and flag error. - Status endpoint:
GET /api/content/statusreturns{ lastUpdate: timestamp, status: 'ok' | 'error', lastError?: string, contentRef: string, cacheSize: number, cacheHitRate: number } - Error banner: React component in
DocLayout— fetches/api/content/statuson mount, shows dismissible warning banner ifstatus === 'error' - Error detection: Try rendering each changed page after pull. If any throws, that's a broken page. Revert the whole pull (simpler than partial rollback in v1).
Acceptance criteria:
- Broken content in source repo does NOT break live site
- Error banner appears when content update fails
- Status endpoint reflects current pipeline state
- Error details logged server-side for debugging
Feature 4: Anvil Enhancements (F4)
Upgrades Anvil for live content workflows. Bundles the reindex API, new MCP tools, and existing tech debt.
Current State — Anvil MCP Tools
Foundry currently exposes 2 Anvil-backed MCP tools (via Foundry's MCP server):
search_docs— semantic search across indexed contentlist_pages— list all pages with paths and access levels
The standalone Anvil MCP server (via mcporter) exposes 5 tools:
search_docs,get_page,get_section,list_pages,get_status
Gap: Foundry's MCP layer only proxies search_docs and list_pages. It doesn't expose get_page, get_section, or get_status — agents using Foundry's MCP server can search but can't read the actual doc content.
F4-S1: Delta Reindex API
What: Add reindexFiles(files: string[]) method to Anvil that reindexes only the specified files. Complements the existing index({ force }) full-reindex method.
Technical details:
- New method on the
Anvilinterface:reindexFiles(files: string[]): Promise<IndexResult> - For each file in the list:
- If file exists on disk: read content, re-chunk, re-embed, upsert (same as
indexFilein currentIndexer) - If file was deleted: remove all chunks for that file path
- If file exists on disk: read content, re-chunk, re-embed, upsert (same as
- Returns same
IndexResultshape asindex()(files_processed, chunks_added/updated/deleted, duration_ms) - Foundry calls this from the webhook pipeline (F3-S2) with the changed file list
- Existing
index()method unchanged (full reindex for recovery/CLI use) - Tech debt bundled: Fix the pre-existing search test failures (5 tests in
search.test.ts— Anvil mock issue from E2). These tests mock Anvil incorrectly; fix the mocks while we're touching the Anvil integration.
Acceptance criteria:
reindexFiles(['path/to/changed.md'])reindexes only that filereindexFiles(['path/to/deleted.md'])removes chunks for deleted file- Full
index()still works for recovery - Search returns updated content immediately after reindex
- Pre-existing search test failures fixed
F4-S2: Foundry MCP Tool Expansion + Content Enrichment ✅ SHIPPED
What: Add missing Anvil-backed tools to Foundry's MCP server (get_page, get_section, get_status, reindex) and enrich the get_page API to return actual markdown content — not just section metadata.
Why: Agents using Foundry's MCP server could search but couldn't read doc content. The GET /api/docs/:path route only returned heading/level/charCount per section. With this change, get_page returns the full readable markdown for each section, and all 4 new MCP tools are wired end-to-end.
What shipped:
- MCP tools (
get_page,get_section,get_status,reindex) were already wired in E9 (server.ts + http-client.ts + API routes) GET /api/docs/:pathenriched: sections now include acontentfield with concatenated markdown from all chunks sharing the same heading, sorted by ordinalGET /api/docs/:path/sections/:headingalready returned content via Anvil passthrough — verified- 6 docs tests passing, including content aggregation test
Acceptance criteria: ✅ All met
get_pagevia MCP returns page content readable by agentsget_sectionreturns specific section with contentget_statusreturns index health inforeindextriggers full reindex (auth required)- All 4 tools work via both MCP (mcporter) and HTTP API
- Existing
search_docsandlist_pagestools unchanged
F4-S3: File Watcher for Local Dev Auto-Reindex ✅ SHIPPED
What: Add a file watcher that auto-triggers Anvil reindex when .md files change in the content directory during local development. Eliminates the need to manually POST /api/reindex after editing docs.
Why: In production, webhooks handle content updates. But in dev, you're editing files directly. Without this, developers had to manually trigger reindex after every content change to see updated search results.
What shipped:
- New
packages/api/src/file-watcher.ts: watchesCONTENT_DIRusingfs.watch({ recursive: true }) - 500ms debounce — rapid edits batch into a single reindex call
- Prefers
reindexFiles()(delta) when available, falls back to fullindex() - Only runs when
NODE_ENV !== 'production'— production uses webhooks - Wired into
packages/api/src/index.tsafterapp.listen()(alongside background Anvil index) - 8 tests covering: reindex on change, non-.md filtering, debounce batching, dedup, delta vs full fallback, error handling
Acceptance criteria: ✅ All met
- File watcher auto-triggers Anvil reindex on
.mdfile changes in dev mode - 500ms debounce prevents rapid-fire reindexing
- Uses delta reindex (
reindexFiles) when available, falls back to fullindex() - Does NOT run in production (
NODE_ENV=production) - All tests pass
F4-S4: Deferred Anvil Initialization (Lazy Loading)
What: Start the Express API server immediately without waiting for Anvil to initialize. Anvil loads in the background and becomes available when ready. Search/doc endpoints return "indexing in progress" until initialization completes.
Why: Even with the OpenAI embedding fix (F4-S2 enhancement — dropped init from 49s to ~1s), the architecture is fragile. The entire server startup is blocked on createAnvil() completing — every route mount, middleware setup, and health check waits. If the embedding provider has a network hiccup, or we switch back to local embeddings, or the index grows large enough that re-embedding takes time, we're back to white-page cold starts. The server should never be blocked on Anvil.
Current flow (blocking):
clone content → createAnvil() [BLOCKS] → mount routes → app.listen() → ready
Target flow (non-blocking):
clone content → mount routes (anvil=null) → app.listen() → ready
↓ (background)
createAnvil() → swap in → index()
Technical details:
-
Anvil holder pattern — Replace the direct
anvilvariable with a lazy container:// packages/api/src/anvil-holder.ts class AnvilHolder { private anvil: Anvil | null = null; private initializing = false; private ready = false; get(): Anvil | null { return this.anvil; } isReady(): boolean { return this.ready; } isInitializing(): boolean { return this.initializing; } async init(docsPath: string): Promise<void> { this.initializing = true; try { this.anvil = await loadAnvil(docsPath); this.ready = true; } finally { this.initializing = false; } } } -
Route changes — Routes already have Anvil-optional fallbacks (
if (anvil) { ... } else { 503 }). Refactor to pull Anvil from the holder instead of closure variable. Whenholder.isInitializing(), return a specific response:{ "status": "initializing", "message": "Search index is loading, please retry in a few seconds" }Use HTTP 503 with
Retry-After: 5header so clients/agents know to retry. -
Health check —
/api/healthshould return200 OKeven when Anvil is initializing (the server is healthy, search just isn't ready yet). Add ananvil.statusfield:"ready","initializing", or"unavailable". -
Background init in
index.ts— Afterapp.listen(), kick offanvilHolder.init(docsPath)as a fire-and-forget promise. When it resolves, log success and trigger the background index. Same pattern as the current backgroundanvil.index()call. -
Webhook handler — Already checks
if (anvil)before reindexing. Just needs to pull from holder instead of closure. If Anvil isn't ready when a webhook fires, queue the reindex (or just skip — the background init will do a full index anyway).
What NOT to change:
- MCP server initialization (doesn't depend on Anvil — uses HTTP API)
- Static file serving, proxy setup, access map generation
- The embedding provider config (F4-S2 enhancement) — orthogonal
Acceptance criteria:
- Server responds to health checks within 3s of container start (no Anvil dependency)
- Pages load and render immediately (SSR doesn't need Anvil)
- Search/doc API returns 503 +
Retry-Afterduring Anvil init, then works normally - Health endpoint returns 200 with
anvil.status: "initializing"during load,"ready"after - No behavior change once Anvil is fully initialized
- Webhook reindex works correctly if Anvil finishes init after a webhook fires
- Tests cover: holder state transitions, route behavior during init, route behavior when ready
Estimated lift: Medium — 1-2 hours agent time. Mostly refactoring index.ts route mounting to use the holder pattern instead of closure capture. The routes themselves already handle anvil = null.
Risk: Low. The Anvil-optional paths already exist and are tested. Main risk is ensuring the holder swap is thread-safe (Node is single-threaded so this is basically free, but the async init promise needs clean error handling).
Feature 5: Deployment + Infrastructure (F5)
Updates the Docker build and Fly.io deployment for the SSR + runtime content model.
F5-S1: Dockerfile + Docker Compose Updates
What: Rebuild Dockerfile for SSR mode. Remove build-time content fetching, add runtime dependencies (git, SSH), configure deploy key volume mount.
Technical details:
- Remove Stage 1 (content-fetcher) from Dockerfile — content is pulled at runtime now
- Add to runtime stage:
git,openssh-client(for deploy key auth) - Astro builds in SSR mode (
output: 'server') — produces a Node.js server instead of static files - Entry point: Start both Astro SSR server + Express API server (or unify into single process — TBD, may be a design decision to add)
- Deploy key mounted as volume:
-v ./deploy-key:/app/.ssh/deploy_key:ro - Env vars:
CONTENT_REPO,CONTENT_BRANCH,WEBHOOK_SECRET,DEPLOY_KEY_PATH - Content directory:
/app/content/(persistent volume on Fly.io, ephemeral elsewhere) - Cold start: Container starts → pulls content → builds cache + nav → ready to serve
Acceptance criteria:
docker buildsucceeds without any source repo access- Container starts and pulls content on first boot
- Content survives container restart (Fly.io volume)
- Deploy key auth works for private repos
- Container size remains reasonable (< 500MB)
F5-S2: Fly.io Config + GitHub Webhook Setup
What: Update fly.toml for SSR, create Fly.io volume for content persistence, configure GitHub webhook on source repo.
Technical details:
- Update
fly.toml: internal port matches Astro SSR server port, health check →/api/content/status - Create Fly.io volume:
fly volumes create foundry_content --size 1 --region iad - Mount volume at
/app/content/infly.toml - Set secrets:
fly secrets set WEBHOOK_SECRET=... DEPLOY_KEY_PATH=/app/.ssh/deploy_key - Configure GitHub webhook on
csdlc-docsrepo: push events →https://foundry.fly.dev/api/webhooks/content-update - Generate + configure deploy key:
ssh-keygen→ add public key to repo deploy keys → mount private key in container - Update GitHub Actions workflow for SSR build (no
FOUNDRY_CONFIGsecret needed — config is runtime)
Acceptance criteria:
- Push to
csdlc-docstriggers webhook → content updates on live site within 30s - Content persists across deploys (volume mount)
- Health check endpoint works for Fly.io monitoring
- Deploy key has read-only access to source repo only
Dependency Graph
F1-S1 (SSR Adapter)
↓
F1-S2 (Page Cache) ← F3-S2 invalidates
↓
F2-S1 (Nav Generator) ← F3-S2 regenerates
↓
F2-S2 (Sidebar Refactor)
F4-S1 (Delta Reindex) — ✅ SHIPPED
↓
F4-S2 (MCP Tool Expansion + Content) — ✅ SHIPPED
↓
F4-S3 (File Watcher, dev only) — ✅ SHIPPED
F4-S4 (Deferred Anvil Init) — READY
F3-S1 (Content Fetcher) — depends on F1-S1 (needs content dir structure)
↓
F3-S2 (Webhook) — depends on F3-S1, F1-S2, F2-S1, F4-S1 (orchestrates all)
↓
F3-S3 (Error Handling) — depends on F3-S2
F5-S1 (Dockerfile) — depends on F1-S1 (SSR build output)
↓
F5-S2 (Fly.io Config) — depends on F5-S1, F3-S2
Parallelism opportunities:
- F1 (SSR) and F4 (Anvil) can run in parallel — different packages, no code overlap
- F2-S1 (Nav Generator) can start once F1-S1 ships (needs SSR to serve generated nav)
- F3 and F5 come last — they wire everything together
Scope: 16 stories across 6 features. 15 shipped, 1 in refinement (F4-S4).
Dependencies
- E7 (UI/UX Refinement) should ship first — E8 changes serving infrastructure, E7 changes presentation ✅
- E9 (MCP Agent Surface) should ship first — E8 extends the MCP server E9 built ✅
- Anvil library API needs
reindexFiles()method (F4-S1 — built as part of this epic)
Not In Scope
- Multi-source repos (future epic)
- Multi-user editing (Supabase migration, v0.3+)
- Real-time collaboration (WebSocket-based, future)
- Git-based editing from Foundry UI (Suggest Edit feature, v0.3+)
- Anvil changelog feature ("what changed since your last visit") — cool v2 idea, not E8
- Anvil as CDN-level cache invalidation layer — v2 optimization
- Unifying Astro SSR + Express API into single process — evaluate during F5, decide then
Feature 6: Deployment Cleanup + Access Control (F6)
Post-deployment hardening. Addresses access control gap (all docs currently served as public) and replaces deployment workarounds with proper infrastructure patterns.
F6-S1: Runtime Access Map Generation
What: Generate .access.json from foundry.config.yaml after content clone, and enforce access checks in SSR page routes — not just API routes.
Why: Both the API layer (packages/api/src/access.ts) and the SSR nav generator (packages/site/src/utils/nav.ts) already read .access.json and respect access levels. But nothing generates that file at runtime. The content fetcher clones the source repo and stops — no .access.json gets written. Additionally, [...slug].astro serves any file on disk with zero access checks. The nav correctly hides private items from anonymous users, but direct URL access bypasses that entirely.
Technical details:
- Generator function: After
ContentFetcher.pull()completes, parsefoundry.config.yamlsources and produce.access.jsonin the content directory. The format is a prefix→access map:{ "methodology/": "public", "about/": "public", "projects/": "private" } - Trigger: Runs on container startup (after initial clone) AND after every webhook-triggered pull (config could change)
- SSR access check: In
[...slug].astro, after resolving the file path, call the same prefix-matching logic thatnav.tsuses. If the page isprivateand the request isn't authenticated, return 403 or redirect to a "not authorized" page. - API already handles this:
packages/api/src/routes/search.tsandpages.tsalready filter by access level — just need to make sureloadAccessMap()is called on startup after.access.jsonis written. - Auth model for v1: Simple — if a request has a valid session/token, they see private content. No per-user granularity. "Public" = everyone, "Private" = authenticated users. Auth mechanism TBD (could be a shared secret header for GMPPU, or a proper login for v2).
Acceptance criteria:
.access.jsongenerated in content directory after clone/pull- SSR pages under
projects/return 403 (or redirect) for unauthenticated requests - SSR pages under
methodology/andabout/serve normally without auth - Nav sidebar hides private sections for unauthenticated users (already works if
.access.jsonexists) - API search/pages endpoints respect access levels (already works if
.access.jsonexists) .access.jsonregenerated on webhook content updates
Risk: Auth mechanism is the open question. For GMPPU, a shared secret header or basic auth is fine for v1. Full user auth is a v2 concern. Story should implement the access map generation + SSR gating, with a pluggable auth check that can start as a simple shared secret.
F6-S2: Replace Hardcoded Sleep with Health Check Polling
What: Replace sleep 10 in start.sh with a retry loop that polls localhost:3001/api/content/status until the API is actually ready.
Why: The sleep 10 is calibrated to Anvil's current model load time (~8s). If Anvil gets faster, we waste startup time. If it gets slower (bigger model, more content to index), the proxy starts before the API is ready and requests fail. Either direction breaks.
Technical details:
- Replace the
sleep 10inscripts/start.shwith a poll loop:echo "Waiting for API server..." for i in $(seq 1 30); do if curl -sf http://localhost:3001/api/content/status > /dev/null 2>&1; then echo "API server ready after ${i}s" break fi if [ "$i" -eq 30 ]; then echo "ERROR: API server failed to start within 30s" exit 1 fi sleep 1 done - Add
curlto Dockerfile runtimeapt-get installline - Fail hard if API never comes up (exit 1) — Fly.io restarts the machine
- Remove the
sleep 10entirely
Acceptance criteria:
- Proxy starts within 2s of API becoming healthy (no wasted time)
- If API takes 20s to load (future heavier model), proxy still waits correctly
- If API never starts, container exits with error within 30s (Fly.io restarts)
curladded to Dockerfile runtime deps- No hardcoded sleep values remain in start.sh
Risk: Minimal. curl adds ~2MB to image. Poll pattern is standard container orchestration.
F6-S3: Canonical Content Directory (Remove Symlink)
What: Establish a single canonical content path (/app/content) via CONTENT_DIR env var. Both the content fetcher and Astro SSR read from this path directly. Remove the symlink from the Dockerfile.
Why: Currently: fetcher clones to packages/site/content/ → Dockerfile creates ln -s /app/packages/site/content/docs /app/content → Astro reads from /app/content. Three places that need to agree on where content lives. If any one changes, it silently breaks. A single env var eliminates the coordination problem.
Technical details:
CONTENT_DIRenv var (default:/app/content) — the ONE path everything uses- Content fetcher (
ContentFetcherclass): clone/pull directly into$CONTENT_DIRinstead ofpackages/site/content/ - Astro SSR content loader (
[...slug].astro): read markdown from$CONTENT_DIRinstead of resolving relative toprocess.cwd() + 'content' - Nav generator (
nav.ts): scan$CONTENT_DIRfor directory tree - Dockerfile changes:
- Remove the
RUN ln -s ...symlink line RUN mkdir -p /app/contentstays- Add
ENV CONTENT_DIR=/app/content
- Remove the
- Local dev:
CONTENT_DIR=./packages/site/content/docs(or similar) in.envso dev workflow unchanged - Audit all code referencing content paths — grep for
packages/site/contentandcontent/docs
Acceptance criteria:
- No symlinks in Dockerfile
- Content fetcher clones directly to
$CONTENT_DIR - Astro SSR resolves pages from
$CONTENT_DIR - Nav generator scans
$CONTENT_DIR - Local dev still works with
CONTENT_DIRpointed at local content grep -r "packages/site/content" --include="*.ts" --include="*.mjs" --include="*.sh"returns zero hits in runtime code (build-time refs in site-builder stage are fine)- Deployed site renders all pages identically (no visual regression)
Risk: Medium — touches multiple components (fetcher, SSR loader, nav generator, Dockerfile). Needs careful grep to find all path references. Worth testing locally with Docker build before deploying.
F6-S4: Trigger Anvil Index on Startup + Cache Warm-Up
What: After content clone completes on startup, trigger a full Anvil index and optionally pre-render all pages into the SSR cache.
Why: Two issues discovered in production:
- Anvil initializes (loads embedding model) but never indexes content on startup —
createAnvil()doesn't auto-index, and nothing callsindex()after the initial clone. Result: search returns nothing until a webhook triggers reindex. - SSR page cache is cold after every machine restart (Fly.io auto-stop). Every page is a first-hit (~200-500ms render) until someone visits it. Pre-rendering on startup eliminates this.
Technical details:
- In
packages/api/src/index.ts, after the content clone block and Anvil initialization:if (anvil) { console.log('📇 Running initial Anvil index...'); await anvil.index(); console.log('✅ Anvil index complete'); } - Cache warm-up (optional, same story): After Anvil indexes, scan
CONTENT_DIRfor all.mdfiles and hit each page route internally to populate the SSR cache. This way the first real user request is always a cache hit. - The webhook
onContentUpdatedcallback already handles reindex for subsequent updates — this story only covers the initial startup gap. - Note:
reindexFiles()doesn't exist yet (F4-S1). The webhook currently calls it and getsTypeError: anvil.reindexFiles is not a function. Until F4-S1 ships, the webhook should fall back to fullindex().
Acceptance criteria:
- Anvil has >0 indexed pages immediately after startup (verify via
/api/health) - Search returns results on first request after machine wake (no prior webhook needed)
- Startup time increase is acceptable (<5s for ~30 pages)
- Webhook reindex doesn't crash (graceful fallback if
reindexFilesunavailable)
Risk: Minimal. index() on 30 pages takes ~2-3s. Adds to cold start time but eliminates broken search state.
F6 Dependency Graph
F6-S2 (Health Poll) — ✅ SHIPPED
F6-S3 (Content Dir) — ✅ SHIPPED
F6-S1 (Access Map) — in progress
F6-S4 (Anvil Startup Index) — independent, quick win