E3: TTS Playback (Native Browser Speech)

Status: Step 0 — Design Doc (Refinement Complete) Epic: Foundry v0.1 (no API dependency) Created: March 30, 2026 Updated: March 30, 2026 Authors: Dan Hannah & Clay

Overview

What Is This Epic?

Add text-to-speech audio playback to Foundry using the browser's native Web Speech API. Play buttons appear next to each section heading — click to hear that section read aloud. A "Play All" button at the doc title reads the entire document. No server, no API, no cost.

Problem Statement

Dan reviews docs while driving, in the shop, or doing CNC work. Text-only docs require a screen. TTS makes docs accessible in hands-free contexts. This should work from any device (iPad, phone, laptop) without requiring the API server to be running.

What Changed From the Original Stub

Original Scope	Revised Scope
Server-side TTS (Edge TTS / ElevenLabs)	Native browser Web Speech API
API endpoint for audio generation	Pure frontend — no server needed
Audio file caching by content hash	No files — browser generates speech on the fly
Required E2 API server	Independent — can ship before or after E2
Controlled voice selection	Device-native voices (iOS Siri voices, macOS, Chrome)

Key insight: The Web Speech API (speechSynthesis) is built into every modern browser. iOS voices are near-human quality. This eliminates the entire server-side audio pipeline — no generation, no storage, no caching, no cost. E3 becomes a pure static site feature.

Goals

Play button (▶️) next to each section heading — reads that section
"Play All" button at doc title — reads the full document section by section
Play/pause/speed controls (0.75x / 1x / 1.25x / 1.5x / 2x)
Visual indicator showing which section is currently being read
Auto-scroll to follow the currently-reading section (toggleable)
Works offline, works on static site, works everywhere

Non-Goals

Custom voice selection UI (use device default — it's good enough)
Server-side TTS generation (entire point is no server)
Audio file download/export
Background audio playback (browser limitation)
Remembering playback position across page loads (v0.3+ if wanted)
Reading non-text content (Mermaid diagrams, code blocks — skip or read code optionally)

Context

Dependencies

E1 — site scaffold, doc rendering, heading structure
No E2 dependency — this is pure frontend

Dependents

None directly. TTS is a standalone feature that enhances the reading experience.
Future: could inform E4 UX (e.g., "read this annotation aloud")

Design

Approach

A React island component (TtsPlayer.tsx) that uses the Web Speech API to read document sections aloud. The component receives section content as props (extracted from the rendered page) and manages playback state.

Architecture

┌─────────────────────────────────────────────┐
│  DocLayout.astro                            │
│                                             │
│  ┌──────────────────────────────────────┐   │
│  │  TtsControls.tsx (React island)      │   │
│  │  - Play All / Pause / Speed          │   │
│  │  - Current section indicator         │   │
│  │  - Auto-scroll toggle                │   │
│  └──────────────────────────────────────┘   │
│                                             │
│  ## Section Heading        [▶️ play button]  │
│  Section content here...                    │
│                                             │
│  ## Another Section        [▶️ play button]  │
│  More content here...                       │
│                                             │
│          ↕ Web Speech API                   │
│  ┌──────────────────────────────────────┐   │
│  │  Browser speechSynthesis engine      │   │
│  │  (native, zero cost, offline-capable)│   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

Web Speech API — What We're Working With

// Core API
const utterance = new SpeechSynthesisUtterance(text);
utterance.rate = 1.25;  // 0.1 to 10
utterance.voice = speechSynthesis.getVoices()[0]; // device default
speechSynthesis.speak(utterance);
speechSynthesis.pause();
speechSynthesis.resume();
speechSynthesis.cancel();

// Events
utterance.onstart = () => { /* highlight section */ };
utterance.onend = () => { /* advance to next section */ };
utterance.onboundary = (e) => { /* word-level tracking */ };

Browser support:

Safari (iOS/macOS): ✅ Excellent voices (Siri neural), reliable API
Chrome: ✅ Good voices, sometimes cuts out on very long utterances
Firefox: ✅ Basic support, fewer voice options
Edge: ✅ Good voices (same as Edge TTS)

Section Text Extraction

Sections are defined by heading boundaries. The component needs to:

Find all headings (h2, h3, h4) in the rendered doc
Extract text content between each heading and the next heading of same or higher level
Strip HTML tags, keep readable text
Skip or optionally include code blocks (configurable)

This extraction happens client-side from the rendered DOM — no build-time processing needed.

Playback Modes

Section Play (▶️ next to heading):

User clicks play button next to "## Refinement"
Extract text content for that section
Create SpeechSynthesisUtterance with section text
Highlight the section being read (subtle background color)
Button becomes pause (⏸️) during playback
On end: button returns to play, highlight removed

Play All (▶️ at doc title):

User clicks "Play All" at the top
Build ordered list of all sections
Play section 1 → on end → play section 2 → ... → done
Current section highlighted, auto-scroll follows (if enabled)
Global pause/resume affects current section
Skip forward/back between sections

Speed Control:

Dropdown or segmented control: 0.75x / 1x / 1.25x / 1.5x / 2x
Stored in localStorage (persists across sessions)
Applied to current and future utterances

Chrome Long-Text Workaround

Chrome has a known bug where speechSynthesis stops after ~15 seconds of continuous speech. The standard workaround:

// Split long text into sentences/chunks (~200 chars each)
// Speak sequentially with onend chaining
// This is invisible to the user — just an implementation detail

This is well-documented and every TTS library handles it. Our implementation will chunk text at sentence boundaries.

UI Components

TtsControls (floating or sticky):

Appears when any playback is active
Play/Pause button
Speed selector
Current section name
Skip prev / skip next (in Play All mode)
Auto-scroll toggle
Close/stop button

Section Play Button:

Small ▶️ icon to the right of each heading
Appears on hover (desktop) or always visible (mobile/tablet)
Becomes ⏸️ during playback of that section
Subtle, doesn't clutter the reading experience

Key Design Decisions

Pure client-side, no server. The entire TTS pipeline runs in the browser. No API calls, no audio files, no caching needed. The browser IS the TTS engine.

Device-native voices. We don't control the voice — it's whatever the user's OS provides. This is a feature, not a limitation: iOS Siri voices are excellent, and the user already chose their preferred system voice.

No E2 dependency. E3 can ship as a pure static site update. This means it can ship in parallel with or even before E2, unblocking the reading experience immediately.

Section-level granularity. Play buttons at heading level, not paragraph level. Keeps the UI clean and maps to how people think about doc structure. "Read me the Refinement section" is natural; "read me paragraph 3" is not.

Auto-scroll is opt-in. Default off — some users will read ahead while listening. Toggle saved to localStorage.

Edge Cases & Gotchas

Scenario	Expected Behavior	Why It's Tricky
Browser doesn't support Web Speech API	Play buttons hidden, graceful degradation	Feature detection: `'speechSynthesis' in window`
iOS Safari requires user gesture to start speech	First play must be from a click/tap event	Can't auto-play on page load — standard iOS restriction
Chrome 15-second timeout on long utterances	Chunk text at sentence boundaries, chain utterances	Well-known Chrome bug, standard workaround
Code blocks in sections	Skip by default, or read with "code block" prefix	Reading `const x = 42` aloud is awkward
Mermaid diagrams	Skip entirely	Can't meaningfully read a diagram
Tables	Read cell-by-cell with row/column context, or skip	Reading tables aloud is tricky — may need "table with N rows" summary
User navigates away mid-playback	Cancel speech on page navigation	`speechSynthesis.cancel()` in cleanup
Multiple play buttons clicked	Cancel current, start new	Only one utterance at a time
Very short sections (one sentence)	Still plays fine	No special handling needed
Dark mode	Play buttons and controls styled for both themes	Use CSS custom properties

Risks

Risk	Likelihood	Impact	Mitigation
Voice quality varies by device	Medium	Low	iOS/macOS are excellent. Chrome/Firefox are decent. Users can change their system voice.
Chrome long-utterance bug	High	Low	Well-known workaround (chunking). Every TTS library handles this.
Web Speech API deprecated	Very Low	High	W3C standard, shipped in all browsers. No deprecation signals.
User confusion about "no voice"	Low	Low	Show "TTS not supported" message if API missing

Testing Strategy

Test Layers

Layer	Applies?	Notes
Unit tests	Yes	Text extraction, chunking logic, section parsing
Component tests	Yes	TtsControls renders correctly, button states toggle
Integration tests	Partial	Web Speech API can be mocked for state management tests
Manual QA	Yes	Dan tests on iPad (iOS Safari), laptop (Chrome/Safari)

Verification Rules

Play button appears next to every h2/h3/h4 heading
Clicking play reads the correct section text
Play All reads all sections in order
Pause/resume works mid-section
Speed control changes reading speed immediately
Current section is visually highlighted during playback
Auto-scroll follows the current section (when enabled)
Play buttons hidden if Web Speech API not available
Works on iPad Safari (primary use case)
Works on Chrome desktop
Code blocks are skipped (or handled gracefully)
Page navigation cancels active speech

Stories

Story	Summary	Batch	Dependencies
S1	Section text extraction + Web Speech API wrapper	1	E1
S2	Section play buttons (per-heading ▶️/⏸️)	2	S1
S3	Play All + sequential section playback	2	S1
S4	TTS controls bar (speed, skip, auto-scroll)	3	S2, S3

Execution plan: S1 → [S2, S3 parallel] → S4

S1: Section Text Extraction + Web Speech API Wrapper

Summary: Build the utility layer that extracts readable text from doc sections and wraps the Web Speech API with chunking, event handling, and error management.

Acceptance Criteria:

src/utils/tts-extractor.ts:
- Takes a DOM container element, returns ordered array of sections:
```
interface TtsSection {
  id: string;           // heading element id (for scrolling)
  title: string;        // heading text
  level: number;        // 2, 3, or 4
  text: string;         // readable content (HTML stripped)
  element: HTMLElement;  // reference to heading element
}
```
- Extracts text between headings (heading N content = everything until next heading of ≤ N level)
- Strips HTML tags, keeps readable text
- Skips Mermaid diagram containers (.mermaid class)
- Skips code blocks by default (configurable: { includeCode?: boolean })
- Tables: extracts cell text in reading order with row separators
src/utils/tts-engine.ts:
- Wraps speechSynthesis API
- speak(text, options?) → starts speech
- pause() / resume() / cancel()
- setRate(rate) — 0.75 / 1 / 1.25 / 1.5 / 2
- Chunks long text at sentence boundaries (~200 chars) for Chrome workaround
- Chains chunks via onend events (seamless to caller)
- Events: onStart, onEnd, onPause, onResume, onBoundary
- Feature detection: isSupported() returns boolean
- Cleanup: cancel() on unmount
Unit tests for text extraction (mock DOM)
Unit tests for chunking logic (sentence splitting)
Feature detection works (returns false when API missing)

Boundaries:

Do NOT build UI components (S2, S3, S4)
Do NOT add to the doc layout yet
This is the utility layer only

S2: Section Play Buttons

Summary: Add a play button next to each section heading that reads that section aloud using the TTS engine.

Acceptance Criteria:

src/components/SectionPlayButton.tsx (React island):
- Small ▶️ button rendered to the right of each h2/h3/h4 heading
- Desktop: visible on hover over the heading area
- Mobile/tablet: always visible
- Click toggles between play (▶️) and pause (⏸️)
- Clicking play on Section A while Section B is playing: cancel B, start A
Integration into DocLayout.astro:
- Script that finds all headings and mounts SectionPlayButton next to each
- Or: Astro component wraps headings during markdown rendering
Visual feedback:
- Active section gets subtle highlight (background color via CSS custom property)
- Highlight removed when section finishes or is cancelled
Styled for both light and dark themes
Buttons hidden if tts-engine.isSupported() returns false
speechSynthesis.cancel() called on page navigation (cleanup)
Manual QA: works on iPad Safari and Chrome desktop

Boundaries:

Do NOT implement Play All or sequential playback (S3)
Do NOT add speed controls or TTS control bar (S4)
Single section at a time only

Dependencies: S1 (extraction + engine)

S3: Play All + Sequential Section Playback

Summary: Add a "Play All" button at the doc title that reads the entire document section by section, with skip forward/back capability.

Acceptance Criteria:

"Play All" button rendered next to the doc title (h1)
- Click starts sequential playback from first section
- If a section is already playing, "Play All" starts from that section forward
Sequential playback logic:
- On section end → automatically start next section
- Highlight advances to current section
- Skip forward: cancel current, start next section
- Skip back: cancel current, restart current section (double-tap skip back = previous section)
Playback state management:
- isPlaying, isPaused, currentSectionIndex, totalSections
- Shared state between Play All and individual section buttons
- If user clicks a section's ▶️ during Play All: jump to that section, continue sequential from there
Stop button: cancels all playback, resets state
Auto-scroll (foundation):
- When playing, current section heading scrolls into view
- Smooth scroll behavior
- Can be toggled (state managed, toggle UI in S4)
- Default: off (stored in localStorage key foundry-tts-autoscroll)
Works on iPad Safari and Chrome desktop

Boundaries:

Do NOT add speed controls (S4)
Do NOT add floating control bar (S4)
Skip/stop can use basic buttons near the title for now (refined in S4)

Dependencies: S1 (extraction + engine)

S4: TTS Controls Bar

Summary: Add a persistent controls bar for managing playback — speed control, skip, auto-scroll toggle. Appears when any playback is active.

Acceptance Criteria:

src/components/TtsControls.tsx (React island):
- Sticky bar at bottom of viewport (or below header)
- Appears with animation when playback starts
- Disappears when playback stops/cancelled
Controls:
- Play/Pause toggle button
- Speed selector: 0.75x / 1x / 1.25x / 1.5x / 2x (segmented buttons or dropdown)
- Skip previous / Skip next buttons (visible in Play All mode)
- Current section title display
- Progress: "Section 3 of 12" (in Play All mode)
- Auto-scroll toggle (🔒 icon or similar)
- Close/stop button (✕)
Speed preference persisted to localStorage key foundry-tts-speed
Speed change applies immediately to current playback
Styled for both light and dark themes
Responsive: works on mobile (simplified layout if needed)
Doesn't overlap with content (padding/margin adjustment when bar is visible)
Keyboard shortcuts (optional but nice):
- Space: play/pause
- Left/Right arrow: skip prev/next (when TTS bar is focused)
Manual QA: Dan approves the look and feel on iPad + desktop

Boundaries:

Do NOT add voice selection (out of scope entirely)
Do NOT add playback position memory across page loads
Keep it simple — this is v1 controls

Dependencies: S2 (section buttons), S3 (Play All + state management)

Decisions Log

Date	Decision	Rationale	Alternatives Considered
Mar 30	Web Speech API (native browser TTS)	Zero cost, works offline, no server needed, iOS voices are excellent	Edge TTS (server-side, needs API), ElevenLabs (paid, high quality but overkill)
Mar 30	No E2 dependency	Pure frontend feature, can ship on static site	Wait for API server (unnecessary delay)
Mar 30	Section-level play buttons at headings	Maps to how people think about docs. Clean UI.	Paragraph-level (too granular), page-level only (too coarse)
Mar 30	Skip code blocks by default	Reading `const x = 42` aloud is awkward	Always include (bad UX), never include (might miss important pseudocode)
Mar 30	Auto-scroll opt-in (default off)	Some users read ahead while listening	Default on (annoying if you're reading a different section), no auto-scroll (miss the context)
Mar 30	Device-native voices (no selection UI)	iOS/macOS voices are great, adds zero complexity	Voice picker (scope creep, low value for our use case)
Mar 30	Chrome chunking workaround	Well-documented bug, standard solution	Ignore (broken on Chrome), use different API (none available)

Known Issues / Tech Debt

Issue	Severity	Status	Notes
Background tab may pause speech	Low	Accept	Browser behavior — speech pauses when tab isn't focused. No workaround. iOS handles this better than desktop.
Table reading is awkward	Low	Accept	Best-effort text extraction from tables. Won't sound natural but conveys info.
No playback position memory	Low	Deferred	User has to re-start from beginning if they leave and come back. Could save `currentSectionIndex` to localStorage in v0.3+.

E3: TTS Playback (Native Browser Speech)#

Overview#

What Is This Epic?#

Problem Statement#

What Changed From the Original Stub#

Goals#

Non-Goals#

Context#

Dependencies#

Dependents#

Design#

Approach#

Architecture#

Web Speech API — What We're Working With#

Section Text Extraction#

Playback Modes#

Chrome Long-Text Workaround#

UI Components#

Key Design Decisions#

Edge Cases & Gotchas#

Risks#

Testing Strategy#

Test Layers#

Verification Rules#

Stories#

S1: Section Text Extraction + Web Speech API Wrapper#

S2: Section Play Buttons#

S3: Play All + Sequential Section Playback#

S4: TTS Controls Bar#

Decisions Log#

Known Issues / Tech Debt#

Review

E3: TTS Playback (Native Browser Speech)

Overview

What Is This Epic?

Problem Statement

What Changed From the Original Stub

Goals

Non-Goals

Context

Dependencies

Dependents

Design

Approach

Architecture

Web Speech API — What We're Working With

Section Text Extraction

Playback Modes

Chrome Long-Text Workaround

UI Components

Key Design Decisions

Edge Cases & Gotchas

Risks

Testing Strategy

Test Layers

Verification Rules

Stories

S1: Section Text Extraction + Web Speech API Wrapper

S2: Section Play Buttons

S3: Play All + Sequential Section Playback

S4: TTS Controls Bar

Decisions Log

Known Issues / Tech Debt