Doc update again

@@ -567,6 +567,10 @@ Expected benefits:

After the state model is explicit, persistence should become a background concern rather than a synchronous side effect of mutations.

Dedicated design note:

- [PHASE_6_BACKGROUND_PERSISTENCE_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_6_BACKGROUND_PERSISTENCE_DESIGN.md)

Target behavior:

- mutations update authoritative in-memory stored state

@@ -589,6 +593,10 @@ Expected benefits:

Once the render and state layers are cleaner, refactor the video backend into an explicit lifecycle model.

Dedicated design note:

- [PHASE_7_BACKEND_LIFECYCLE_PLAYOUT_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_7_BACKEND_LIFECYCLE_PLAYOUT_DESIGN.md)

Suggested states:

- uninitialized

@@ -617,6 +625,10 @@ Expected benefits:

This phase should happen after the main ownership changes so the telemetry can reflect the final architecture instead of a transient one.

Dedicated design note:

- [PHASE_8_HEALTH_TELEMETRY_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_8_HEALTH_TELEMETRY_DESIGN.md)

Recommended coverage:

- render queue depth

301 docs/PHASE_6_BACKGROUND_PERSISTENCE_DESIGN.md Normal file
@@ -0,0 +1,301 @@
# Phase 6 Design: Background Persistence

This document expands Phase 6 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.

Phases 1-5 separate durable state, coordination policy, render-facing snapshots, render-thread ownership, and live-state layering. Phase 6 should make disk persistence a background snapshot-writing concern instead of a synchronous side effect of mutations.

## Status

- Phase 6 design package: proposed.
- Phase 6 implementation: not started.
- Current alignment: `RuntimeStore` owns durable state and serialization, while `RuntimeCoordinator` already publishes explicit persistence-request outcomes for persisted mutations. The remaining issue is that actual disk writes are still synchronous store work rather than queued, debounced, atomic background writes.

Current persistence footholds:

- `RuntimeStore` owns persistent runtime-state serialization and stack preset serialization.
- `LayerStackStore` owns durable layer and parameter state.
- `RuntimeCoordinatorResult::persistenceRequested` exists as an explicit mutation outcome.
- `RuntimeEventType::RuntimePersistenceRequested` exists as the event-level persistence request.
- Phase 5 is expected to clarify which live-state mutations are durable, committed-live, or transient.

## Why Phase 6 Exists

Synchronous persistence is a poor fit for live software. A mutation that changes state should not also have to block on filesystem timing, antivirus scans, slow disks, or transient IO failures. The app needs persistence to be reliable and observable, but not timing-sensitive.

The resilience review calls this out because `SavePersistentState()`-style behavior can create unnecessary stalls and makes recovery harder to reason about.

Phase 6 should turn persistence into:

- request
- snapshot
- background write
- completion/failure observation

## Goals

Phase 6 should establish:

- a queued persistence request path
- debounced/coalesced durable-state snapshot writes
- atomic file replacement for runtime-state saves where practical
- structured completion/failure reporting
- clear separation between state mutation and disk flush
- deterministic shutdown flushing policy
- tests for coalescing, snapshot selection, write failure, and shutdown behavior without rendering or DeckLink

## Non-Goals

Phase 6 should not require:

- changing live-state layering rules
- changing DeckLink/backend lifecycle
- replacing stack preset semantics wholesale
- adding cloud sync or external storage
- building an unlimited historical state archive
- making every write async immediately if a narrow compatibility path still needs a synchronous result

## Target Model

Phase 6 should make persistence a small pipeline:

```text
RuntimeCoordinator accepts mutation
-> publishes/returns persistence request
-> PersistenceWriter captures a durable snapshot
-> background worker debounces/coalesces writes
-> atomic write commits file
-> HealthTelemetry/runtime event records success or failure
```

The key rule is:

- `RuntimeStore` owns durable state and serialization
- `PersistenceWriter` owns when and how snapshots are written
- `RuntimeCoordinator` owns whether a mutation requests persistence

## Proposed Collaborators

### `PersistenceWriter`

Owns the worker thread, queue, debounce timer, and write execution.

Responsibilities:

- accept persistence requests
- coalesce repeated runtime-state writes
- request/build a durable snapshot from `RuntimeStore`
- write to a temporary file and atomically replace the target
- report success/failure observations
- flush on shutdown according to policy

Non-responsibilities:

- deciding mutation validity
- owning durable in-memory state
- composing render snapshots
- blocking render/backend timing paths

### `PersistenceSnapshot`

Immutable write input captured from durable state.

Responsibilities:

- contain serialized runtime-state text or structured data ready to serialize
- identify target path and snapshot generation
- preserve enough metadata for completion/failure diagnostics

Non-responsibilities:

- mutation policy
- file IO

### `PersistenceRequest`

Small request object or event payload.

Expected fields:

- reason/action name
- target kind: runtime state, preset, or config if later needed
- optional debounce key
- force/flush flag for explicit save operations
- generation or sequence
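
The fields above can be sketched as a small value type with a newest-wins coalescing rule. This is an illustrative shape only: `PersistenceTarget`, the field names, and `Supersedes` are assumptions, not the project's actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical request shape mirroring the field list above.
enum class PersistenceTarget { RuntimeState, Preset, Config };

struct PersistenceRequest {
    std::string reason;                      // action name for diagnostics
    PersistenceTarget target = PersistenceTarget::RuntimeState;
    std::optional<std::string> debounceKey;  // requests sharing a key coalesce
    bool forceFlush = false;                 // explicit save bypasses debounce
    std::uint64_t generation = 0;            // newest-wins ordering
};

// Coalescing rule: a newer request for the same target and debounce key
// supersedes the older one; the worker only ever writes the newest generation.
bool Supersedes(const PersistenceRequest& newer, const PersistenceRequest& older) {
    return newer.target == older.target &&
           newer.debounceKey == older.debounceKey &&
           newer.generation > older.generation;
}
```

Keeping the generation on the request lets the worker drop stale snapshots without comparing payloads.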

## Write Policy

### Runtime State

Default policy:

- coalesce repeated requests
- debounce short bursts
- write newest snapshot
- report failures without blocking render/control paths

### Stack Presets

Preset save is a more explicit operator action than routine runtime-state persistence.

Initial policy options:

- keep preset save synchronous while runtime-state persistence becomes async
- or route preset writes through the same worker with a completion result for the caller

Conservative Phase 6 default:

- background runtime-state persistence first
- leave preset save/load synchronous unless the implementation has a clean completion path

### Shutdown

Shutdown should explicitly decide:

- flush latest pending snapshot before exit
- skip flush if no pending durable change exists
- report write failure if flush fails
- avoid indefinite hang on shutdown
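
The bounded-wait part of the shutdown decision can be sketched with a condition variable. `FlushGate` is a hypothetical helper, assuming the writer signals when its queue drains; the point is that the wait has a deadline and reports failure instead of hanging.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch of a bounded shutdown flush. Returns false if the deadline passes
// with writes still pending, so shutdown can report the failure and exit.
class FlushGate {
public:
    void AddPending(int n) {
        std::lock_guard<std::mutex> lock(mutex_);
        pendingWrites_ += n;
    }
    void NotifyDrained() {
        std::lock_guard<std::mutex> lock(mutex_);
        pendingWrites_ = 0;
        drained_.notify_all();
    }
    bool WaitForDrain(std::chrono::milliseconds deadline) {
        std::unique_lock<std::mutex> lock(mutex_);
        return drained_.wait_for(lock, deadline,
                                 [this] { return pendingWrites_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable drained_;
    int pendingWrites_ = 0;
};
```

If `WaitForDrain` returns false, shutdown proceeds anyway and publishes an unsaved-changes failure rather than blocking exit.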

## Atomicity And Failure Handling

Runtime-state writes should prefer:

1. serialize snapshot content in memory
2. write to `target.tmp`
3. flush/close file
4. replace target atomically where platform support allows
5. retain or report backup/failure context if replacement fails
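
Steps 1-5 map onto a short helper. This is a minimal sketch, not the project's writer: `std::filesystem::rename` replaces an existing regular file atomically on POSIX and performs an overwriting move on Windows, which covers step 4 on both platforms.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Write serialized content to `target.tmp`, then atomically replace `target`.
// Returns false on any IO failure so the caller can report it.
bool WriteSnapshotAtomically(const fs::path& target, const std::string& content) {
    const fs::path temp = target.string() + ".tmp";
    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(content.data(), static_cast<std::streamsize>(content.size()));
        out.flush();
        if (!out) return false;
    }  // close before replacing (step 3)
    std::error_code ec;
    fs::rename(temp, target, ec);  // step 4: atomic replace where supported
    if (ec) {
        fs::remove(temp, ec);      // step 5: clean up; caller reports failure
        return false;
    }
    return true;
}
```

A failed replace leaves the previous target file intact, which is the property that makes crashes during a write recoverable.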

Failures should not silently disappear. They should publish:

- persistence target
- reason/action
- error message
- whether a newer request is pending
- whether the app is still running with unsaved changes

## Migration Plan

### Step 1. Name Persistence Requests

Make request types and event payloads explicit enough that callers stop thinking in terms of direct disk writes.

Initial target:

- keep existing coordinator persistence decisions
- introduce a `PersistenceRequest`/`PersistenceSnapshot` shape
- document which requests are debounceable

### Step 2. Extract Snapshot Writing From `RuntimeStore`

Move file-write mechanics behind a helper while keeping serialization ownership in `RuntimeStore`.

Initial target:

- `RuntimeStore` can build serialized runtime-state snapshots
- `PersistenceWriter` writes the snapshot
- existing synchronous save path can call through the writer/helper during transition

### Step 3. Add Debounced Background Worker

Introduce a worker thread or queued task owner.

Initial target:

- repeated runtime-state requests coalesce
- worker writes only latest pending snapshot
- tests cover coalescing without filesystem where possible

### Step 4. Add Atomic Write And Failure Reporting

Make disk writes safer and observable.

Initial target:

- temp-file then replace
- failure returned/published with structured reason
- `HealthTelemetry` receives persistence warning state

### Step 5. Wire Coordinator/Event Requests To Writer

Route `RuntimePersistenceRequested` or coordinator persistence outcomes into the writer.

Initial target:

- accepted durable mutations request persistence
- transient-only mutations do not
- runtime reload/preset policies remain explicit

### Step 6. Define Shutdown Flush

Make app shutdown persistence behavior deterministic.

Initial target:

- stop accepting new requests
- flush latest pending snapshot with bounded wait
- report failure if flush fails

## Testing Strategy

Recommended tests:

- repeated persistence requests coalesce into one write
- newest snapshot wins after multiple mutations
- transient-only mutation does not request persistence
- write failure records an error and keeps unsaved state visible
- shutdown flush writes pending snapshot
- shutdown with no pending request does not write
- preset save path remains explicit
- temp-file replacement success/failure is handled

Useful homes:

- `RuntimeSubsystemTests` for coordinator persistence outcomes
- a new `PersistenceWriterTests` target for worker/coalescing/write policy
- filesystem tests using a temporary directory for atomic write behavior

## Risks

### Data Loss Risk

Debouncing introduces a window where in-memory state is newer than disk. Shutdown flush and unsaved-state telemetry are the guardrails.

### Complexity Risk

A persistence worker can become a hidden second store if it owns mutable truth. It should own snapshots and write policy only.

### Blocking Shutdown Risk

Flushing forever on shutdown is not acceptable. Use bounded waits and visible failure reporting.

### Preset Semantics Risk

Operator-triggered preset save often feels like it should complete before reporting success. Keep preset behavior explicit rather than silently changing it.

## Phase 6 Exit Criteria

Phase 6 can be considered complete once the project can say:

- [ ] durable mutations enqueue persistence instead of directly writing from mutation paths
- [ ] runtime-state writes are debounced/coalesced
- [ ] writes use temp-file/replace or equivalent atomic policy
- [ ] persistence failures are reported through structured health/events
- [ ] transient/live-only mutations do not request persistence
- [ ] shutdown flush behavior is explicit and tested
- [ ] `RuntimeStore` remains durable-state/serialization owner, not worker policy owner
- [ ] persistence behavior has focused non-render tests

## Open Questions

- Should preset save remain synchronous, or move behind a completion-based async request?
- What debounce interval is appropriate for routine runtime-state writes?
- Should failed persistence retry automatically, or wait for the next mutation/request?
- Should the app expose "unsaved changes" in the UI/health snapshot?
- Should runtime config writes share this worker, or stay separate?

## Short Version

Phase 6 should make persistence boring, safe, and off the hot path.

Mutations update in-memory durable state. Persistence requests are queued and coalesced. A background writer saves atomic snapshots and reports failures. Render, backend callbacks, and control ingress should not pay filesystem costs.

333 docs/PHASE_7_BACKEND_LIFECYCLE_PLAYOUT_DESIGN.md Normal file
@@ -0,0 +1,333 @@

# Phase 7 Design: Backend Lifecycle And Playout

This document expands Phase 7 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.

Phase 4 made the render thread the sole owner of normal runtime GL work, but output timing is still callback-coupled: DeckLink completion callbacks synchronously request render-thread output production before scheduling the next hardware frame. Phase 7 should make backend lifecycle, buffer policy, playout headroom, and recovery explicit.

## Status

- Phase 7 design package: proposed.
- Phase 7 implementation: not started.
- Current alignment: `VideoBackend`, `VideoIODevice`, `DeckLinkSession`, and `VideoPlayoutScheduler` exist. Phase 4 removed callback-thread GL ownership, but the DeckLink completion path still waits for render-thread output production.

Current backend footholds:

- `VideoBackend` wraps device discovery/configuration, start/stop, input callback handling, output completion handling, and telemetry publication.
- `DeckLinkSession` owns DeckLink device handles, frame pool creation, preroll, keyer configuration, and scheduled playback.
- `VideoPlayoutScheduler` owns basic schedule time generation and simple late/drop skip-ahead behavior.
- `OpenGLVideoIOBridge` is the current adapter between `VideoBackend` and `RenderEngine`.
- `HealthTelemetry` receives some signal, render, and pacing stats.

## Why Phase 7 Exists

The current output path works only while render/readback stays comfortably inside budget. A late render can make the callback late, which reduces device-side headroom, which makes the next callback more fragile.

The resilience review calls this the main remaining live-resilience risk after Phase 4:

- output playout is still effectively render-on-demand from the DeckLink completion callback
- buffer pool size and preroll depth are not sourced from one policy
- late/dropped recovery is a fixed skip rule
- backend lifecycle is imperative rather than represented as explicit states

Phase 7 should separate hardware timing from render production.

## Goals

Phase 7 should establish:

- explicit backend lifecycle states and allowed transitions
- one playout policy for frame pool size, preroll, headroom, and underrun behavior
- a bounded producer/consumer output queue between render and DeckLink scheduling
- lightweight DeckLink callbacks that dequeue/schedule/account rather than render
- measured recovery from late/dropped frames
- structured backend health reporting
- tests for scheduler, queue, lifecycle, and underrun policy without DeckLink hardware

## Non-Goals

Phase 7 should not require:

- a new renderer
- changing shader/state composition
- replacing DeckLink support with multiple backends
- full telemetry UI redesign
- removing every synchronous API immediately
- perfect adaptive latency policy in the first pass

## Target Timing Model

The target model is producer/consumer playout:

```text
RenderEngine/render scheduler produces completed output frames
-> bounded ready-frame queue
-> VideoBackend consumes ready frames
-> DeckLink callback schedules already-prepared frames
```

The callback should not wait for rendering. It should:

- record completion result
- recycle/release completed buffers
- dequeue a ready frame or apply underrun policy
- schedule the next frame
- publish backend timing/health observations

## Target Lifecycle Model

Suggested backend states:

1. `Uninitialized`
2. `Discovering`
3. `Discovered`
4. `Configuring`
5. `Configured`
6. `Prerolling`
7. `Running`
8. `Degraded`
9. `Stopping`
10. `Stopped`
11. `Failed`

Suggested transition rules:

- `Uninitialized -> Discovering`
- `Discovering -> Discovered | Failed`
- `Discovered -> Configuring | Stopped`
- `Configuring -> Configured | Failed`
- `Configured -> Prerolling | Stopped`
- `Prerolling -> Running | Failed | Stopping`
- `Running -> Degraded | Stopping | Failed`
- `Degraded -> Running | Stopping | Failed`
- `Stopping -> Stopped`

The exact enum can change, but the lifecycle should become observable and testable.
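
The transition table above is small enough to express as a pure function, which is what makes it testable without hardware. A minimal sketch, using the suggested state names (the real enum and helper names may differ):

```cpp
#include <cassert>

// Suggested states from this document; illustrative, not the final enum.
enum class BackendState {
    Uninitialized, Discovering, Discovered, Configuring, Configured,
    Prerolling, Running, Degraded, Stopping, Stopped, Failed
};

// Pure transition validator mirroring the suggested transition rules.
bool IsAllowedTransition(BackendState from, BackendState to) {
    using S = BackendState;
    switch (from) {
        case S::Uninitialized: return to == S::Discovering;
        case S::Discovering:   return to == S::Discovered || to == S::Failed;
        case S::Discovered:    return to == S::Configuring || to == S::Stopped;
        case S::Configuring:   return to == S::Configured || to == S::Failed;
        case S::Configured:    return to == S::Prerolling || to == S::Stopped;
        case S::Prerolling:    return to == S::Running || to == S::Failed ||
                                      to == S::Stopping;
        case S::Running:       return to == S::Degraded || to == S::Stopping ||
                                      to == S::Failed;
        case S::Degraded:      return to == S::Running || to == S::Stopping ||
                                      to == S::Failed;
        case S::Stopping:      return to == S::Stopped;
        case S::Stopped:
        case S::Failed:        return false;  // terminal until re-initialized
    }
    return false;
}
```

Because the validator takes no device handles, invalid-transition tests run without DeckLink.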

## Proposed Collaborators

### `VideoBackendStateMachine`

Pure or mostly pure lifecycle transition helper.

Responsibilities:

- validate state transitions
- produce transition observations
- track failure reasons
- keep start/stop/recovery behavior auditable

Non-responsibilities:

- DeckLink API calls
- rendering
- persistence

### `PlayoutPolicy`

Policy object for queue and timing behavior.

Expected fields:

- target preroll frames
- maximum ready frames
- minimum spare device buffers
- underrun behavior
- maximum catch-up frames
- adaptive headroom enabled/disabled

### `RenderOutputQueue`

Bounded queue or ring for completed output frames.

Responsibilities:

- accept completed render outputs
- expose ready frames for scheduling
- track depth, drops, stale reuse, and underruns
- keep ownership/lifetime clear between render and backend

### `OutputFramePool`

Backend-owned device buffer pool.

Responsibilities:

- own DeckLink mutable frames
- expose available buffers for render/readback or scheduling
- recycle completed frames
- report spare-buffer depth

### `PlayoutController`

Coordinates policy, ready frames, device schedule times, and completion accounting.

Responsibilities:

- preroll frames
- schedule next frame
- handle late/drop/completed/flushed results
- apply underrun policy
- publish timing state

## Output Queue Policy

The initial output queue should be small and bounded.

Candidate defaults:

- target ready frames: 2-3
- max ready frames: 3-5
- underrun: reuse last completed frame if available, otherwise black
- late/drop: increase degraded counters and optionally increase headroom within limits

The exact numbers should be measured, but the policy should live in one place instead of being split across constants.
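
A bounded queue with the candidate defaults above can be sketched with an `int` standing in for a frame handle. `ReadyFrameQueue` and its method names are assumptions; the real `RenderOutputQueue` would also carry buffer ownership tokens. The overflow rule here (drop oldest so latency stays bounded) and the underrun rule (reuse the newest completed frame) are the policies named in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>

// Bounded ready-frame queue with depth/drop/underrun accounting.
class ReadyFrameQueue {
public:
    explicit ReadyFrameQueue(std::size_t maxReady) : maxReady_(maxReady) {}

    void Push(int frame) {
        if (frames_.size() == maxReady_) {
            frames_.pop_front();  // drop oldest so latency stays bounded
            ++drops_;
        }
        frames_.push_back(frame);
    }

    // Underrun policy: reuse the newest completed frame when empty.
    std::optional<int> PopOrReuseLast() {
        if (frames_.empty()) {
            ++underruns_;
            return lastPopped_;  // empty optional until anything was popped
        }
        lastPopped_ = frames_.front();
        frames_.pop_front();
        return lastPopped_;
    }

    std::size_t Depth() const { return frames_.size(); }
    std::size_t Drops() const { return drops_; }
    std::size_t Underruns() const { return underruns_; }

private:
    std::deque<int> frames_;
    std::optional<int> lastPopped_;
    std::size_t maxReady_;
    std::size_t drops_ = 0;
    std::size_t underruns_ = 0;
};
```

The counters are exactly what the telemetry section later asks the backend to publish, so the queue can report them without extra bookkeeping.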

## Underrun Policy

When no fresh rendered frame is available, options are:

1. reuse newest completed frame
2. reuse last scheduled frame
3. schedule black/degraded frame
4. skip/catch up schedule time

Phase 7 should pick one default and make it visible in telemetry. Reusing the newest completed frame is often the best first policy for live visual continuity, but key/fill behavior may require careful testing.

## Migration Plan

### Step 1. Name Lifecycle States

Introduce a backend state enum and transition reporting without materially changing scheduling behavior.

Initial target:

- state changes are explicit
- invalid transitions are detectable
- tests cover allowed transitions

### Step 2. Create Playout Policy Object

Unify fixed constants and scheduler assumptions.

Initial target:

- frame pool size derives from policy
- preroll count derives from policy
- late/drop recovery reads policy

### Step 3. Add Ready Output Queue

Introduce a bounded queue for completed output frames.

Initial target:

- pure queue tests
- explicit depth/underrun metrics
- no DeckLink dependency in queue tests

### Step 4. Move Callback Toward Dequeue/Schedule

Stop producing frames directly in the completion callback path.

Transitional target:

- callback wakes/schedules a backend worker
- worker consumes ready frames

Final target:

- callback only records, recycles, dequeues, and schedules

### Step 5. Make Render Produce Ahead

Teach render/output code to keep the ready queue filled to target headroom.

Initial target:

- render thread produces on demand until queue has target depth
- callback does not synchronously wait for fresh render
- stale/black fallback is explicit on underrun

### Step 6. Replace Fixed Late/Drop Recovery

Replace fixed `+2` schedule-index recovery with measured lag/headroom accounting.

Initial target:

- track scheduled index, completed index, queue depth, late streak, drop streak
- recovery decisions use measured lag
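
One way the measured replacement for the fixed `+2` could look, under assumed bookkeeping (the scheduler knows the device's last completed index and its own next scheduled index; names are illustrative): lag is how far scheduling has fallen below minimum headroom, and the skip is capped by the playout policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Measured catch-up: skip only as many schedule indices as the measured lag
// requires, bounded by a policy maximum, instead of always skipping +2.
std::int64_t CatchUpFrames(std::int64_t nextScheduledIndex,
                           std::int64_t completedIndex,
                           std::int64_t minHeadroomFrames,
                           std::int64_t maxCatchUpFrames) {
    // Frames currently in flight on the device side.
    const std::int64_t headroom = nextScheduledIndex - completedIndex;
    const std::int64_t lag = minHeadroomFrames - headroom;
    if (lag <= 0) return 0;  // still inside budget: no skip
    return std::min(lag, maxCatchUpFrames);
}
```

A zero result when headroom meets the target is what makes the recovery self-limiting; the fixed rule kept skipping even when one frame would do.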

### Step 7. Route Backend Health Structurally

Publish backend lifecycle, queue depth, underrun, late/drop, and degraded-state observations through `HealthTelemetry`.

## Testing Strategy

Recommended tests:

- allowed lifecycle transitions pass
- invalid lifecycle transitions fail
- playout policy derives frame pool/preroll sizes consistently
- output queue preserves ordering
- bounded output queue rejects/drops according to policy
- underrun reuses last frame or black according to policy
- late/drop accounting updates degraded state
- scheduler catch-up uses measured lag, not fixed skip
- stop drains/recycles device-frame ownership in pure fakes

Useful homes:

- `VideoPlayoutSchedulerTests` for scheduler evolution
- `VideoIODeviceFakeTests` for fake backend lifecycle
- a new `VideoBackendStateMachineTests`
- a new `RenderOutputQueueTests`

## Risks

### Latency Risk

More headroom means more latency. Phase 7 should make latency a visible policy choice.

### Buffer Lifetime Risk

Render and backend will share ownership boundaries around output buffers. Frame ownership must be explicit to avoid reuse while hardware still owns a frame.

### Underrun Policy Risk

Reusing stale frames can be visually acceptable, but wrong key/fill behavior may be worse than black. Test with real output.

### Callback Thread Risk

Even after decoupling render, callback work must stay small and bounded.

### Scope Risk

Backend lifecycle and playout queue are related, but either can grow large. Implement in small, testable slices.

## Phase 7 Exit Criteria

Phase 7 can be considered complete once the project can say:

- [ ] backend lifecycle states and transitions are explicit
- [ ] playout policy owns preroll, pool size, headroom, and underrun behavior
- [ ] output callbacks no longer synchronously wait for render production
- [ ] render produces completed output frames into a bounded queue
- [ ] underrun behavior is explicit and observable
- [ ] late/drop recovery is measured rather than fixed skip-only
- [ ] backend health reports lifecycle, queue, underrun, late, and dropped state
- [ ] queue/lifecycle/scheduler behavior has non-DeckLink tests

## Open Questions

- What should the default ready-frame depth be at 30fps and 60fps?
- Should underrun reuse last completed, last scheduled, or black?
- Should output queue depth be user-configurable?
- Should render cadence be driven by backend demand, a timer, or queue-fill pressure?
- How should external keying influence stale-frame/black fallback?
- Should input and output lifecycle states be separate endpoints under one backend shell?

## Short Version

Phase 7 should stop making DeckLink callbacks wait for render.

Render produces ahead into a bounded queue. The backend consumes ready frames according to explicit lifecycle and playout policy. Queue depth, underruns, late frames, dropped frames, and degraded states become measured and visible.

367 docs/PHASE_8_HEALTH_TELEMETRY_DESIGN.md Normal file
@@ -0,0 +1,367 @@

# Phase 8 Design: Health, Telemetry, And Operational Reporting

This document expands Phase 8 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.

Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.

## Status

- Phase 8 design package: proposed.
- Phase 8 implementation: not started.
- Current alignment: `HealthTelemetry` exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.

Current telemetry footholds:

- `HealthTelemetry` owns basic signal, performance, frame pacing, and video IO status reporting.
- `RuntimeEventDispatcher` publishes typed observations such as timing samples and backend state changes.
- `RuntimeStatePresenter` includes some health/performance fields in runtime-state output.
- Render and backend paths already collect some timing and late/drop counts.

## Why Phase 8 Exists

The app can detect many problems, but operational visibility is still fragmented:

- some failures show modal dialogs
- some warnings go only to `OutputDebugStringA`
- some timing lives in health telemetry
- some observations are runtime events
- UI-facing state combines operational state with runtime state
- repeated warnings are not uniformly deduplicated, classified, or summarized

Live software needs to answer:

- what is healthy right now?
- what is degraded but still running?
- what recently failed?
- which subsystem is under timing pressure?
- what should an operator see versus what should an engineer debug?

## Goals

Phase 8 should establish:

- structured log entries with subsystem, severity, category, timestamp, and message
- subsystem-scoped health states
- bounded recent warning/error history
- timing samples, counters, and gauges for render/control/backend/persistence
- stable health snapshots for UI/diagnostics
- direct debug-output paths wrapped by structured telemetry
- low-overhead reporting from render and callback paths
- tests for severity, deduplication, counters, snapshots, and bounded retention

## Non-Goals

Phase 8 should not require:

- a cloud telemetry service
- an external metrics database
- a full UI redesign
- automatic recovery policy owned by telemetry
- unbounded logs or time-series storage
- replacing every `MessageBoxA` on day one

Telemetry observes and reports. It does not become the control plane.

## Target Model

Suggested core model:

- `TelemetrySubsystem`
- `TelemetrySeverity`
- `TelemetryLogEntry`
- `TelemetryWarningRecord`
- `TelemetryCounter`
- `TelemetryGauge`
- `TelemetryTimingSample`
- `SubsystemHealthState`
- `HealthSnapshot`

Important distinction:

- raw observations are append/update operations
- health snapshots are derived read models
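
The raw-observation side of this model, bounded retention plus warning deduplication, can be sketched in a few lines. Field and method names follow this document's suggested types but are assumptions; real records would add timestamps, categories, and a subsystem enum.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <utility>

enum class TelemetrySeverity { Info, Warning, Error };

// Minimal structured entry; the design asks for subsystem, severity,
// category, timestamp, and message.
struct TelemetryLogEntry {
    std::string subsystem;
    TelemetrySeverity severity;
    std::string message;
};

class TelemetryLog {
public:
    explicit TelemetryLog(std::size_t maxEntries) : maxEntries_(maxEntries) {}

    // Bounded retention: the oldest entry is evicted, never unbounded growth.
    void AppendLog(TelemetryLogEntry entry) {
        if (entries_.size() == maxEntries_) entries_.pop_front();
        entries_.push_back(std::move(entry));
    }

    // Deduplication: repeated warnings with the same key bump a count
    // instead of adding a new record.
    void RaiseWarning(const std::string& key) { ++warningCounts_[key]; }

    std::size_t LogCount() const { return entries_.size(); }
    std::size_t WarningRepeatCount(const std::string& key) const {
        auto it = warningCounts_.find(key);
        return it == warningCounts_.end() ? 0 : it->second;
    }

private:
    std::deque<TelemetryLogEntry> entries_;
    std::map<std::string, std::size_t> warningCounts_;
    std::size_t maxEntries_;
};
```

Append and raise are cheap constant-ish operations, which is what lets hot-path producers record and return.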

## Health Domains

At minimum:

- `ApplicationShell`
- `RuntimeStore`
- `RuntimeCoordinator`
- `RuntimeSnapshotProvider`
- `ControlServices`
- `RenderEngine`
- `VideoBackend`
- `Persistence`

Suggested states:

- `Healthy`
- `Warning`
- `Degraded`
- `Error`
- `Unavailable`

The overall app health should be derived from subsystem states.
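
One plausible derivation rule, an assumption rather than anything this document specifies, is worst-of: overall health is the worst reported subsystem state, relying on the enum being ordered from healthiest to least healthy.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>

// Ordered healthiest-first so comparison order matches severity order.
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Overall health = worst subsystem state (hypothetical derivation rule).
SubsystemHealthState DeriveOverall(std::initializer_list<SubsystemHealthState> states) {
    SubsystemHealthState worst = SubsystemHealthState::Healthy;
    for (auto s : states) worst = std::max(worst, s);
    return worst;
}
```

A worst-of rule keeps the headline status honest: one degraded subsystem cannot be averaged away by seven healthy ones.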
|
||||
|
||||
## Proposed Interfaces
|
||||
|
||||
### Write Interface
|
||||
|
||||
Target operations:
|
||||
|
||||
- `AppendLog(...)`
|
||||
- `RaiseWarning(...)`
|
||||
- `ClearWarning(...)`
|
||||
- `RecordCounterDelta(...)`
|
||||
- `RecordGauge(...)`
|
||||
- `RecordTimingSample(...)`
|
||||
- `ReportSubsystemState(...)`
|
||||
|
||||
Hot-path producers should be able to record observations cheaply and return.

### Read Interface

Target operations:

- `BuildHealthSnapshot()`
- `GetSubsystemHealth(...)`
- `GetActiveWarnings()`
- `GetRecentLogs(...)`
- `GetTimingSummary(...)`

UI/control services should consume snapshots, not scrape subsystem internals.
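
A minimal in-memory sketch of the write/read split, assuming C++; everything beyond the operation names listed above (locking choice, container types, the `BuildCounterSnapshot` helper) is illustrative:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Producers call cheap write operations and return; readers pull a
// derived, copied-out view instead of scraping internals.
class HealthTelemetry {
public:
    void RecordCounterDelta(const std::string& name, std::int64_t delta) {
        std::lock_guard<std::mutex> lock(mutex_);
        counters_[name] += delta;  // accumulate, no formatting on the hot path
    }

    void RecordGauge(const std::string& name, double value) {
        std::lock_guard<std::mutex> lock(mutex_);
        gauges_[name] = value;  // gauges overwrite, counters accumulate
    }

    // Read side: return a stable copy, never a reference into live storage.
    std::map<std::string, std::int64_t> BuildCounterSnapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return counters_;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, std::int64_t> counters_;
    std::map<std::string, double> gauges_;
};
```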

## Producer Expectations

### `RenderEngine`

Expected observations:

- render frame duration
- input upload duration/count/drop/coalescing
- output request latency
- readback duration
- synchronous readback fallback count
- preview present cost/skips
- wrong-thread diagnostics
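
Duration observations like these can be captured with a small RAII scope on the hot path; this is an illustrative sketch (the `TimingScope` name and callback sink are assumptions, not the project's API):

```cpp
#include <chrono>
#include <functional>
#include <string>

// Records one named timing sample when the scope ends. The sink callback
// stands in for a real telemetry write interface.
class TimingScope {
public:
    TimingScope(std::string name, std::function<void(const std::string&, double)> sink)
        : name_(std::move(name)),
          sink_(std::move(sink)),
          start_(std::chrono::steady_clock::now()) {}

    ~TimingScope() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        const double ms = std::chrono::duration<double, std::milli>(elapsed).count();
        if (sink_) sink_(name_, ms);  // one cheap write, then back to the hot path
    }

private:
    std::string name_;
    std::function<void(const std::string&, double)> sink_;
    std::chrono::steady_clock::time_point start_;
};
```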

### `VideoBackend`

Expected observations:

- lifecycle state
- playout queue depth
- output underruns
- late/dropped/flushed/completed counts
- input signal state
- output model/mode status
- spare buffer depth

### `ControlServices`

Expected observations:

- OSC decode errors
- control request failures
- websocket broadcast failures
- ingress queue depth
- file-watch/reload events
- service start/stop state

### `RuntimeCoordinator`

Expected observations:

- rejected mutation count and reasons
- reload requests
- preset failures
- transient-state invalidations
- persistence request publication

### `RuntimeSnapshotProvider`

Expected observations:

- snapshot publish duration
- snapshot version churn
- stale snapshot/fallback behavior
- publish failures

### `PersistenceWriter`

Expected observations:

- pending write count
- coalesced write count
- write duration
- write failures
- unsaved durable changes
- shutdown flush result

## Logging Policy

Direct string logging can remain as an output sink, but not as the source of truth.

Target flow:

```text
subsystem reports structured warning/log
-> HealthTelemetry stores bounded structured entry
-> optional debug sink prints text
-> UI/diagnostics reads health snapshot
```

Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.
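
A sketch of that deduplication, assuming a key-indexed table (the `WarningTable` name and fields are illustrative):

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

struct WarningRecord {
    std::string message;
    std::uint64_t count = 0;
    std::chrono::steady_clock::time_point lastSeen;
};

class WarningTable {
public:
    // Repeated keys bump the count and refresh last-seen instead of
    // appending a new entry, keeping storage bounded by distinct keys.
    void Raise(const std::string& key, const std::string& message) {
        auto& rec = records_[key];
        rec.message = message;
        rec.count += 1;
        rec.lastSeen = std::chrono::steady_clock::now();
    }

    std::uint64_t Count(const std::string& key) const {
        const auto it = records_.find(key);
        return it == records_.end() ? 0 : it->second.count;
    }

private:
    std::unordered_map<std::string, WarningRecord> records_;
};
```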

## Snapshot Contract

`HealthSnapshot` should answer:

- overall health
- subsystem health states
- active warnings
- recent important logs
- key counters
- key timing summaries
- degraded-state reasons

The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by `ControlServices`, but they should remain separate read models.

## Migration Plan

### Step 1. Expand Health Model Types

Add structured subsystem/severity/category types and snapshot models.

Initial target:

- keep existing health fields
- add structured warning/log/counter/gauge containers
- add tests for bounded retention and deduplication
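
Bounded retention can be as simple as a deque that drops the oldest entry past a fixed capacity; a sketch (the capacity choice and string entry type are illustrative):

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Keeps only the newest `capacity` entries; oldest are dropped on append.
class BoundedLog {
public:
    explicit BoundedLog(std::size_t capacity) : capacity_(capacity) {}

    void Append(std::string entry) {
        entries_.push_back(std::move(entry));
        if (entries_.size() > capacity_) entries_.pop_front();  // drop oldest
    }

    std::size_t Size() const { return entries_.size(); }
    const std::string& Oldest() const { return entries_.front(); }

private:
    std::size_t capacity_;
    std::deque<std::string> entries_;
};
```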

### Step 2. Wrap Direct Warning Paths

Route common direct logs through telemetry first.

Initial candidates:

- backend fallback warnings
- screenshot write failures
- OSC decode/dispatch failures
- render-thread request failures
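
The wrapping pattern might look like this sketch, where the structured warning is raised first and the legacy string sink becomes an optional output (`raiseWarning` and `debugSink` are stand-ins, not existing project functions):

```cpp
#include <functional>
#include <string>

// Structured record first: key + message, owned by telemetry.
// Legacy text sink second, as an output, not the source of truth.
void ReportBackendFallback(
    const std::function<void(const std::string&, const std::string&)>& raiseWarning,
    const std::function<void(const std::string&)>& debugSink) {
    raiseWarning("backend.fallback", "video backend fell back to default output mode");
    if (debugSink) debugSink("[warn] video backend fell back to default output mode");
}
```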

### Step 3. Add Subsystem Health States

Let subsystems report state transitions.

Initial target:

- `RenderEngine`: healthy/degraded on render-thread request failures
- `VideoBackend`: configured/running/degraded/no-input/dropping
- `ControlServices`: running/degraded/stopped
- `Persistence`: clean/pending/error

### Step 4. Split Timing Into Named Metrics

Move from broad timing fields to named samples/gauges.

Initial target:

- render duration
- readback duration/fallback count
- output request latency
- playout completion interval
- event queue depth
- persistence write duration
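
A named timing metric could keep running aggregates rather than raw samples; a sketch (whether to also keep rolling windows is listed under open questions):

```cpp
#include <algorithm>
#include <cstdint>

// Running min/max/average for one named timing metric; constant memory
// regardless of sample count.
struct TimingSummary {
    std::uint64_t count = 0;
    double totalMs = 0.0;
    double minMs = 0.0;
    double maxMs = 0.0;

    void Record(double ms) {
        if (count == 0) {
            minMs = maxMs = ms;
        } else {
            minMs = std::min(minMs, ms);
            maxMs = std::max(maxMs, ms);
        }
        count += 1;
        totalMs += ms;
    }

    double AverageMs() const { return count == 0 ? 0.0 : totalMs / count; }
};
```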

### Step 5. Publish Health Snapshot

Expose the `HealthTelemetry` snapshot through control/runtime presentation.

Initial target:

- UI can distinguish runtime state from operational health
- active warnings are visible
- recent degraded reasons are visible

### Step 6. Add Operational Tests

Cover:

- warning raise/clear
- repeated warning coalescing
- counter/gauge updates
- health derivation
- bounded log retention
- snapshot stability

## Testing Strategy

Recommended tests:

- warning raised appears in active warnings
- warning clear removes active warning but preserves history
- repeated warning increments count and updates last-seen time
- bounded log keeps newest entries
- subsystem `Error` makes overall health `Error`
- subsystem `Degraded` makes overall health degraded if no error exists
- timing sample updates summary
- counter delta accumulates
- health snapshot is read-only/stable

Useful homes:

- `HealthTelemetryTests`
- `RuntimeEventTypeTests` for observation event payloads
- future integration tests for control-service health publication

## Risks

### Telemetry Becomes Behavior

Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.

### Too Much Hot-Path Cost

Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.

### String-Only Logging

Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.

### Snapshot Bloat

Health snapshots should summarize operational state, not duplicate full runtime/project state.

### Alert Noise

Without deduplication and severity discipline, operator-facing health can become noisy and ignored.

## Phase 8 Exit Criteria

Phase 8 can be considered complete once the project can say:

- [ ] major subsystems publish structured health/telemetry observations
- [ ] active warnings and recent logs are structured and bounded
- [ ] subsystem health states roll up to an overall health state
- [ ] render/backend/control/persistence timing metrics are named and visible
- [ ] direct debug-string warning paths are wrapped or retired for major cases
- [ ] UI/control diagnostics can consume a stable health snapshot
- [ ] telemetry write paths are cheap enough for render/callback use
- [ ] telemetry behavior has focused tests

## Open Questions

- Should debug output remain enabled by default as a telemetry sink?
- How many recent logs/warnings should be retained in memory?
- Should timing summaries store raw samples, rolling windows, or both?
- Should warning thresholds be declared centrally or owned by each subsystem?
- Should health snapshots be published with runtime state or on a separate endpoint/channel?
- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?

## Short Version

Phase 8 should make the app diagnosable.

Subsystems report structured observations. `HealthTelemetry` records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.