# Phase 8 Design: Health, Telemetry, And Operational Reporting

This document expands Phase 8 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.

Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.

## Status

- Phase 8 design package: proposed.
- Phase 8 implementation: not started.
- Current alignment: `HealthTelemetry` exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.

Current telemetry footholds:

- `HealthTelemetry` owns basic signal, performance, frame pacing, and video IO status reporting.
- `RuntimeEventDispatcher` publishes typed observations such as timing samples and backend state changes.
- `RuntimeStatePresenter` includes some health/performance fields in runtime-state output.
- Render and backend paths already collect some timing and late/drop counts.

## Why Phase 8 Exists

The app can detect many problems, but operational visibility is still fragmented:

- some failures show modal dialogs
- some warnings go only to `OutputDebugStringA`
- some timing lives in health telemetry
- some observations are runtime events
- UI-facing state combines operational state with runtime state
- repeated warnings are not uniformly deduplicated, classified, or summarized

Live software needs to answer:

- what is healthy right now?
- what is degraded but still running?
- what recently failed?
- which subsystem is under timing pressure?
- what should an operator see versus what should an engineer debug?

## Goals

Phase 8 should establish:

- structured log entries with subsystem, severity, category, timestamp, and message
- subsystem-scoped health states
- bounded recent warning/error history
- timing samples, counters, and gauges for render/control/backend/persistence
- stable health snapshots for UI/diagnostics
- direct debug-output paths wrapped by structured telemetry
- low-overhead reporting from render and callback paths
- tests for severity, deduplication, counters, snapshots, and bounded retention

## Non-Goals

Phase 8 should not require:

- a cloud telemetry service
- an external metrics database
- a full UI redesign
- automatic recovery policy owned by telemetry
- unbounded logs or time-series storage
- replacing every `MessageBoxA` on day one

Telemetry observes and reports. It does not become the control plane.

## Target Model

Suggested core model:

- `TelemetrySubsystem`
- `TelemetrySeverity`
- `TelemetryLogEntry`
- `TelemetryWarningRecord`
- `TelemetryCounter`
- `TelemetryGauge`
- `TelemetryTimingSample`
- `SubsystemHealthState`
- `HealthSnapshot`

Important distinction:

- raw observations are append/update operations
- health snapshots are derived read models

## Health Domains

At minimum:

- `ApplicationShell`
- `RuntimeStore`
- `RuntimeCoordinator`
- `RuntimeSnapshotProvider`
- `ControlServices`
- `RenderEngine`
- `VideoBackend`
- `Persistence`

Suggested states:

- `Healthy`
- `Warning`
- `Degraded`
- `Error`
- `Unavailable`

The overall app health should be derived from subsystem states.

## Proposed Interfaces

### Write Interface

Target operations:

- `AppendLog(...)`
- `RaiseWarning(...)`
- `ClearWarning(...)`
- `RecordCounterDelta(...)`
- `RecordGauge(...)`
- `RecordTimingSample(...)`
- `ReportSubsystemState(...)`

Hot-path producers should be able to record observations cheaply and return.
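
As a rough illustration of how cheap writes and a derived overall health state could fit together, the sketch below stores counters and subsystem states in fixed-size arrays of atomics and computes the rollup only when a snapshot is built. Everything here is an assumption for illustration: `HealthTelemetrySketch`, `CounterId`, the `Rank` ordering (in particular where `Unavailable` sits relative to `Error`), and the method signatures are hypothetical and are not the existing `HealthTelemetry` API.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Enumerators mirror the Health Domains list; the enum layout itself is hypothetical.
enum class TelemetrySubsystem : std::size_t {
    ApplicationShell, RuntimeStore, RuntimeCoordinator, RuntimeSnapshotProvider,
    ControlServices, RenderEngine, VideoBackend, Persistence, Count
};

enum class SubsystemHealthState : std::uint8_t {
    Healthy, Warning, Degraded, Error, Unavailable
};

// Hypothetical counter ids, chosen only to keep the example small.
enum class CounterId : std::size_t { LateFrames, DroppedFrames, Count };

constexpr std::size_t kSubsystemCount = static_cast<std::size_t>(TelemetrySubsystem::Count);
constexpr std::size_t kCounterCount = static_cast<std::size_t>(CounterId::Count);

// Assumed ranking for the rollup: Error is worst, Unavailable sits just below it.
inline int Rank(SubsystemHealthState s) {
    switch (s) {
        case SubsystemHealthState::Healthy:     return 0;
        case SubsystemHealthState::Warning:     return 1;
        case SubsystemHealthState::Degraded:    return 2;
        case SubsystemHealthState::Unavailable: return 3;
        case SubsystemHealthState::Error:       return 4;
    }
    return 0;
}

// Derived read model: a plain value copy that readers can hold without locks.
struct HealthSnapshot {
    SubsystemHealthState overall = SubsystemHealthState::Healthy;
    std::array<SubsystemHealthState, kSubsystemCount> subsystems{};
    std::array<std::int64_t, kCounterCount> counters{};
};

class HealthTelemetrySketch {
public:
    HealthTelemetrySketch() {
        for (auto& s : states_) s.store(SubsystemHealthState::Healthy);
        for (auto& c : counters_) c.store(0);
    }

    // Hot-path write: a single relaxed atomic add, no locks and no string formatting.
    void RecordCounterDelta(CounterId id, std::int64_t delta) {
        const std::size_t i = static_cast<std::size_t>(id);
        assert(i < kCounterCount);
        counters_[i].fetch_add(delta, std::memory_order_relaxed);
    }

    // Hot-path write: a single relaxed atomic store of the latest reported state.
    void ReportSubsystemState(TelemetrySubsystem subsystem, SubsystemHealthState state) {
        const std::size_t i = static_cast<std::size_t>(subsystem);
        assert(i < kSubsystemCount);
        states_[i].store(state, std::memory_order_relaxed);
    }

    // Read side: copy raw observations and derive overall health as the worst-ranked state.
    HealthSnapshot BuildHealthSnapshot() const {
        HealthSnapshot snap;
        for (std::size_t i = 0; i < kSubsystemCount; ++i) {
            snap.subsystems[i] = states_[i].load(std::memory_order_relaxed);
            if (Rank(snap.subsystems[i]) > Rank(snap.overall)) {
                snap.overall = snap.subsystems[i];
            }
        }
        for (std::size_t i = 0; i < kCounterCount; ++i) {
            snap.counters[i] = counters_[i].load(std::memory_order_relaxed);
        }
        return snap;
    }

private:
    std::array<std::atomic<SubsystemHealthState>, kSubsystemCount> states_;
    std::array<std::atomic<std::int64_t>, kCounterCount> counters_;
};
```

Relaxed atomics are used here on the assumption that a snapshot only needs each value to be individually consistent, not a single point-in-time view across all fields. If the design later requires a coherent cross-field snapshot, a different mechanism would be needed.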

### Read Interface

Target operations:

- `BuildHealthSnapshot()`
- `GetSubsystemHealth(...)`
- `GetActiveWarnings()`
- `GetRecentLogs(...)`
- `GetTimingSummary(...)`

UI/control services should consume snapshots, not scrape subsystem internals.

## Producer Expectations

### `RenderEngine`

Expected observations:

- render frame duration
- input upload duration/count/drop/coalescing
- output request latency
- readback duration
- synchronous readback fallback count
- preview present cost/skips
- wrong-thread diagnostics

### `VideoBackend`

Expected observations:

- lifecycle state
- playout queue depth
- output underruns
- late/dropped/flushed/completed counts
- input signal state
- output model/mode status
- spare buffer depth

### `ControlServices`

Expected observations:

- OSC decode errors
- control request failures
- websocket broadcast failures
- ingress queue depth
- file-watch/reload events
- service start/stop state

### `RuntimeCoordinator`

Expected observations:

- rejected mutation count and reasons
- reload requests
- preset failures
- transient-state invalidations
- persistence request publication

### `RuntimeSnapshotProvider`

Expected observations:

- snapshot publish duration
- snapshot version churn
- stale snapshot/fallback behavior
- publish failures

### `PersistenceWriter`

Expected observations:

- pending write count
- coalesced write count
- write duration
- write failure
- unsaved durable changes
- shutdown flush result

## Logging Policy

Direct string logging can remain as an output sink, but not as the source of truth.

Target flow:

```text
subsystem reports structured warning/log
  -> HealthTelemetry stores bounded structured entry
  -> optional debug sink prints text
  -> UI/diagnostics reads health snapshot
```

Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.
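
The flow and deduplication rule above could be sketched roughly as follows. This is a minimal, single-threaded illustration under stated assumptions: `WarningStoreSketch`, the field names on `TelemetryLogEntry` and `TelemetryWarningRecord`, the string-typed subsystem and key, and the sink callback are all hypothetical, and a real implementation would need its own synchronization and a cap on distinct warning keys.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

enum class TelemetrySeverity { Info, Warning, Error };

// Hypothetical field layout; the document only names the fields, not their types.
struct TelemetryLogEntry {
    std::string subsystem;
    TelemetrySeverity severity = TelemetrySeverity::Info;
    std::string category;
    Clock::time_point timestamp;
    std::string message;
};

struct TelemetryWarningRecord {
    std::string key;
    std::string message;
    std::uint64_t count = 0;
    Clock::time_point firstSeen;
    Clock::time_point lastSeen;
    bool active = false;
};

class WarningStoreSketch {
public:
    using Sink = std::function<void(const TelemetryLogEntry&)>;

    explicit WarningStoreSketch(std::size_t maxLogEntries, Sink debugSink = nullptr)
        : maxLogEntries_(maxLogEntries), debugSink_(std::move(debugSink)) {
        assert(maxLogEntries_ > 0);
    }

    // Repeats of the same key update count and last-seen time instead of adding entries.
    void RaiseWarning(const std::string& subsystem, const std::string& key,
                      const std::string& message, Clock::time_point now) {
        TelemetryWarningRecord& record = warnings_[key];
        if (record.count == 0) {
            record.key = key;
            record.firstSeen = now;
        }
        const bool wasActive = record.active;
        record.message = message;
        record.lastSeen = now;
        record.active = true;
        ++record.count;
        // Only the inactive -> active transition produces a log entry.
        if (!wasActive) {
            AppendLog({subsystem, TelemetrySeverity::Warning, key, now, message});
        }
    }

    // Clearing deactivates the warning but keeps its record as history.
    void ClearWarning(const std::string& key) {
        auto it = warnings_.find(key);
        if (it != warnings_.end()) it->second.active = false;
    }

    // Bounded retention: the oldest entry is dropped once the cap is reached.
    void AppendLog(TelemetryLogEntry entry) {
        if (log_.size() == maxLogEntries_) log_.pop_front();
        if (debugSink_) debugSink_(entry);  // optional text sink, not the source of truth
        log_.push_back(std::move(entry));
    }

    std::vector<TelemetryWarningRecord> GetActiveWarnings() const {
        std::vector<TelemetryWarningRecord> active;
        for (const auto& kv : warnings_) {
            if (kv.second.active) active.push_back(kv.second);
        }
        return active;
    }

    const std::deque<TelemetryLogEntry>& GetRecentLogs() const { return log_; }

private:
    std::size_t maxLogEntries_;
    Sink debugSink_;
    std::deque<TelemetryLogEntry> log_;
    std::unordered_map<std::string, TelemetryWarningRecord> warnings_;
};
```

In this sketch the debug sink is just a callback, so a Windows build could pass a lambda that forwards `entry.message` to `OutputDebugStringA` while tests pass nothing. Whether a repeat of an already-active warning should also refresh the log is left open; here it deliberately does not, to limit noise.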

## Snapshot Contract

`HealthSnapshot` should answer:

- overall health
- subsystem health states
- active warnings
- recent important logs
- key counters
- key timing summaries
- degraded-state reasons

The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by `ControlServices`, but they should remain separate read models.

## Migration Plan

### Step 1. Expand Health Model Types

Add structured subsystem/severity/category types and snapshot models.

Initial target:

- keep existing health fields
- add structured warning/log/counter/gauge containers
- add tests for bounded retention and deduplication

### Step 2. Wrap Direct Warning Paths

Route common direct logs through telemetry first.

Initial candidates:

- backend fallback warnings
- screenshot write failures
- OSC decode/dispatch failures
- render-thread request failures

### Step 3. Add Subsystem Health States

Let subsystems report state transitions.

Initial target:

- `RenderEngine`: healthy/degraded on render-thread request failures
- `VideoBackend`: configured/running/degraded/no-input/dropping
- `ControlServices`: running/degraded/stopped
- `Persistence`: clean/pending/error

### Step 4. Split Timing Into Named Metrics

Move from broad timing fields to named samples/gauges.

Initial target:

- render duration
- readback duration/fallback count
- output request latency
- playout completion interval
- event queue depth
- persistence write duration

### Step 5. Publish Health Snapshot

Expose `HealthTelemetry` snapshot through control/runtime presentation.

Initial target:

- UI can distinguish runtime state from operational health
- active warnings are visible
- recent degraded reasons are visible

### Step 6. Add Operational Tests

Cover:

- warning raise/clear
- repeated warning coalescing
- counter/gauge updates
- health derivation
- bounded log retention
- snapshot stability

## Testing Strategy

Recommended tests:

- warning raised appears in active warnings
- warning clear removes active warning but preserves history
- repeated warning increments count and updates last-seen time
- bounded log keeps newest entries
- subsystem `Error` makes overall health `Error`
- subsystem `Degraded` makes overall health `Degraded` if no error exists
- timing sample updates summary
- counter delta accumulates
- health snapshot is read-only/stable

Useful homes:

- `HealthTelemetryTests`
- `RuntimeEventTypeTests` for observation event payloads
- future integration tests for control-service health publication

## Risks

### Telemetry Becomes Behavior

Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.

### Too Much Hot-Path Cost

Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.

### String-Only Logging

Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.

### Snapshot Bloat

Health snapshots should summarize operational state, not duplicate full runtime/project state.

### Alert Noise

Without deduplication and severity discipline, operator-facing health can become noisy and ignored.

## Phase 8 Exit Criteria

Phase 8 can be considered complete once the project can say:

- [ ] major subsystems publish structured health/telemetry observations
- [ ] active warnings and recent logs are structured and bounded
- [ ] subsystem health states roll up to an overall health state
- [ ] render/backend/control/persistence timing metrics are named and visible
- [ ] direct debug-string warning paths are wrapped or retired for major cases
- [ ] UI/control diagnostics can consume a stable health snapshot
- [ ] telemetry write paths are cheap enough for render/callback use
- [ ] telemetry behavior has focused tests

## Open Questions

- Should debug output remain enabled by default as a telemetry sink?
- How many recent logs/warnings should be retained in memory?
- Should timing summaries store raw samples, rolling windows, or both?
- Should warning thresholds be declared centrally or owned by each subsystem?
- Should health snapshots be published with runtime state or on a separate endpoint/channel?
- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?

## Short Version

Phase 8 should make the app diagnosable.

Subsystems report structured observations. `HealthTelemetry` records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.