Doc update again

2026-05-11 18:48:55 +10:00
parent 99fd903144
commit e8a3805fff
4 changed files with 1013 additions and 0 deletions
--- a/docs/PHASE_8_HEALTH_TELEMETRY_DESIGN.md
+++ b/docs/PHASE_8_HEALTH_TELEMETRY_DESIGN.md
@@ -0,0 +1,367 @@
+# Phase 8 Design: Health, Telemetry, And Operational Reporting
+
+This document expands Phase 8 of [ARCHITECTURE_RESILIENCE_REVIEW.md](ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.
+
+Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.
+
+## Status
+
+- Phase 8 design package: proposed.
+- Phase 8 implementation: not started.
+- Current alignment: `HealthTelemetry` exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.
+
+Current telemetry footholds:
+
+- `HealthTelemetry` owns basic signal, performance, frame pacing, and video IO status reporting.
+- `RuntimeEventDispatcher` publishes typed observations such as timing samples and backend state changes.
+- `RuntimeStatePresenter` includes some health/performance fields in runtime-state output.
+- Render and backend paths already collect some timing and late/drop counts.
+
+## Why Phase 8 Exists
+
+The app can detect many problems, but operational visibility is still fragmented:
+
+- some failures show modal dialogs
+- some warnings go only to `OutputDebugStringA`
+- some timing lives in health telemetry
+- some observations are runtime events
+- UI-facing state combines operational state with runtime state
+- repeated warnings are not uniformly deduplicated, classified, or summarized
+
+Live software needs to answer:
+
+- what is healthy right now?
+- what is degraded but still running?
+- what recently failed?
+- which subsystem is under timing pressure?
+- what should an operator see versus what should an engineer debug?
+
+## Goals
+
+Phase 8 should establish:
+
+- structured log entries with subsystem, severity, category, timestamp, and message
+- subsystem-scoped health states
+- bounded recent warning/error history
+- timing samples, counters, and gauges for render/control/backend/persistence
+- stable health snapshots for UI/diagnostics
+- direct debug-output paths wrapped by structured telemetry
+- low-overhead reporting from render and callback paths
+- tests for severity, deduplication, counters, snapshots, and bounded retention
+
+## Non-Goals
+
+Phase 8 should not require:
+
+- a cloud telemetry service
+- an external metrics database
+- a full UI redesign
+- automatic recovery policy owned by telemetry
+- unbounded logs or time-series storage
+- replacing every `MessageBoxA` on day one
+
+Telemetry observes and reports. It does not become the control plane.
+
+## Target Model
+
+Suggested core model:
+
+- `TelemetrySubsystem`
+- `TelemetrySeverity`
+- `TelemetryLogEntry`
+- `TelemetryWarningRecord`
+- `TelemetryCounter`
+- `TelemetryGauge`
+- `TelemetryTimingSample`
+- `SubsystemHealthState`
+- `HealthSnapshot`
+
+Important distinction:
+
+- raw observations are append/update operations
+- health snapshots are derived read models
+
+## Health Domains
+
+At minimum:
+
+- `ApplicationShell`
+- `RuntimeStore`
+- `RuntimeCoordinator`
+- `RuntimeSnapshotProvider`
+- `ControlServices`
+- `RenderEngine`
+- `VideoBackend`
+- `Persistence`
+
+Suggested states:
+
+- `Healthy`
+- `Warning`
+- `Degraded`
+- `Error`
+- `Unavailable`
+
+The overall app health should be derived from subsystem states.
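Review note: the roll-up rule is not spelled out here, but the testing strategy later in the document implies that `Error` dominates and `Degraded` wins when no error exists. A minimal sketch of a "worst state wins" derivation follows. The numeric ordering, the placement of `Unavailable` below `Error`, and the function name `DeriveOverallHealth` are assumptions of this sketch, not part of the proposed design.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical enum mirroring the suggested states.
// ASSUMPTION: higher value means worse; Unavailable ranks below Error so that
// any subsystem Error always yields an overall Error.
enum class SubsystemHealthState : std::uint8_t {
    Healthy = 0,
    Warning = 1,
    Degraded = 2,
    Unavailable = 3,
    Error = 4,
};

// Overall health is the worst state reported by any subsystem.
// ASSUMPTION: an empty input is treated as Healthy; a real implementation
// might prefer Unavailable until the first report arrives.
inline SubsystemHealthState DeriveOverallHealth(
    const std::vector<SubsystemHealthState>& states) {
    SubsystemHealthState worst = SubsystemHealthState::Healthy;
    for (SubsystemHealthState state : states) {
        worst = std::max(worst, state);  // scoped enums compare by value
    }
    return worst;
}
```

Keeping the rule as a pure function over reported states would make the roll-up easy to unit test and keeps policy out of the producers.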
+
+## Proposed Interfaces
+
+### Write Interface
+
+Target operations:
+
+- `AppendLog(...)`
+- `RaiseWarning(...)`
+- `ClearWarning(...)`
+- `RecordCounterDelta(...)`
+- `RecordGauge(...)`
+- `RecordTimingSample(...)`
+- `ReportSubsystemState(...)`
+
+Hot-path producers should be able to record observations cheaply and return.
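Review note: one way to keep hot-path writes cheap is a lock-free counter that producers bump and a slower reader samples when building a snapshot. The sketch below is illustrative only. The class body and the method names `RecordDelta` and `Read` are hypothetical; the design names `TelemetryCounter` but does not define its shape.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical hot-path counter. Producers add deltas without taking a lock;
// the snapshot builder reads the running total on its own schedule.
class TelemetryCounter {
public:
    void RecordDelta(std::uint64_t delta) noexcept {
        // Relaxed ordering is sufficient for a count that is only reported,
        // never used to synchronize other memory.
        value_.fetch_add(delta, std::memory_order_relaxed);
    }

    std::uint64_t Read() const noexcept {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};
```

A render or callback path would call `RecordDelta(1)` and return immediately; any formatting or aggregation can then happen on the reader side.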
+
+### Read Interface
+
+Target operations:
+
+- `BuildHealthSnapshot()`
+- `GetSubsystemHealth(...)`
+- `GetActiveWarnings()`
+- `GetRecentLogs(...)`
+- `GetTimingSummary(...)`
+
+UI/control services should consume snapshots, not scrape subsystem internals.
+
+## Producer Expectations
+
+### `RenderEngine`
+
+Expected observations:
+
+- render frame duration
+- input upload duration/count/drop/coalescing
+- output request latency
+- readback duration
+- synchronous readback fallback count
+- preview present cost/skips
+- wrong-thread diagnostics
+
+### `VideoBackend`
+
+Expected observations:
+
+- lifecycle state
+- playout queue depth
+- output underruns
+- late/dropped/flushed/completed counts
+- input signal state
+- output model/mode status
+- spare buffer depth
+
+### `ControlServices`
+
+Expected observations:
+
+- OSC decode errors
+- control request failures
+- websocket broadcast failures
+- ingress queue depth
+- file-watch/reload events
+- service start/stop state
+
+### `RuntimeCoordinator`
+
+Expected observations:
+
+- rejected mutation count and reasons
+- reload requests
+- preset failures
+- transient-state invalidations
+- persistence request publication
+
+### `RuntimeSnapshotProvider`
+
+Expected observations:
+
+- snapshot publish duration
+- snapshot version churn
+- stale snapshot/fallback behavior
+- publish failures
+
+### `PersistenceWriter`
+
+Expected observations:
+
+- pending write count
+- coalesced write count
+- write duration
+- write failure
+- unsaved durable changes
+- shutdown flush result
+
+## Logging Policy
+
+Direct string logging can remain as an output sink, but not as the source of truth.
+
+Target flow:
+
+```text
+subsystem reports structured warning/log
+  -> HealthTelemetry stores bounded structured entry
+  -> optional debug sink prints text
+  -> UI/diagnostics reads health snapshot
+```
+
+Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.
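Review note: deduplication by key could look roughly like the sketch below (C++17). The fields of `TelemetryWarningRecord`, the `WarningTable` class, and its method names are assumptions made for illustration. This version allocates, so it would belong on the telemetry side rather than directly in render or callback paths.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

using TelemetryClock = std::chrono::steady_clock;

// Hypothetical record shape: one entry per warning key.
struct TelemetryWarningRecord {
    std::string message;
    std::uint64_t count = 0;
    TelemetryClock::time_point firstSeen{};
    TelemetryClock::time_point lastSeen{};
};

class WarningTable {
public:
    // Returns true if this raise created a new record, false if it was
    // coalesced into an existing one.
    bool Raise(const std::string& key, const std::string& message,
               TelemetryClock::time_point now) {
        auto [it, inserted] = records_.try_emplace(key);
        TelemetryWarningRecord& record = it->second;
        if (inserted) {
            record.firstSeen = now;
        }
        record.message = message;  // keep the most recent wording
        record.lastSeen = now;
        ++record.count;
        return inserted;
    }

    // Returns nullptr when no warning with this key has been raised.
    const TelemetryWarningRecord* Find(const std::string& key) const {
        auto it = records_.find(key);
        return it == records_.end() ? nullptr : &it->second;
    }

    std::size_t Size() const { return records_.size(); }

private:
    std::unordered_map<std::string, TelemetryWarningRecord> records_;
};
```

Passing the timestamp in, rather than reading the clock inside `Raise`, keeps the repeated-warning tests deterministic.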
+
+## Snapshot Contract
+
+`HealthSnapshot` should answer:
+
+- overall health
+- subsystem health states
+- active warnings
+- recent important logs
+- key counters
+- key timing summaries
+- degraded-state reasons
+
+The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by `ControlServices`, but they should remain separate read models.
+
+## Migration Plan
+
+### Step 1. Expand Health Model Types
+
+Add structured subsystem/severity/category types and snapshot models.
+
+Initial target:
+
+- keep existing health fields
+- add structured warning/log/counter/gauge containers
+- add tests for bounded retention and deduplication
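Review note: bounded retention can be sketched as a fixed-capacity container that evicts the oldest entry on overflow. The two-field `TelemetryLogEntry`, the `BoundedLog` class, and its method names are hypothetical; the real entry would also carry severity, category, and timestamp as listed in the goals.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified entry (severity/category/timestamp omitted).
struct TelemetryLogEntry {
    std::string subsystem;
    std::string message;
};

class BoundedLog {
public:
    explicit BoundedLog(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Appends an entry, evicting the oldest one when the log is full.
    void Append(TelemetryLogEntry entry) {
        if (entries_.size() == capacity_) {
            entries_.pop_front();
            ++evicted_;
        }
        entries_.push_back(std::move(entry));
    }

    // Returns a copy of the retained entries, oldest first.
    std::vector<TelemetryLogEntry> Recent() const {
        return std::vector<TelemetryLogEntry>(entries_.begin(), entries_.end());
    }

    std::size_t EvictedCount() const { return evicted_; }

private:
    std::size_t capacity_;
    std::deque<TelemetryLogEntry> entries_;
    std::size_t evicted_ = 0;
};
```

Tracking the eviction count is optional, but it would let a snapshot report that older entries were dropped instead of hiding the loss.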
+
+### Step 2. Wrap Direct Warning Paths
+
+Route common direct logs through telemetry first.
+
+Initial candidates:
+
+- backend fallback warnings
+- screenshot write failures
+- OSC decode/dispatch failures
+- render-thread request failures
+
+### Step 3. Add Subsystem Health States
+
+Let subsystems report state transitions.
+
+Initial target:
+
+- `RenderEngine`: healthy/degraded on render-thread request failures
+- `VideoBackend`: configured/running/degraded/no-input/dropping
+- `ControlServices`: running/degraded/stopped
+- `Persistence`: clean/pending/error
+
+### Step 4. Split Timing Into Named Metrics
+
+Move from broad timing fields to named samples/gauges.
+
+Initial target:
+
+- render duration
+- readback duration/fallback count
+- output request latency
+- playout completion interval
+- event queue depth
+- persistence write duration
+
+### Step 5. Publish Health Snapshot
+
+Expose `HealthTelemetry` snapshot through control/runtime presentation.
+
+Initial target:
+
+- UI can distinguish runtime state from operational health
+- active warnings are visible
+- recent degraded reasons are visible
+
+### Step 6. Add Operational Tests
+
+Cover:
+
+- warning raise/clear
+- repeated warning coalescing
+- counter/gauge updates
+- health derivation
+- bounded log retention
+- snapshot stability
+
+## Testing Strategy
+
+Recommended tests:
+
+- warning raised appears in active warnings
+- warning clear removes active warning but preserves history
+- repeated warning increments count and updates last-seen time
+- bounded log keeps newest entries
+- subsystem `Error` makes overall health `Error`
+- subsystem `Degraded` makes overall health `Degraded` if no error exists
+- timing sample updates summary
+- counter delta accumulates
+- health snapshot is read-only/stable
+
+Useful homes:
+
+- `HealthTelemetryTests`
+- `RuntimeEventTypeTests` for observation event payloads
+- future integration tests for control-service health publication
+
+## Risks
+
+### Telemetry Becomes Behavior
+
+Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.
+
+### Too Much Hot-Path Cost
+
+Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.
+
+### String-Only Logging
+
+Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.
+
+### Snapshot Bloat
+
+Health snapshots should summarize operational state, not duplicate full runtime/project state.
+
+### Alert Noise
+
+Without deduplication and severity discipline, operator-facing health can become noisy and ignored.
+
+## Phase 8 Exit Criteria
+
+Phase 8 can be considered complete once the project can say:
+
+- [ ] major subsystems publish structured health/telemetry observations
+- [ ] active warnings and recent logs are structured and bounded
+- [ ] subsystem health states roll up to an overall health state
+- [ ] render/backend/control/persistence timing metrics are named and visible
+- [ ] direct debug-string warning paths are wrapped or retired for major cases
+- [ ] UI/control diagnostics can consume a stable health snapshot
+- [ ] telemetry write paths are cheap enough for render/callback use
+- [ ] telemetry behavior has focused tests
+
+## Open Questions
+
+- Should debug output remain enabled by default as a telemetry sink?
+- How many recent logs/warnings should be retained in memory?
+- Should timing summaries store raw samples, rolling windows, or both?
+- Should warning thresholds be declared centrally or owned by each subsystem?
+- Should health snapshots be published with runtime state or on a separate endpoint/channel?
+- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?
+
+## Short Version
+
+Phase 8 should make the app diagnosable.
+
+Subsystems report structured observations. `HealthTelemetry` records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.