video-shader-toys/docs/PHASE_8_HEALTH_TELEMETRY_DESIGN.md
Aiden e8a3805fff
2026-05-11 18:48:55 +10:00


# Phase 8 Design: Health, Telemetry, And Operational Reporting
This document expands Phase 8 of [ARCHITECTURE_RESILIENCE_REVIEW.md](ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.
Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.
## Status
- Phase 8 design package: proposed.
- Phase 8 implementation: not started.
- Current alignment: `HealthTelemetry` exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.
Current telemetry footholds:
- `HealthTelemetry` owns basic signal, performance, frame pacing, and video IO status reporting.
- `RuntimeEventDispatcher` publishes typed observations such as timing samples and backend state changes.
- `RuntimeStatePresenter` includes some health/performance fields in runtime-state output.
- Render and backend paths already collect some timing and late/drop counts.
## Why Phase 8 Exists
The app can detect many problems, but operational visibility is still fragmented:
- some failures show modal dialogs
- some warnings go only to `OutputDebugStringA`
- some timing lives in health telemetry
- some observations are runtime events
- UI-facing state combines operational state with runtime state
- repeated warnings are not uniformly deduplicated, classified, or summarized
Live software needs to answer:
- what is healthy right now?
- what is degraded but still running?
- what recently failed?
- which subsystem is under timing pressure?
- what should an operator see versus what should an engineer debug?
## Goals
Phase 8 should establish:
- structured log entries with subsystem, severity, category, timestamp, and message
- subsystem-scoped health states
- bounded recent warning/error history
- timing samples, counters, and gauges for render/control/backend/persistence
- stable health snapshots for UI/diagnostics
- direct debug-output paths wrapped by structured telemetry
- low-overhead reporting from render and callback paths
- tests for severity, deduplication, counters, snapshots, and bounded retention
## Non-Goals
Phase 8 should not require:
- a cloud telemetry service
- external metrics database
- a full UI redesign
- automatic recovery policy owned by telemetry
- unbounded logs or time-series storage
- replacing every `MessageBoxA` on day one
Telemetry observes and reports. It does not become the control plane.
## Target Model
Suggested core model:
- `TelemetrySubsystem`
- `TelemetrySeverity`
- `TelemetryLogEntry`
- `TelemetryWarningRecord`
- `TelemetryCounter`
- `TelemetryGauge`
- `TelemetryTimingSample`
- `SubsystemHealthState`
- `HealthSnapshot`
Important distinction:
- raw observations are append/update operations
- health snapshots are derived read models
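A minimal C++ sketch of the core model types, using the names from the list above. The fields and enum members shown here are illustrative assumptions, not the final design:

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Illustrative sketch of the proposed model types. Field choices
// (category key, steady-clock timestamps, etc.) are assumptions.
enum class TelemetrySubsystem : uint8_t {
    ApplicationShell, RuntimeStore, RuntimeCoordinator, RuntimeSnapshotProvider,
    ControlServices, RenderEngine, VideoBackend, Persistence,
};

enum class TelemetrySeverity : uint8_t { Info, Warning, Error };

struct TelemetryLogEntry {
    TelemetrySubsystem subsystem;
    TelemetrySeverity severity;
    std::string category;   // stable machine-readable key, e.g. "readback.fallback"
    std::string message;    // human-readable text
    std::chrono::steady_clock::time_point timestamp;
};

struct TelemetryWarningRecord {
    TelemetrySubsystem subsystem;
    std::string key;        // deduplication key
    std::string message;
    uint64_t occurrences = 0;
    std::chrono::steady_clock::time_point firstSeen;
    std::chrono::steady_clock::time_point lastSeen;
    bool active = true;
};
```

Raw observations append or update these records; the snapshot types below are derived from them on read.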
## Health Domains
At minimum:
- `ApplicationShell`
- `RuntimeStore`
- `RuntimeCoordinator`
- `RuntimeSnapshotProvider`
- `ControlServices`
- `RenderEngine`
- `VideoBackend`
- `Persistence`
Suggested states:
- `Healthy`
- `Warning`
- `Degraded`
- `Error`
- `Unavailable`
The overall app health should be derived from subsystem states.
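The derivation can be as simple as taking the worst subsystem state. A sketch in C++, assuming the enum declaration order above encodes increasing severity (including `Unavailable` as worst, which is itself a design choice):

```cpp
#include <algorithm>
#include <vector>

// Declaration order doubles as a severity ordering; this is an assumption.
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Roll subsystem states up into one overall state by taking the worst.
inline SubsystemHealthState DeriveOverallHealth(
        const std::vector<SubsystemHealthState>& states) {
    SubsystemHealthState worst = SubsystemHealthState::Healthy;
    for (SubsystemHealthState s : states) {
        worst = std::max(worst, s);  // scoped enums compare by underlying value
    }
    return worst;
}
```

With no subsystems reporting, the overall state defaults to `Healthy`; whether that should instead be `Unavailable` at startup is another open choice.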
## Proposed Interfaces
### Write Interface
Target operations:
- `AppendLog(...)`
- `RaiseWarning(...)`
- `ClearWarning(...)`
- `RecordCounterDelta(...)`
- `RecordGauge(...)`
- `RecordTimingSample(...)`
- `ReportSubsystemState(...)`
Hot-path producers should be able to record observations cheaply and return.
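For the cheapest of these operations, counters and gauges can be plain atomics with relaxed ordering, so render and callback code records a value and returns without taking a lock. A minimal sketch (names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Hot-path-friendly counter: one relaxed atomic add, no allocation, no lock.
struct TelemetryCounter {
    std::atomic<uint64_t> value{0};
    void Add(uint64_t delta) { value.fetch_add(delta, std::memory_order_relaxed); }
    uint64_t Load() const { return value.load(std::memory_order_relaxed); }
};

// Gauge: last-written value wins; readers see a recent (not synchronized) value.
struct TelemetryGauge {
    std::atomic<int64_t> value{0};
    void Set(int64_t v) { value.store(v, std::memory_order_relaxed); }
    int64_t Load() const { return value.load(std::memory_order_relaxed); }
};
```

Relaxed ordering is enough here because each metric stands alone; nothing downstream depends on cross-metric ordering.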
### Read Interface
Target operations:
- `BuildHealthSnapshot()`
- `GetSubsystemHealth(...)`
- `GetActiveWarnings()`
- `GetRecentLogs(...)`
- `GetTimingSummary(...)`
UI/control services should consume snapshots, not scrape subsystem internals.
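One way to enforce that boundary is to make the snapshot a plain value type copied out under a lock, so callers never hold references into telemetry internals. A sketch, with an assumed minimal field set:

```cpp
#include <mutex>
#include <string>
#include <vector>

// Plain value type: callers own an immutable copy, not a view into internals.
struct HealthSnapshot {
    std::string overallHealth;
    std::vector<std::string> activeWarnings;
    std::vector<std::string> recentLogs;
};

class HealthTelemetryReadModel {
public:
    // Copy out under the lock; the returned snapshot is stable by construction.
    HealthSnapshot BuildHealthSnapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return snapshot_;
    }
    void RaiseWarning(std::string message) {
        std::lock_guard<std::mutex> lock(mutex_);
        snapshot_.activeWarnings.push_back(std::move(message));
    }
private:
    mutable std::mutex mutex_;
    HealthSnapshot snapshot_;
};
```

The copy cost is acceptable because snapshots are bounded (see the logging policy below) and read at UI rates, not render rates.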
## Producer Expectations
### `RenderEngine`
Expected observations:
- render frame duration
- input upload duration/count/drop/coalescing
- output request latency
- readback duration
- synchronous readback fallback count
- preview present cost/skips
- wrong-thread diagnostics
### `VideoBackend`
Expected observations:
- lifecycle state
- playout queue depth
- output underruns
- late/dropped/flushed/completed counts
- input signal state
- output model/mode status
- spare buffer depth
### `ControlServices`
Expected observations:
- OSC decode errors
- control request failures
- websocket broadcast failures
- ingress queue depth
- file-watch/reload events
- service start/stop state
### `RuntimeCoordinator`
Expected observations:
- rejected mutation count and reasons
- reload requests
- preset failures
- transient-state invalidations
- persistence request publication
### `RuntimeSnapshotProvider`
Expected observations:
- snapshot publish duration
- snapshot version churn
- stale snapshot/fallback behavior
- publish failures
### `PersistenceWriter`
Expected observations:
- pending write count
- coalesced write count
- write duration
- write failure
- unsaved durable changes
- shutdown flush result
## Logging Policy
Direct string logging can remain as an output sink, but not as the source of truth.
Target flow:
```text
subsystem reports structured warning/log
-> HealthTelemetry stores bounded structured entry
-> optional debug sink prints text
-> UI/diagnostics reads health snapshot
```
Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.
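The deduplication step can be sketched as a map keyed by warning key, where repeated reports bump a count and refresh last-seen instead of appending new entries (names are illustrative):

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

struct WarningRecord {
    std::string message;
    uint64_t count = 0;
    std::chrono::steady_clock::time_point lastSeen;
};

// Key-based deduplication: one record per key, however often it fires.
class WarningTable {
public:
    void Raise(const std::string& key, const std::string& message) {
        WarningRecord& rec = records_[key];  // inserts on first raise
        rec.message = message;
        rec.count += 1;
        rec.lastSeen = std::chrono::steady_clock::now();
    }
    uint64_t CountFor(const std::string& key) const {
        auto it = records_.find(key);
        return it == records_.end() ? 0 : it->second.count;
    }
private:
    std::unordered_map<std::string, WarningRecord> records_;
};
```

A warning that fires every frame then costs one map update per frame and one record of memory, instead of an unbounded log stream.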
## Snapshot Contract
`HealthSnapshot` should answer:
- overall health
- subsystem health states
- active warnings
- recent important logs
- key counters
- key timing summaries
- degraded-state reasons
The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by `ControlServices`, but they should remain separate read models.
## Migration Plan
### Step 1. Expand Health Model Types
Add structured subsystem/severity/category types and snapshot models.
Initial target:
- keep existing health fields
- add structured warning/log/counter/gauge containers
- add tests for bounded retention and deduplication
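The bounded containers can be as simple as a capacity-capped deque that evicts oldest-first, which is also what the retention tests would exercise. A sketch (capacity value arbitrary):

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Bounded log retention: fixed capacity, oldest entry evicted first.
class BoundedLog {
public:
    explicit BoundedLog(std::size_t capacity) : capacity_(capacity) {}
    void Append(std::string entry) {
        if (entries_.size() == capacity_) entries_.pop_front();  // drop oldest
        entries_.push_back(std::move(entry));
    }
    std::size_t Size() const { return entries_.size(); }
    const std::string& Oldest() const { return entries_.front(); }
    const std::string& Newest() const { return entries_.back(); }
private:
    std::size_t capacity_;
    std::deque<std::string> entries_;
};
```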
### Step 2. Wrap Direct Warning Paths
Route common direct logs through telemetry first.
Initial candidates:
- backend fallback warnings
- screenshot write failures
- OSC decode/dispatch failures
- render-thread request failures
### Step 3. Add Subsystem Health States
Let subsystems report state transitions.
Initial target:
- `RenderEngine`: healthy/degraded on render-thread request failures
- `VideoBackend`: configured/running/degraded/no-input/dropping
- `ControlServices`: running/degraded/stopped
- `Persistence`: clean/pending/error
### Step 4. Split Timing Into Named Metrics
Move from broad timing fields to named samples/gauges.
Initial target:
- render duration
- readback duration/fallback count
- output request latency
- playout completion interval
- event queue depth
- persistence write duration
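Each named metric can keep aggregates rather than raw samples, so recording stays O(1) on the render path. A sketch, assuming microseconds as the unit (the real unit choice is still open):

```cpp
#include <algorithm>
#include <cstdint>

// Per-metric summary: count/total/max instead of stored samples,
// so Record() is constant-time and allocation-free.
struct TimingSummary {
    uint64_t count = 0;
    uint64_t totalUs = 0;
    uint64_t maxUs = 0;

    void Record(uint64_t us) {
        count += 1;
        totalUs += us;
        maxUs = std::max(maxUs, us);
    }
    uint64_t MeanUs() const { return count ? totalUs / count : 0; }
};
```

Whether summaries should also keep rolling windows or percentiles is one of the open questions below; this shape at least makes "mean render duration" and "worst readback" directly answerable.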
### Step 5. Publish Health Snapshot
Expose `HealthTelemetry` snapshot through control/runtime presentation.
Initial target:
- UI can distinguish runtime state from operational health
- active warnings are visible
- recent degraded reasons are visible
### Step 6. Add Operational Tests
Cover:
- warning raise/clear
- repeated warning coalescing
- counter/gauge updates
- health derivation
- bounded log retention
- snapshot stability
## Testing Strategy
Recommended tests:
- warning raised appears in active warnings
- warning clear removes active warning but preserves history
- repeated warning increments count and updates last-seen time
- bounded log keeps newest entries
- subsystem `Error` makes overall health `Error`
- subsystem `Degraded` makes overall health `Degraded` when no subsystem is in `Error`
- timing sample updates summary
- counter delta accumulates
- health snapshot is read-only/stable
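The raise/clear case above can be sketched with plain asserts against a hypothetical minimal warning store; the key property is that clearing deactivates the warning without erasing its history:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

struct Warning { uint64_t count = 0; bool active = false; };

// Hypothetical minimal store: just enough surface for the raise/clear test.
struct WarningStore {
    std::unordered_map<std::string, Warning> warnings;
    void Raise(const std::string& key) {
        Warning& w = warnings[key];
        w.count += 1;
        w.active = true;
    }
    void Clear(const std::string& key) {
        auto it = warnings.find(key);
        if (it != warnings.end()) it->second.active = false;  // keep history
    }
    bool IsActive(const std::string& key) const {
        auto it = warnings.find(key);
        return it != warnings.end() && it->second.active;
    }
    uint64_t History(const std::string& key) const {
        auto it = warnings.find(key);
        return it == warnings.end() ? 0 : it->second.count;
    }
};

// Raise -> active; clear -> inactive, but occurrence history survives.
inline void WarningRaiseClearTest() {
    WarningStore store;
    store.Raise("backend.underrun");
    assert(store.IsActive("backend.underrun"));
    store.Clear("backend.underrun");
    assert(!store.IsActive("backend.underrun"));
    assert(store.History("backend.underrun") == 1);
}
```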
Useful homes:
- `HealthTelemetryTests`
- `RuntimeEventTypeTests` for observation event payloads
- future integration tests for control-service health publication
## Risks
### Telemetry Becomes Behavior
Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.
### Too Much Hot-Path Cost
Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.
### String-Only Logging
Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.
### Snapshot Bloat
Health snapshots should summarize operational state, not duplicate full runtime/project state.
### Alert Noise
Without deduplication and severity discipline, operator-facing health can become noisy and ignored.
## Phase 8 Exit Criteria
Phase 8 can be considered complete once the project can say:
- [ ] major subsystems publish structured health/telemetry observations
- [ ] active warnings and recent logs are structured and bounded
- [ ] subsystem health states roll up to an overall health state
- [ ] render/backend/control/persistence timing metrics are named and visible
- [ ] direct debug-string warning paths are wrapped or retired for major cases
- [ ] UI/control diagnostics can consume a stable health snapshot
- [ ] telemetry write paths are cheap enough for render/callback use
- [ ] telemetry behavior has focused tests
## Open Questions
- Should debug output remain enabled by default as a telemetry sink?
- How many recent logs/warnings should be retained in memory?
- Should timing summaries store raw samples, rolling windows, or both?
- Should warning thresholds be declared centrally or owned by each subsystem?
- Should health snapshots be published with runtime state or on a separate endpoint/channel?
- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?
## Short Version
Phase 8 should make the app diagnosable.
Subsystems report structured observations. `HealthTelemetry` records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.