video-shader-toys/docs/PHASE_8_HEALTH_TELEMETRY_DESIGN.md
Aiden e8a3805fff
2026-05-11 18:48:55 +10:00


# Phase 8 Design: Health, Telemetry, And Operational Reporting
This document expands Phase 8 of [ARCHITECTURE_RESILIENCE_REVIEW.md](ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.
Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.
## Status
- Phase 8 design package: proposed.
- Phase 8 implementation: not started.
- Current alignment: `HealthTelemetry` exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.
Current telemetry footholds:
- `HealthTelemetry` owns basic signal, performance, frame pacing, and video IO status reporting.
- `RuntimeEventDispatcher` publishes typed observations such as timing samples and backend state changes.
- `RuntimeStatePresenter` includes some health/performance fields in runtime-state output.
- Render and backend paths already collect some timing and late/drop counts.
## Why Phase 8 Exists
The app can detect many problems, but operational visibility is still fragmented:
- some failures show modal dialogs
- some warnings go only to `OutputDebugStringA`
- some timing lives in health telemetry
- some observations are runtime events
- UI-facing state combines operational state with runtime state
- repeated warnings are not uniformly deduplicated, classified, or summarized
Live software needs to answer:
- what is healthy right now?
- what is degraded but still running?
- what recently failed?
- which subsystem is under timing pressure?
- what should an operator see versus what should an engineer debug?
## Goals
Phase 8 should establish:
- structured log entries with subsystem, severity, category, timestamp, and message
- subsystem-scoped health states
- bounded recent warning/error history
- timing samples, counters, and gauges for render/control/backend/persistence
- stable health snapshots for UI/diagnostics
- direct debug-output paths wrapped by structured telemetry
- low-overhead reporting from render and callback paths
- tests for severity, deduplication, counters, snapshots, and bounded retention
## Non-Goals
Phase 8 should not require:
- a cloud telemetry service
- external metrics database
- a full UI redesign
- automatic recovery policy owned by telemetry
- unbounded logs or time-series storage
- replacing every `MessageBoxA` on day one
Telemetry observes and reports. It does not become the control plane.
## Target Model
Suggested core model:
- `TelemetrySubsystem`
- `TelemetrySeverity`
- `TelemetryLogEntry`
- `TelemetryWarningRecord`
- `TelemetryCounter`
- `TelemetryGauge`
- `TelemetryTimingSample`
- `SubsystemHealthState`
- `HealthSnapshot`
Important distinction:
- raw observations are append/update operations
- health snapshots are derived read models
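A minimal C++ sketch of the core model types, using the names from the list above. The fields and enum members shown here are illustrative assumptions, not the final design:

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Illustrative sketch of the proposed model types. Field choices
// (category key, steady-clock timestamps, etc.) are assumptions.
enum class TelemetrySubsystem : uint8_t {
    ApplicationShell, RuntimeStore, RuntimeCoordinator, RuntimeSnapshotProvider,
    ControlServices, RenderEngine, VideoBackend, Persistence,
};

enum class TelemetrySeverity : uint8_t { Info, Warning, Error };

struct TelemetryLogEntry {
    TelemetrySubsystem subsystem;
    TelemetrySeverity severity;
    std::string category;   // stable machine-readable key, e.g. "readback.fallback"
    std::string message;    // human-readable text
    std::chrono::steady_clock::time_point timestamp;
};

struct TelemetryWarningRecord {
    TelemetrySubsystem subsystem;
    std::string key;        // deduplication key
    std::string message;
    uint64_t occurrences = 0;
    std::chrono::steady_clock::time_point firstSeen;
    std::chrono::steady_clock::time_point lastSeen;
    bool active = true;
};
```

Raw observations append or update these records; the snapshot types below are derived from them on read.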
## Health Domains
At minimum:
- `ApplicationShell`
- `RuntimeStore`
- `RuntimeCoordinator`
- `RuntimeSnapshotProvider`
- `ControlServices`
- `RenderEngine`
- `VideoBackend`
- `Persistence`
Suggested states:
- `Healthy`
- `Warning`
- `Degraded`
- `Error`
- `Unavailable`
The overall app health should be derived from subsystem states.
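The derivation can be as simple as taking the worst subsystem state. A sketch in C++, assuming the enum declaration order above encodes increasing severity (including `Unavailable` as worst, which is itself a design choice):

```cpp
#include <algorithm>
#include <vector>

// Declaration order doubles as a severity ordering; this is an assumption.
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Roll subsystem states up into one overall state by taking the worst.
inline SubsystemHealthState DeriveOverallHealth(
        const std::vector<SubsystemHealthState>& states) {
    SubsystemHealthState worst = SubsystemHealthState::Healthy;
    for (SubsystemHealthState s : states) {
        worst = std::max(worst, s);  // scoped enums compare by underlying value
    }
    return worst;
}
```

With no subsystems reporting, the overall state defaults to `Healthy`; whether that should instead be `Unavailable` at startup is another open choice.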
## Proposed Interfaces
### Write Interface
Target operations:
- `AppendLog(...)`
- `RaiseWarning(...)`
- `ClearWarning(...)`
- `RecordCounterDelta(...)`
- `RecordGauge(...)`
- `RecordTimingSample(...)`
- `ReportSubsystemState(...)`
Hot-path producers should be able to record observations cheaply and return.
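For the cheapest of these operations, counters and gauges can be plain atomics with relaxed ordering, so render and callback code records a value and returns without taking a lock. A minimal sketch (names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Hot-path-friendly counter: one relaxed atomic add, no allocation, no lock.
struct TelemetryCounter {
    std::atomic<uint64_t> value{0};
    void Add(uint64_t delta) { value.fetch_add(delta, std::memory_order_relaxed); }
    uint64_t Load() const { return value.load(std::memory_order_relaxed); }
};

// Gauge: last-written value wins; readers see a recent (not synchronized) value.
struct TelemetryGauge {
    std::atomic<int64_t> value{0};
    void Set(int64_t v) { value.store(v, std::memory_order_relaxed); }
    int64_t Load() const { return value.load(std::memory_order_relaxed); }
};
```

Relaxed ordering is enough here because each metric stands alone; nothing downstream depends on cross-metric ordering.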
### Read Interface
Target operations:
- `BuildHealthSnapshot()`
- `GetSubsystemHealth(...)`
- `GetActiveWarnings()`
- `GetRecentLogs(...)`
- `GetTimingSummary(...)`
UI/control services should consume snapshots, not scrape subsystem internals.
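One way to enforce that boundary is to make the snapshot a plain value type copied out under a lock, so callers never hold references into telemetry internals. A sketch, with an assumed minimal field set:

```cpp
#include <mutex>
#include <string>
#include <vector>

// Plain value type: callers own an immutable copy, not a view into internals.
struct HealthSnapshot {
    std::string overallHealth;
    std::vector<std::string> activeWarnings;
    std::vector<std::string> recentLogs;
};

class HealthTelemetryReadModel {
public:
    // Copy out under the lock; the returned snapshot is stable by construction.
    HealthSnapshot BuildHealthSnapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return snapshot_;
    }
    void RaiseWarning(std::string message) {
        std::lock_guard<std::mutex> lock(mutex_);
        snapshot_.activeWarnings.push_back(std::move(message));
    }
private:
    mutable std::mutex mutex_;
    HealthSnapshot snapshot_;
};
```

The copy cost is acceptable because snapshots are bounded (see the logging policy below) and read at UI rates, not render rates.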
## Producer Expectations
### `RenderEngine`
Expected observations:
- render frame duration
- input upload duration/count/drop/coalescing
- output request latency
- readback duration
- synchronous readback fallback count
- preview present cost/skips
- wrong-thread diagnostics
### `VideoBackend`
Expected observations:
- lifecycle state
- playout queue depth
- output underruns
- late/dropped/flushed/completed counts
- input signal state
- output model/mode status
- spare buffer depth
### `ControlServices`
Expected observations:
- OSC decode errors
- control request failures
- websocket broadcast failures
- ingress queue depth
- file-watch/reload events
- service start/stop state
### `RuntimeCoordinator`
Expected observations:
- rejected mutation count and reasons
- reload requests
- preset failures
- transient-state invalidations
- persistence request publication
### `RuntimeSnapshotProvider`
Expected observations:
- snapshot publish duration
- snapshot version churn
- stale snapshot/fallback behavior
- publish failures
### `PersistenceWriter`
Expected observations:
- pending write count
- coalesced write count
- write duration
- write failure
- unsaved durable changes
- shutdown flush result
## Logging Policy
Direct string logging can remain as an output sink, but not as the source of truth.
Target flow:
```text
subsystem reports structured warning/log
-> HealthTelemetry stores bounded structured entry
-> optional debug sink prints text
-> UI/diagnostics reads health snapshot
```
Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.
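The deduplication step can be sketched as a map keyed by warning key, where repeated reports bump a count and refresh last-seen instead of appending new entries (names are illustrative):

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

struct WarningRecord {
    std::string message;
    uint64_t count = 0;
    std::chrono::steady_clock::time_point lastSeen;
};

// Key-based deduplication: one record per key, however often it fires.
class WarningTable {
public:
    void Raise(const std::string& key, const std::string& message) {
        WarningRecord& rec = records_[key];  // inserts on first raise
        rec.message = message;
        rec.count += 1;
        rec.lastSeen = std::chrono::steady_clock::now();
    }
    uint64_t CountFor(const std::string& key) const {
        auto it = records_.find(key);
        return it == records_.end() ? 0 : it->second.count;
    }
private:
    std::unordered_map<std::string, WarningRecord> records_;
};
```

A warning that fires every frame then costs one map update per frame and one record of memory, instead of an unbounded log stream.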
## Snapshot Contract
`HealthSnapshot` should answer:
- overall health
- subsystem health states
- active warnings
- recent important logs
- key counters
- key timing summaries
- degraded-state reasons
The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by `ControlServices`, but they should remain separate read models.
## Migration Plan
### Step 1. Expand Health Model Types
Add structured subsystem/severity/category types and snapshot models.
Initial target:
- keep existing health fields
- add structured warning/log/counter/gauge containers
- add tests for bounded retention and deduplication
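The bounded containers can be as simple as a capacity-capped deque that evicts oldest-first, which is also what the retention tests would exercise. A sketch (capacity value arbitrary):

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Bounded log retention: fixed capacity, oldest entry evicted first.
class BoundedLog {
public:
    explicit BoundedLog(std::size_t capacity) : capacity_(capacity) {}
    void Append(std::string entry) {
        if (entries_.size() == capacity_) entries_.pop_front();  // drop oldest
        entries_.push_back(std::move(entry));
    }
    std::size_t Size() const { return entries_.size(); }
    const std::string& Oldest() const { return entries_.front(); }
    const std::string& Newest() const { return entries_.back(); }
private:
    std::size_t capacity_;
    std::deque<std::string> entries_;
};
```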
### Step 2. Wrap Direct Warning Paths
Route common direct logs through telemetry first.
Initial candidates:
- backend fallback warnings
- screenshot write failures
- OSC decode/dispatch failures
- render-thread request failures
### Step 3. Add Subsystem Health States
Let subsystems report state transitions.
Initial target:
- `RenderEngine`: healthy/degraded on render-thread request failures
- `VideoBackend`: configured/running/degraded/no-input/dropping
- `ControlServices`: running/degraded/stopped
- `Persistence`: clean/pending/error
### Step 4. Split Timing Into Named Metrics
Move from broad timing fields to named samples/gauges.
Initial target:
- render duration
- readback duration/fallback count
- output request latency
- playout completion interval
- event queue depth
- persistence write duration
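Each named metric can keep aggregates rather than raw samples, so recording stays O(1) on the render path. A sketch, assuming microseconds as the unit (the real unit choice is still open):

```cpp
#include <algorithm>
#include <cstdint>

// Per-metric summary: count/total/max instead of stored samples,
// so Record() is constant-time and allocation-free.
struct TimingSummary {
    uint64_t count = 0;
    uint64_t totalUs = 0;
    uint64_t maxUs = 0;

    void Record(uint64_t us) {
        count += 1;
        totalUs += us;
        maxUs = std::max(maxUs, us);
    }
    uint64_t MeanUs() const { return count ? totalUs / count : 0; }
};
```

Whether summaries should also keep rolling windows or percentiles is one of the open questions below; this shape at least makes "mean render duration" and "worst readback" directly answerable.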
### Step 5. Publish Health Snapshot
Expose `HealthTelemetry` snapshot through control/runtime presentation.
Initial target:
- UI can distinguish runtime state from operational health
- active warnings are visible
- recent degraded reasons are visible
### Step 6. Add Operational Tests
Cover:
- warning raise/clear
- repeated warning coalescing
- counter/gauge updates
- health derivation
- bounded log retention
- snapshot stability
## Testing Strategy
Recommended tests:
- warning raised appears in active warnings
- warning clear removes active warning but preserves history
- repeated warning increments count and updates last-seen time
- bounded log keeps newest entries
- subsystem `Error` makes overall health `Error`
- subsystem `Degraded` makes overall health `Degraded` when no subsystem is in `Error`
- timing sample updates summary
- counter delta accumulates
- health snapshot is read-only/stable
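The raise/clear case above can be sketched with plain asserts against a hypothetical minimal warning store; the key property is that clearing deactivates the warning without erasing its history:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

struct Warning { uint64_t count = 0; bool active = false; };

// Hypothetical minimal store: just enough surface for the raise/clear test.
struct WarningStore {
    std::unordered_map<std::string, Warning> warnings;
    void Raise(const std::string& key) {
        Warning& w = warnings[key];
        w.count += 1;
        w.active = true;
    }
    void Clear(const std::string& key) {
        auto it = warnings.find(key);
        if (it != warnings.end()) it->second.active = false;  // keep history
    }
    bool IsActive(const std::string& key) const {
        auto it = warnings.find(key);
        return it != warnings.end() && it->second.active;
    }
    uint64_t History(const std::string& key) const {
        auto it = warnings.find(key);
        return it == warnings.end() ? 0 : it->second.count;
    }
};

// Raise -> active; clear -> inactive, but occurrence history survives.
inline void WarningRaiseClearTest() {
    WarningStore store;
    store.Raise("backend.underrun");
    assert(store.IsActive("backend.underrun"));
    store.Clear("backend.underrun");
    assert(!store.IsActive("backend.underrun"));
    assert(store.History("backend.underrun") == 1);
}
```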
Useful homes:
- `HealthTelemetryTests`
- `RuntimeEventTypeTests` for observation event payloads
- future integration tests for control-service health publication
## Risks
### Telemetry Becomes Behavior
Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.
### Too Much Hot-Path Cost
Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.
### String-Only Logging
Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.
### Snapshot Bloat
Health snapshots should summarize operational state, not duplicate full runtime/project state.
### Alert Noise
Without deduplication and severity discipline, operator-facing health can become noisy and ignored.
## Phase 8 Exit Criteria
Phase 8 can be considered complete once the project can say:
- [ ] major subsystems publish structured health/telemetry observations
- [ ] active warnings and recent logs are structured and bounded
- [ ] subsystem health states roll up to an overall health state
- [ ] render/backend/control/persistence timing metrics are named and visible
- [ ] direct debug-string warning paths are wrapped or retired for major cases
- [ ] UI/control diagnostics can consume a stable health snapshot
- [ ] telemetry write paths are cheap enough for render/callback use
- [ ] telemetry behavior has focused tests
## Open Questions
- Should debug output remain enabled by default as a telemetry sink?
- How many recent logs/warnings should be retained in memory?
- Should timing summaries store raw samples, rolling windows, or both?
- Should warning thresholds be declared centrally or owned by each subsystem?
- Should health snapshots be published with runtime state or on a separate endpoint/channel?
- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?
## Short Version
Phase 8 should make the app diagnosable.
Subsystems report structured observations. `HealthTelemetry` records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.