docs/PHASE_8_HEALTH_TELEMETRY_DESIGN.md
# Phase 8 Design: Health, Telemetry, And Operational Reporting

This document expands Phase 8 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.

Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.

## Status

- Phase 8 design package: proposed.
- Phase 8 implementation: not started.
- Current alignment: `HealthTelemetry` exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.

Current telemetry footholds:

- `HealthTelemetry` owns basic signal, performance, frame pacing, and video IO status reporting.
- `RuntimeEventDispatcher` publishes typed observations such as timing samples and backend state changes.
- `RuntimeStatePresenter` includes some health/performance fields in runtime-state output.
- Render and backend paths already collect some timing and late/drop counts.

## Why Phase 8 Exists

The app can detect many problems, but operational visibility is still fragmented:

- some failures show modal dialogs
- some warnings go only to `OutputDebugStringA`
- some timing lives in health telemetry
- some observations are runtime events
- UI-facing state combines operational state with runtime state
- repeated warnings are not uniformly deduplicated, classified, or summarized

Live software needs to answer:

- what is healthy right now?
- what is degraded but still running?
- what recently failed?
- which subsystem is under timing pressure?
- what should an operator see versus what should an engineer debug?

## Goals

Phase 8 should establish:

- structured log entries with subsystem, severity, category, timestamp, and message
- subsystem-scoped health states
- bounded recent warning/error history
- timing samples, counters, and gauges for render/control/backend/persistence
- stable health snapshots for UI/diagnostics
- direct debug-output paths wrapped by structured telemetry
- low-overhead reporting from render and callback paths
- tests for severity, deduplication, counters, snapshots, and bounded retention

## Non-Goals

Phase 8 should not require:

- a cloud telemetry service
- an external metrics database
- a full UI redesign
- automatic recovery policy owned by telemetry
- unbounded logs or time-series storage
- replacing every `MessageBoxA` on day one

Telemetry observes and reports. It does not become the control plane.

## Target Model

Suggested core model:

- `TelemetrySubsystem`
- `TelemetrySeverity`
- `TelemetryLogEntry`
- `TelemetryWarningRecord`
- `TelemetryCounter`
- `TelemetryGauge`
- `TelemetryTimingSample`
- `SubsystemHealthState`
- `HealthSnapshot`

Important distinction:

- raw observations are append/update operations
- health snapshots are derived read models

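As a rough illustration of how a few of these types might look in C++, here is a minimal sketch. Only the type names come from the list above. Every field, the `TelemetrySeverity` levels, the `SubsystemHealth` struct, the `MakeLogEntry` helper, and the choice of `std::chrono::steady_clock` are assumptions made for this example, not the existing `HealthTelemetry` code.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using TelemetryClock = std::chrono::steady_clock;  // assumption: monotonic timestamps

enum class TelemetrySubsystem {
    ApplicationShell, RuntimeStore, RuntimeCoordinator, RuntimeSnapshotProvider,
    ControlServices, RenderEngine, VideoBackend, Persistence
};
enum class TelemetrySeverity { Debug, Info, Warning, Error };  // assumed levels
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Raw observation: appended by a producer and never edited afterwards.
struct TelemetryLogEntry {
    TelemetrySubsystem subsystem;
    TelemetrySeverity severity;
    std::string category;
    TelemetryClock::time_point timestamp;
    std::string message;
};

// Raw observation: one record per warning key, updated when the warning repeats.
struct TelemetryWarningRecord {
    std::string key;
    TelemetrySubsystem subsystem;
    TelemetrySeverity severity;
    std::uint64_t count = 0;
    TelemetryClock::time_point lastSeen;
};

// Hypothetical per-subsystem row inside a snapshot.
struct SubsystemHealth {
    TelemetrySubsystem subsystem;
    SubsystemHealthState state;
    std::string reason;
};

// Derived read model: built from raw observations, then handed out by value.
struct HealthSnapshot {
    SubsystemHealthState overall = SubsystemHealthState::Healthy;
    std::vector<SubsystemHealth> subsystems;
    std::vector<TelemetryWarningRecord> activeWarnings;
    std::vector<TelemetryLogEntry> recentLogs;
};

// Hypothetical helper that stamps the entry with the current time.
inline TelemetryLogEntry MakeLogEntry(TelemetrySubsystem subsystem,
                                      TelemetrySeverity severity,
                                      std::string category,
                                      std::string message) {
    assert(!category.empty() && "category is a first-class field");
    return TelemetryLogEntry{subsystem, severity, std::move(category),
                             TelemetryClock::now(), std::move(message)};
}
```

Keeping `HealthSnapshot` as a plain value type is one way to express the distinction above: producers only append or update raw records, and readers only ever see a copy.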
## Health Domains

At minimum:

- `ApplicationShell`
- `RuntimeStore`
- `RuntimeCoordinator`
- `RuntimeSnapshotProvider`
- `ControlServices`
- `RenderEngine`
- `VideoBackend`
- `Persistence`

Suggested states:

- `Healthy`
- `Warning`
- `Degraded`
- `Error`
- `Unavailable`

The overall app health should be derived from subsystem states.

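One simple way to derive overall health is a "worst state wins" rollup. The sketch below assumes a severity ordering of `Healthy < Warning < Degraded < Unavailable < Error`. The position of `Unavailable` in that ordering is a guess made for the example and is a policy decision this design does not settle. `SeverityRank` and `DeriveOverallHealth` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Assumed ordering for the rollup; where Unavailable sits is an open choice.
inline int SeverityRank(SubsystemHealthState s) {
    switch (s) {
        case SubsystemHealthState::Healthy:     return 0;
        case SubsystemHealthState::Warning:     return 1;
        case SubsystemHealthState::Degraded:    return 2;
        case SubsystemHealthState::Unavailable: return 3;
        case SubsystemHealthState::Error:       return 4;
    }
    assert(false && "unhandled SubsystemHealthState");
    return 0;
}

// Overall health is the worst state reported by any subsystem.
// With no subsystems reporting, this sketch answers Healthy.
inline SubsystemHealthState DeriveOverallHealth(
        const std::vector<SubsystemHealthState>& states) {
    SubsystemHealthState worst = SubsystemHealthState::Healthy;
    for (SubsystemHealthState s : states) {
        if (SeverityRank(s) > SeverityRank(worst)) {
            worst = s;
        }
    }
    return worst;
}
```

Using an explicit rank function instead of comparing enumerator values keeps the rollup policy independent of the order in which the states happen to be declared.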
## Proposed Interfaces

### Write Interface

Target operations:

- `AppendLog(...)`
- `RaiseWarning(...)`
- `ClearWarning(...)`
- `RecordCounterDelta(...)`
- `RecordGauge(...)`
- `RecordTimingSample(...)`
- `ReportSubsystemState(...)`

Hot-path producers should be able to record observations cheaply and return.

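To make "cheaply" concrete, here is one possible shape for a hot-path counter write: a fixed array of atomics indexed by an enum, so that recording a delta is a single relaxed atomic add with no lock, allocation, or string formatting. `CounterId`, its enumerators, and `HotPathCounters` are hypothetical names invented for this sketch. Only `RecordCounterDelta` appears in the list above, and its real signature is not specified there.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed set of counters known at compile time.
enum class CounterId : std::size_t {
    RenderLateFrames,
    RenderDroppedFrames,
    SyncReadbackFallbacks,
    Count  // number of counters, not a counter itself
};

class HotPathCounters {
public:
    // Safe to call from render or callback threads: one relaxed atomic add.
    void RecordCounterDelta(CounterId id, std::uint64_t delta) noexcept {
        counters_[Index(id)].fetch_add(delta, std::memory_order_relaxed);
    }

    // Called by the snapshot builder; values are eventually consistent.
    std::uint64_t Read(CounterId id) const noexcept {
        return counters_[Index(id)].load(std::memory_order_relaxed);
    }

private:
    static std::size_t Index(CounterId id) noexcept {
        const std::size_t i = static_cast<std::size_t>(id);
        assert(i < static_cast<std::size_t>(CounterId::Count));
        return i;
    }

    // Value-initialized, so every counter starts at zero.
    std::array<std::atomic<std::uint64_t>,
               static_cast<std::size_t>(CounterId::Count)> counters_{};
};
```

Relaxed ordering is enough here because each counter is read on its own and is not used to synchronize other data. A snapshot may see counters from slightly different instants, which is usually acceptable for diagnostics.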
### Read Interface

Target operations:

- `BuildHealthSnapshot()`
- `GetSubsystemHealth(...)`
- `GetActiveWarnings()`
- `GetRecentLogs(...)`
- `GetTimingSummary(...)`

UI/control services should consume snapshots, not scrape subsystem internals.

## Producer Expectations

### `RenderEngine`

Expected observations:

- render frame duration
- input upload duration/count/drop/coalescing
- output request latency
- readback duration
- synchronous readback fallback count
- preview present cost/skips
- wrong-thread diagnostics

### `VideoBackend`

Expected observations:

- lifecycle state
- playout queue depth
- output underruns
- late/dropped/flushed/completed counts
- input signal state
- output model/mode status
- spare buffer depth

### `ControlServices`

Expected observations:

- OSC decode errors
- control request failures
- websocket broadcast failures
- ingress queue depth
- file-watch/reload events
- service start/stop state

### `RuntimeCoordinator`

Expected observations:

- rejected mutation count and reasons
- reload requests
- preset failures
- transient-state invalidations
- persistence request publication

### `RuntimeSnapshotProvider`

Expected observations:

- snapshot publish duration
- snapshot version churn
- stale snapshot/fallback behavior
- publish failures

### `PersistenceWriter`

Expected observations:

- pending write count
- coalesced write count
- write duration
- write failure
- unsaved durable changes
- shutdown flush result

## Logging Policy

Direct string logging can remain as an output sink, but not as the source of truth.

Target flow:

```text
subsystem reports structured warning/log
  -> HealthTelemetry stores bounded structured entry
  -> optional debug sink prints text
  -> UI/diagnostics reads health snapshot
```

Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.

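Key-based deduplication could look roughly like the sketch below. `WarningRecord`, `WarningTable`, `ActiveKeys`, `Find`, and the key string `"video.no-input"` are hypothetical. `RaiseWarning` and `ClearWarning` borrow their names from the write interface, but the signatures shown are assumptions. The sketch also leaves out locking and any cap on the number of distinct keys, both of which a real implementation would need.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using TelemetryClock = std::chrono::steady_clock;

struct WarningRecord {
    std::string message;
    std::uint64_t count = 0;              // how many times the warning was raised
    TelemetryClock::time_point firstSeen;
    TelemetryClock::time_point lastSeen;
    bool active = false;
};

class WarningTable {
public:
    // Returns true only when the key was not already active, so an optional
    // debug sink can print once per episode instead of once per repeat.
    bool RaiseWarning(const std::string& key, const std::string& message,
                      TelemetryClock::time_point now = TelemetryClock::now()) {
        assert(!key.empty());
        WarningRecord& r = records_[key];  // creates the record on first raise
        if (r.count == 0) {
            r.firstSeen = now;
        }
        const bool newlyActive = !r.active;
        r.message = message;
        r.lastSeen = now;
        r.active = true;
        ++r.count;
        return newlyActive;
    }

    // Clearing deactivates the warning but keeps its history.
    void ClearWarning(const std::string& key) {
        auto it = records_.find(key);
        if (it != records_.end()) {
            it->second.active = false;
        }
    }

    std::vector<std::string> ActiveKeys() const {
        std::vector<std::string> keys;
        for (const auto& entry : records_) {
            if (entry.second.active) {
                keys.push_back(entry.first);
            }
        }
        return keys;
    }

    // Returns nullptr when the key has never been raised.
    const WarningRecord* Find(const std::string& key) const {
        auto it = records_.find(key);
        return it != records_.end() ? &it->second : nullptr;
    }

private:
    std::map<std::string, WarningRecord> records_;
};
```

With this shape, raising `"video.no-input"` twice leaves one active record with `count == 2`, and only the first call returns `true`.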
## Snapshot Contract

`HealthSnapshot` should answer:

- overall health
- subsystem health states
- active warnings
- recent important logs
- key counters
- key timing summaries
- degraded-state reasons

The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by `ControlServices`, but they should remain separate read models.

## Migration Plan

### Step 1. Expand Health Model Types

Add structured subsystem/severity/category types and snapshot models.

Initial target:

- keep existing health fields
- add structured warning/log/counter/gauge containers
- add tests for bounded retention and deduplication

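For the bounded-retention part of this step, a container along the following lines may be enough to start with. `BoundedHistory` and its member names are hypothetical, and `std::deque` is used for brevity. A hot-path variant would more likely use a preallocated ring buffer to avoid allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Keeps at most `capacity` entries and evicts the oldest on overflow.
template <typename Entry>
class BoundedHistory {
public:
    explicit BoundedHistory(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0 && "a zero-capacity history cannot retain anything");
    }

    void Append(Entry entry) {
        if (entries_.size() == capacity_) {
            entries_.pop_front();  // drop the oldest entry
            ++evictedCount_;
        }
        entries_.push_back(std::move(entry));
    }

    // Returns up to maxCount of the newest entries, oldest first.
    std::vector<Entry> Recent(std::size_t maxCount) const {
        const std::size_t n = std::min(maxCount, entries_.size());
        return std::vector<Entry>(
            entries_.end() - static_cast<std::ptrdiff_t>(n), entries_.end());
    }

    std::size_t Size() const { return entries_.size(); }

    // How many entries have been dropped; useful as a telemetry counter itself.
    std::size_t EvictedCount() const { return evictedCount_; }

private:
    std::size_t capacity_;
    std::size_t evictedCount_ = 0;
    std::deque<Entry> entries_;
};
```

A retention test then reduces to appending more entries than the capacity and checking that only the newest ones remain, which matches the "bounded log keeps newest entries" case in the testing strategy.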
### Step 2. Wrap Direct Warning Paths

Route common direct logs through telemetry first.

Initial candidates:

- backend fallback warnings
- screenshot write failures
- OSC decode/dispatch failures
- render-thread request failures

### Step 3. Add Subsystem Health States

Let subsystems report state transitions.

Initial target:

- `RenderEngine`: healthy/degraded on render-thread request failures
- `VideoBackend`: configured/running/degraded/no-input/dropping
- `ControlServices`: running/degraded/stopped
- `Persistence`: clean/pending/error

### Step 4. Split Timing Into Named Metrics

Move from broad timing fields to named samples/gauges.

Initial target:

- render duration
- readback duration/fallback count
- output request latency
- playout completion interval
- event queue depth
- persistence write duration

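A named timing metric can be as small as a running summary per name. The sketch below is one possible shape under several assumptions: metrics are keyed by string, the summary tracks count/last/min/max/mean rather than raw samples or rolling windows (an open question in this design), and no locking is shown. `TimingSummary`, `TimingMetrics`, and the metric name `"render.frame"` are hypothetical. A hot path would likely use enum identifiers instead of string keys.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

struct TimingSummary {
    std::uint64_t sampleCount = 0;
    double lastMs = 0.0;
    double minMs = 0.0;
    double maxMs = 0.0;
    double totalMs = 0.0;

    double MeanMs() const {
        return sampleCount == 0 ? 0.0
                                : totalMs / static_cast<double>(sampleCount);
    }
};

class TimingMetrics {
public:
    using Milliseconds = std::chrono::duration<double, std::milli>;

    // Folds one sample into the running summary for the named metric.
    void RecordTimingSample(const std::string& name, Milliseconds duration) {
        const double ms = duration.count();
        assert(ms >= 0.0 && "durations should be non-negative");
        TimingSummary& s = summaries_[name];
        s.minMs = (s.sampleCount == 0) ? ms : std::min(s.minMs, ms);
        s.maxMs = (s.sampleCount == 0) ? ms : std::max(s.maxMs, ms);
        s.lastMs = ms;
        s.totalMs += ms;
        ++s.sampleCount;
    }

    // Returns an empty summary for names that have no samples yet.
    TimingSummary GetTimingSummary(const std::string& name) const {
        auto it = summaries_.find(name);
        return it != summaries_.end() ? it->second : TimingSummary{};
    }

private:
    std::map<std::string, TimingSummary> summaries_;
};
```

Because `Milliseconds` uses a floating-point representation, callers can pass `std::chrono::milliseconds`, `std::chrono::microseconds`, or a `steady_clock` difference directly and the conversion happens implicitly.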
### Step 5. Publish Health Snapshot

Expose `HealthTelemetry` snapshot through control/runtime presentation.

Initial target:

- UI can distinguish runtime state from operational health
- active warnings are visible
- recent degraded reasons are visible

### Step 6. Add Operational Tests

Cover:

- warning raise/clear
- repeated warning coalescing
- counter/gauge updates
- health derivation
- bounded log retention
- snapshot stability

## Testing Strategy

Recommended tests:

- warning raised appears in active warnings
- warning clear removes active warning but preserves history
- repeated warning increments count and updates last-seen time
- bounded log keeps newest entries
- subsystem `Error` makes overall health `Error`
- subsystem `Degraded` makes overall health `Degraded` if no error exists
- timing sample updates summary
- counter delta accumulates
- health snapshot is read-only/stable

Useful homes:

- `HealthTelemetryTests`
- `RuntimeEventTypeTests` for observation event payloads
- future integration tests for control-service health publication

## Risks

### Telemetry Becomes Behavior

Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.

### Too Much Hot-Path Cost

Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.

### String-Only Logging

Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.

### Snapshot Bloat

Health snapshots should summarize operational state, not duplicate full runtime/project state.

### Alert Noise

Without deduplication and severity discipline, operator-facing health can become noisy and ignored.

## Phase 8 Exit Criteria

Phase 8 can be considered complete once the project can say:

- [ ] major subsystems publish structured health/telemetry observations
- [ ] active warnings and recent logs are structured and bounded
- [ ] subsystem health states roll up to an overall health state
- [ ] render/backend/control/persistence timing metrics are named and visible
- [ ] direct debug-string warning paths are wrapped or retired for major cases
- [ ] UI/control diagnostics can consume a stable health snapshot
- [ ] telemetry write paths are cheap enough for render/callback use
- [ ] telemetry behavior has focused tests

## Open Questions

- Should debug output remain enabled by default as a telemetry sink?
- How many recent logs/warnings should be retained in memory?
- Should timing summaries store raw samples, rolling windows, or both?
- Should warning thresholds be declared centrally or owned by each subsystem?
- Should health snapshots be published with runtime state or on a separate endpoint/channel?
- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?

## Short Version

Phase 8 should make the app diagnosable.

Subsystems report structured observations. `HealthTelemetry` records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.