# Phase 8 Design: Health, Telemetry, And Operational Reporting

This document expands Phase 8 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.

Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.

## Status

- Phase 8 design package: proposed.
- Phase 8 implementation: not started.
- Current alignment: `HealthTelemetry` exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.

Current telemetry footholds:

- `HealthTelemetry` owns basic signal, performance, frame pacing, and video IO status reporting.
- `RuntimeEventDispatcher` publishes typed observations such as timing samples and backend state changes.
- `RuntimeStatePresenter` includes some health/performance fields in runtime-state output.
- Render and backend paths already collect some timing and late/drop counts.

## Why Phase 8 Exists

The app can detect many problems, but operational visibility is still fragmented:

- some failures show modal dialogs
- some warnings go only to `OutputDebugStringA`
- some timing lives in health telemetry
- some observations are runtime events
- UI-facing state combines operational state with runtime state
- repeated warnings are not uniformly deduplicated, classified, or summarized

Live software needs to answer:

- what is healthy right now?
- what is degraded but still running?
- what recently failed?
- which subsystem is under timing pressure?
- what should an operator see versus what should an engineer debug?

## Goals

Phase 8 should establish:

- structured log entries with subsystem, severity, category, timestamp, and message
- subsystem-scoped health states
- bounded recent warning/error history
- timing samples, counters, and gauges for render/control/backend/persistence
- stable health snapshots for UI/diagnostics
- direct debug-output paths wrapped by structured telemetry
- low-overhead reporting from render and callback paths
- tests for severity, deduplication, counters, snapshots, and bounded retention

## Non-Goals

Phase 8 should not require:

- a cloud telemetry service
- an external metrics database
- a full UI redesign
- automatic recovery policy owned by telemetry
- unbounded logs or time-series storage
- replacing every `MessageBoxA` on day one

Telemetry observes and reports. It does not become the control plane.

## Target Model

Suggested core model:

- `TelemetrySubsystem`
- `TelemetrySeverity`
- `TelemetryLogEntry`
- `TelemetryWarningRecord`
- `TelemetryCounter`
- `TelemetryGauge`
- `TelemetryTimingSample`
- `SubsystemHealthState`
- `HealthSnapshot`

Important distinction:

- raw observations are append/update operations
- health snapshots are derived read models

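
A minimal C++ sketch of how the types listed above might look. Only the type names come from this document; every field name, the `TelemetrySeverity` enumerators, the `TelemetryClock` alias, and the `ToString` helper are assumptions for illustration, not the existing `HealthTelemetry` API.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <vector>

// Enumerators mirror the health domains named in this document.
enum class TelemetrySubsystem {
    ApplicationShell, RuntimeStore, RuntimeCoordinator, RuntimeSnapshotProvider,
    ControlServices, RenderEngine, VideoBackend, Persistence
};

// Hypothetical severity levels; the document does not fix this list.
enum class TelemetrySeverity { Debug, Info, Warning, Error };

enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

using TelemetryClock = std::chrono::steady_clock;  // assumed clock choice

// Raw observation: appended once, never edited afterwards.
struct TelemetryLogEntry {
    TelemetrySubsystem subsystem = TelemetrySubsystem::ApplicationShell;
    TelemetrySeverity severity = TelemetrySeverity::Info;
    std::string category;
    TelemetryClock::time_point timestamp{};
    std::string message;
};

// Raw observation: updated in place when the same key is raised again.
struct TelemetryWarningRecord {
    std::string key;
    TelemetrySubsystem subsystem = TelemetrySubsystem::ApplicationShell;
    std::string message;
    std::uint64_t occurrenceCount = 0;
    TelemetryClock::time_point lastSeen{};
    bool active = false;
};

struct TelemetryCounter { std::string name; std::uint64_t value = 0; };
struct TelemetryGauge { std::string name; double value = 0.0; };
struct TelemetryTimingSample { std::string metric; double milliseconds = 0.0; };

// Derived read model: built from the raw observations, handed out by value.
struct HealthSnapshot {
    SubsystemHealthState overall = SubsystemHealthState::Healthy;
    std::vector<TelemetryWarningRecord> activeWarnings;
    std::vector<TelemetryLogEntry> recentLogs;
    std::vector<TelemetryCounter> counters;
};

// Text form for debug sinks; structured fields stay the source of truth.
inline const char* ToString(TelemetrySeverity severity) {
    switch (severity) {
        case TelemetrySeverity::Debug:   return "Debug";
        case TelemetrySeverity::Info:    return "Info";
        case TelemetrySeverity::Warning: return "Warning";
        case TelemetrySeverity::Error:   return "Error";
    }
    assert(false && "unhandled TelemetrySeverity");
    return "Unknown";
}
```

Keeping `HealthSnapshot` a plain value type is one way to keep the raw/derived split visible in code: producers only touch the observation types, readers only see the snapshot.
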
## Health Domains

At minimum:

- `ApplicationShell`
- `RuntimeStore`
- `RuntimeCoordinator`
- `RuntimeSnapshotProvider`
- `ControlServices`
- `RenderEngine`
- `VideoBackend`
- `Persistence`

Suggested states:

- `Healthy`
- `Warning`
- `Degraded`
- `Error`
- `Unavailable`

The overall app health should be derived from subsystem states.

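
One possible derivation is "worst subsystem state wins". The sketch below assumes a severity ranking that this document does not specify: in particular, ranking `Unavailable` below `Error` and treating an empty subsystem list as `Healthy` are both assumptions, and `SeverityRank` and `DeriveOverallHealth` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Assumed ranking: Healthy < Warning < Degraded < Unavailable < Error.
// Error ranks highest so that any subsystem Error yields an overall Error.
inline int SeverityRank(SubsystemHealthState state) {
    switch (state) {
        case SubsystemHealthState::Healthy:     return 0;
        case SubsystemHealthState::Warning:     return 1;
        case SubsystemHealthState::Degraded:    return 2;
        case SubsystemHealthState::Unavailable: return 3;
        case SubsystemHealthState::Error:       return 4;
    }
    assert(false && "unhandled SubsystemHealthState");
    return 0;
}

// Returns the highest-ranked state; an empty input is reported as Healthy.
inline SubsystemHealthState DeriveOverallHealth(
    const std::vector<SubsystemHealthState>& subsystemStates) {
    SubsystemHealthState worst = SubsystemHealthState::Healthy;
    for (SubsystemHealthState state : subsystemStates) {
        if (SeverityRank(state) > SeverityRank(worst)) {
            worst = state;
        }
    }
    return worst;
}
```

Because the function is pure, the rollup rules can be unit tested without constructing any subsystem.
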
## Proposed Interfaces

### Write Interface

Target operations:

- `AppendLog(...)`
- `RaiseWarning(...)`
- `ClearWarning(...)`
- `RecordCounterDelta(...)`
- `RecordGauge(...)`
- `RecordTimingSample(...)`
- `ReportSubsystemState(...)`

Hot-path producers should be able to record observations cheaply and return.

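
As an illustration of a cheap hot-path write, the sketch below backs `RecordCounterDelta` with a fixed array of relaxed atomics, so a producer neither locks, allocates, nor formats. The `TelemetryCounterId` enum, its enumerators, and the `HotPathCounters` class are hypothetical; the approach assumes the set of hot-path counters is known at compile time.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed set of hot-path counters.
enum class TelemetryCounterId : std::size_t {
    RenderFramesLate,
    RenderFramesDropped,
    SyncReadbackFallbacks,
    Count  // sentinel: number of counters, not a real counter
};

class HotPathCounters {
public:
    // Safe to call from render and callback threads: one relaxed atomic add.
    void RecordCounterDelta(TelemetryCounterId id, std::uint64_t delta) noexcept {
        counters_[Index(id)].fetch_add(delta, std::memory_order_relaxed);
    }

    // Read side, e.g. while a snapshot is being built on another thread.
    std::uint64_t Read(TelemetryCounterId id) const noexcept {
        return counters_[Index(id)].load(std::memory_order_relaxed);
    }

private:
    static constexpr std::size_t kCount =
        static_cast<std::size_t>(TelemetryCounterId::Count);

    static std::size_t Index(TelemetryCounterId id) noexcept {
        const std::size_t index = static_cast<std::size_t>(id);
        assert(index < kCount);
        return index;
    }

    // Value-initialized, so every counter starts at zero.
    std::array<std::atomic<std::uint64_t>, kCount> counters_{};
};
```

Relaxed ordering is enough here because each counter is an independent statistic; a reader may see counters from slightly different instants, which is usually acceptable for health reporting.
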
### Read Interface

Target operations:

- `BuildHealthSnapshot()`
- `GetSubsystemHealth(...)`
- `GetActiveWarnings()`
- `GetRecentLogs(...)`
- `GetTimingSummary(...)`

UI/control services should consume snapshots, not scrape subsystem internals.

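
A small sketch of the snapshot idea: `BuildHealthSnapshot()` returns a copy taken under a lock, so a reader keeps a stable view even while producers continue to write. `HealthStore`, its members, and the reduced `HealthSnapshot` fields are hypothetical and much smaller than the real model. The mutex is an assumption that suits control-rate writers; render and callback producers would likely need a cheaper path.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Reduced, hypothetical snapshot shape for illustration only.
struct HealthSnapshot {
    std::vector<std::string> activeWarnings;
    std::uint64_t droppedFrames = 0;
};

class HealthStore {
public:
    void RaiseWarning(std::string text) {
        assert(!text.empty());
        std::lock_guard<std::mutex> lock(mutex_);
        state_.activeWarnings.push_back(std::move(text));
    }

    void RecordDroppedFrames(std::uint64_t delta) {
        std::lock_guard<std::mutex> lock(mutex_);
        state_.droppedFrames += delta;
    }

    // Returns a copy: later writes never change a snapshot already handed out.
    HealthSnapshot BuildHealthSnapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_;
    }

private:
    mutable std::mutex mutex_;
    HealthSnapshot state_;
};
```

A consumer such as a control service would call `BuildHealthSnapshot()` once per publish and serialize the returned value, without holding any reference into the store.
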
## Producer Expectations

### `RenderEngine`

Expected observations:

- render frame duration
- input upload duration/count/drop/coalescing
- output request latency
- readback duration
- synchronous readback fallback count
- preview present cost/skips
- wrong-thread diagnostics

### `VideoBackend`

Expected observations:

- lifecycle state
- playout queue depth
- output underruns
- late/dropped/flushed/completed counts
- input signal state
- output model/mode status
- spare buffer depth

### `ControlServices`

Expected observations:

- OSC decode errors
- control request failures
- websocket broadcast failures
- ingress queue depth
- file-watch/reload events
- service start/stop state

### `RuntimeCoordinator`

Expected observations:

- rejected mutation count and reasons
- reload requests
- preset failures
- transient-state invalidations
- persistence request publication

### `RuntimeSnapshotProvider`

Expected observations:

- snapshot publish duration
- snapshot version churn
- stale snapshot/fallback behavior
- publish failures

### `PersistenceWriter`

Expected observations:

- pending write count
- coalesced write count
- write duration
- write failure
- unsaved durable changes
- shutdown flush result

## Logging Policy

Direct string logging can remain as an output sink, but not as the source of truth.

Target flow:

```text
subsystem reports structured warning/log
-> HealthTelemetry stores bounded structured entry
-> optional debug sink prints text
-> UI/diagnostics reads health snapshot
```

Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.

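
The deduplication rule above could be sketched as a table keyed by warning key. `WarningTable`, the record fields, and the string key format are assumptions for illustration. The caller passes the timestamp in, which keeps the behaviour deterministic under test. The sketch also assumes keys come from a small fixed vocabulary, since the map itself is not bounded.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

using TelemetryClock = std::chrono::steady_clock;  // assumed clock choice

struct TelemetryWarningRecord {
    std::string message;
    std::uint64_t occurrenceCount = 0;
    TelemetryClock::time_point firstSeen{};
    TelemetryClock::time_point lastSeen{};
    bool active = false;
};

class WarningTable {
public:
    // Raising an existing key bumps its count and last-seen time
    // instead of creating a second entry.
    void RaiseWarning(const std::string& key, const std::string& message,
                      TelemetryClock::time_point now) {
        assert(!key.empty());
        TelemetryWarningRecord& record = records_[key];
        if (record.occurrenceCount == 0) {
            record.firstSeen = now;
        }
        record.message = message;
        record.lastSeen = now;
        record.active = true;
        ++record.occurrenceCount;
    }

    // Clearing deactivates the warning but keeps its history.
    bool ClearWarning(const std::string& key) {
        auto it = records_.find(key);
        if (it == records_.end() || !it->second.active) {
            return false;
        }
        it->second.active = false;
        return true;
    }

    const TelemetryWarningRecord* Find(const std::string& key) const {
        auto it = records_.find(key);
        return it == records_.end() ? nullptr : &it->second;
    }

    std::size_t ActiveCount() const {
        std::size_t count = 0;
        for (const auto& entry : records_) {
            if (entry.second.active) ++count;
        }
        return count;
    }

private:
    std::unordered_map<std::string, TelemetryWarningRecord> records_;
};
```

An optional debug sink could print only when `occurrenceCount` transitions from 0 to 1, or at a throttled interval, so that `OutputDebugStringA` output stays readable during a sustained fault.
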
## Snapshot Contract

`HealthSnapshot` should answer:

- overall health
- subsystem health states
- active warnings
- recent important logs
- key counters
- key timing summaries
- degraded-state reasons

The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by `ControlServices`, but they should remain separate read models.

## Migration Plan

### Step 1. Expand Health Model Types

Add structured subsystem/severity/category types and snapshot models.

Initial target:

- keep existing health fields
- add structured warning/log/counter/gauge containers
- add tests for bounded retention and deduplication

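
Bounded retention from Step 1 could start as a fixed-capacity history that evicts the oldest entry when full. `BoundedHistory` and its method names are hypothetical. The template parameter stands in for a structured entry such as `TelemetryLogEntry`, and the capacity value is left open because retention size is still an open question in this document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

template <typename Entry>
class BoundedHistory {
public:
    explicit BoundedHistory(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Appends the entry, dropping the oldest one once capacity is reached.
    void Append(Entry entry) {
        if (entries_.size() == capacity_) {
            entries_.pop_front();
        }
        entries_.push_back(std::move(entry));
    }

    // Returns up to maxCount of the newest entries, oldest first.
    std::vector<Entry> Recent(std::size_t maxCount) const {
        const std::size_t count = std::min(maxCount, entries_.size());
        const auto first = entries_.end() - static_cast<std::ptrdiff_t>(count);
        return std::vector<Entry>(first, entries_.end());
    }

    std::size_t Size() const { return entries_.size(); }

private:
    std::size_t capacity_;
    std::deque<Entry> entries_;
};
```

A `std::deque` keeps the sketch short; a preallocated ring buffer would avoid allocation on append if this container ever sits on a hot path.
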
### Step 2. Wrap Direct Warning Paths

Route common direct logs through telemetry first.

Initial candidates:

- backend fallback warnings
- screenshot write failures
- OSC decode/dispatch failures
- render-thread request failures

### Step 3. Add Subsystem Health States

Let subsystems report state transitions.

Initial target:

- `RenderEngine`: healthy/degraded on render-thread request failures
- `VideoBackend`: configured/running/degraded/no-input/dropping
- `ControlServices`: running/degraded/stopped
- `Persistence`: clean/pending/error

### Step 4. Split Timing Into Named Metrics

Move from broad timing fields to named samples/gauges.

Initial target:

- render duration
- readback duration/fallback count
- output request latency
- playout completion interval
- event queue depth
- persistence write duration

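
Each named timing metric could keep a small running summary rather than raw samples. The sketch below is a cumulative summary (count, min, max, mean, last), not a rolling window; whether summaries should store raw samples, rolling windows, or both is left open in this document. `TimingMetric`, `TimingSummary`, and the millisecond unit are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

struct TimingSummary {
    std::uint64_t sampleCount = 0;
    double minMs = 0.0;
    double maxMs = 0.0;
    double meanMs = 0.0;
    double lastMs = 0.0;
};

// One instance per named metric, e.g. render duration or readback duration.
// Not thread-safe on its own: assumes a single writer or external locking.
class TimingMetric {
public:
    void RecordTimingSample(double milliseconds) {
        assert(milliseconds >= 0.0);
        ++count_;
        last_ = milliseconds;
        total_ += milliseconds;
        min_ = std::min(min_, milliseconds);
        max_ = std::max(max_, milliseconds);
    }

    // Returns an all-zero summary until the first sample arrives.
    TimingSummary Summarize() const {
        TimingSummary summary;
        if (count_ == 0) {
            return summary;
        }
        summary.sampleCount = count_;
        summary.minMs = min_;
        summary.maxMs = max_;
        summary.meanMs = total_ / static_cast<double>(count_);
        summary.lastMs = last_;
        return summary;
    }

private:
    std::uint64_t count_ = 0;
    double total_ = 0.0;
    double last_ = 0.0;
    double min_ = std::numeric_limits<double>::max();
    double max_ = 0.0;
};
```

A `GetTimingSummary(...)` read operation could return the `TimingSummary` for one metric by value, which keeps the snapshot small regardless of how many samples were recorded.
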
### Step 5. Publish Health Snapshot

Expose the `HealthTelemetry` snapshot through control/runtime presentation.

Initial target:

- UI can distinguish runtime state from operational health
- active warnings are visible
- recent degraded reasons are visible

### Step 6. Add Operational Tests

Cover:

- warning raise/clear
- repeated warning coalescing
- counter/gauge updates
- health derivation
- bounded log retention
- snapshot stability

## Testing Strategy

Recommended tests:

- warning raised appears in active warnings
- warning clear removes active warning but preserves history
- repeated warning increments count and updates last-seen time
- bounded log keeps newest entries
- subsystem `Error` makes overall health `Error`
- subsystem `Degraded` makes overall health `Degraded` if no subsystem reports `Error`
- timing sample updates summary
- counter delta accumulates
- health snapshot is read-only/stable

Useful homes:

- `HealthTelemetryTests`
- `RuntimeEventTypeTests` for observation event payloads
- future integration tests for control-service health publication

## Risks

### Telemetry Becomes Behavior

Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.

### Too Much Hot-Path Cost

Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.

### String-Only Logging

Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.

### Snapshot Bloat

Health snapshots should summarize operational state, not duplicate full runtime/project state.

### Alert Noise

Without deduplication and severity discipline, operator-facing health can become noisy and ignored.

## Phase 8 Exit Criteria

Phase 8 can be considered complete once the project can say:

- [ ] major subsystems publish structured health/telemetry observations
- [ ] active warnings and recent logs are structured and bounded
- [ ] subsystem health states roll up to an overall health state
- [ ] render/backend/control/persistence timing metrics are named and visible
- [ ] direct debug-string warning paths are wrapped or retired for major cases
- [ ] UI/control diagnostics can consume a stable health snapshot
- [ ] telemetry write paths are cheap enough for render/callback use
- [ ] telemetry behavior has focused tests

## Open Questions

- Should debug output remain enabled by default as a telemetry sink?
- How many recent logs/warnings should be retained in memory?
- Should timing summaries store raw samples, rolling windows, or both?
- Should warning thresholds be declared centrally or owned by each subsystem?
- Should health snapshots be published with runtime state or on a separate endpoint/channel?
- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?

## Short Version

Phase 8 should make the app diagnosable.

Subsystems report structured observations. `HealthTelemetry` records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.