Phase 8 Design: Health, Telemetry, And Operational Reporting
This document expands Phase 8 of ARCHITECTURE_RESILIENCE_REVIEW.md into a concrete design target.
Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.
Status
- Phase 8 design package: proposed.
- Phase 8 implementation: not started.
- Current alignment:
  HealthTelemetry exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.
Current telemetry footholds:
- HealthTelemetry owns basic signal, performance, frame pacing, and video IO status reporting.
- RuntimeEventDispatcher publishes typed observations such as timing samples and backend state changes.
- RuntimeStatePresenter includes some health/performance fields in runtime-state output.
- Render and backend paths already collect some timing and late/drop counts.
Why Phase 8 Exists
The app can detect many problems, but operational visibility is still fragmented:
- some failures show modal dialogs
- some warnings go only to OutputDebugStringA
- some timing lives in health telemetry
- some observations are runtime events
- UI-facing state combines operational state with runtime state
- repeated warnings are not uniformly deduplicated, classified, or summarized
Live software needs to answer:
- what is healthy right now?
- what is degraded but still running?
- what recently failed?
- which subsystem is under timing pressure?
- what should an operator see versus what should an engineer debug?
Goals
Phase 8 should establish:
- structured log entries with subsystem, severity, category, timestamp, and message
- subsystem-scoped health states
- bounded recent warning/error history
- timing samples, counters, and gauges for render/control/backend/persistence
- stable health snapshots for UI/diagnostics
- direct debug-output paths wrapped by structured telemetry
- low-overhead reporting from render and callback paths
- tests for severity, deduplication, counters, snapshots, and bounded retention
Non-Goals
Phase 8 should not require:
- a cloud telemetry service
- external metrics database
- a full UI redesign
- automatic recovery policy owned by telemetry
- unbounded logs or time-series storage
- replacing every MessageBoxA on day one
Telemetry observes and reports. It does not become the control plane.
Target Model
Suggested core model:
- TelemetrySubsystem
- TelemetrySeverity
- TelemetryLogEntry
- TelemetryWarningRecord
- TelemetryCounter
- TelemetryGauge
- TelemetryTimingSample
- SubsystemHealthState
- HealthSnapshot
Important distinction:
- raw observations are append/update operations
- health snapshots are derived read models
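Given the Win32 references elsewhere in this document, the codebase is presumably C++. Under that assumption, the core model above might sketch out as follows; the type names come from the list above, but every field is illustrative, not a committed schema:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <vector>

// Type names follow the model list above; fields are illustrative assumptions.
enum class TelemetrySubsystem { ApplicationShell, RenderEngine, VideoBackend,
                                ControlServices, Persistence };
enum class TelemetrySeverity { Info, Warning, Error };
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

struct TelemetryLogEntry {
    TelemetrySubsystem subsystem;
    TelemetrySeverity severity;
    std::string category;   // e.g. "readback", "osc-decode"
    std::string message;
    std::chrono::steady_clock::time_point timestamp;
};

struct TelemetryWarningRecord {
    std::string key;        // deduplication key
    std::uint64_t count = 0;
    std::chrono::steady_clock::time_point lastSeen;
};

// Derived read model: a snapshot built from the raw observations,
// never the observation store itself.
struct HealthSnapshot {
    SubsystemHealthState overall = SubsystemHealthState::Healthy;
    std::vector<TelemetryWarningRecord> activeWarnings;
    std::vector<TelemetryLogEntry> recentLogs;
};
```

The raw-observation types are append/update targets; HealthSnapshot is the only shape readers should ever see.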
Health Domains
At minimum:
- ApplicationShell
- RuntimeStore
- RuntimeCoordinator
- RuntimeSnapshotProvider
- ControlServices
- RenderEngine
- VideoBackend
- Persistence
Suggested states:
- Healthy
- Warning
- Degraded
- Error
- Unavailable
The overall app health should be derived from subsystem states.
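One plausible rollup rule is "worst subsystem state wins". The sketch below assumes the states are declared in best-to-worst order (including the assumption that Unavailable ranks worse than Error, which the document leaves open):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// States declared best-to-worst so relational comparison gives severity order.
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Worst-state-wins rollup: overall health is the worst reported subsystem state.
SubsystemHealthState DeriveOverallHealth(
        const std::vector<SubsystemHealthState>& states) {
    if (states.empty()) return SubsystemHealthState::Healthy;
    return *std::max_element(states.begin(), states.end());
}
```

A fancier rollup (e.g. two Warnings promote to Degraded) can replace this later without changing the read contract.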
Proposed Interfaces
Write Interface
Target operations:
- AppendLog(...)
- RaiseWarning(...)
- ClearWarning(...)
- RecordCounterDelta(...)
- RecordGauge(...)
- RecordTimingSample(...)
- ReportSubsystemState(...)
Hot-path producers should be able to record observations cheaply and return.
Read Interface
Target operations:
- BuildHealthSnapshot()
- GetSubsystemHealth(...)
- GetActiveWarnings()
- GetRecentLogs(...)
- GetTimingSummary(...)
UI/control services should consume snapshots, not scrape subsystem internals.
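One way to keep the write/read split honest is a pair of abstract interfaces, so hot-path producers never see the read side at all. The signatures below are assumptions matching the operation names listed above:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <type_traits>

enum class TelemetrySubsystem { RenderEngine, VideoBackend, ControlServices, Persistence };
enum class TelemetrySeverity { Info, Warning, Error };
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };
struct HealthSnapshot { SubsystemHealthState overall = SubsystemHealthState::Healthy; };

// Write side: cheap append/update operations for producers on hot paths.
class ITelemetryWriter {
public:
    virtual ~ITelemetryWriter() = default;
    virtual void AppendLog(TelemetrySubsystem, TelemetrySeverity, std::string message) = 0;
    virtual void RaiseWarning(TelemetrySubsystem, std::string key, std::string message) = 0;
    virtual void ClearWarning(TelemetrySubsystem, const std::string& key) = 0;
    virtual void RecordCounterDelta(const std::string& name, std::int64_t delta) = 0;
    virtual void RecordGauge(const std::string& name, double value) = 0;
    virtual void RecordTimingSample(const std::string& name, std::chrono::nanoseconds d) = 0;
    virtual void ReportSubsystemState(TelemetrySubsystem, SubsystemHealthState) = 0;
};

// Read side: derived snapshots for UI/diagnostics consumers only.
class ITelemetryReader {
public:
    virtual ~ITelemetryReader() = default;
    virtual HealthSnapshot BuildHealthSnapshot() const = 0;
    virtual SubsystemHealthState GetSubsystemHealth(TelemetrySubsystem) const = 0;
};
```

HealthTelemetry would implement both; producers receive only ITelemetryWriter, so scraping internals is impossible by construction.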
Producer Expectations
RenderEngine
Expected observations:
- render frame duration
- input upload duration/count/drop/coalescing
- output request latency
- readback duration
- synchronous readback fallback count
- preview present cost/skips
- wrong-thread diagnostics
VideoBackend
Expected observations:
- lifecycle state
- playout queue depth
- output underruns
- late/dropped/flushed/completed counts
- input signal state
- output model/mode status
- spare buffer depth
ControlServices
Expected observations:
- OSC decode errors
- control request failures
- websocket broadcast failures
- ingress queue depth
- file-watch/reload events
- service start/stop state
RuntimeCoordinator
Expected observations:
- rejected mutation count and reasons
- reload requests
- preset failures
- transient-state invalidations
- persistence request publication
RuntimeSnapshotProvider
Expected observations:
- snapshot publish duration
- snapshot version churn
- stale snapshot/fallback behavior
- publish failures
PersistenceWriter
Expected observations:
- pending write count
- coalesced write count
- write duration
- write failures
- unsaved durable changes
- shutdown flush result
Logging Policy
Direct string logging can remain as an output sink, but not as the source of truth.
Target flow:
subsystem reports structured warning/log
-> HealthTelemetry stores bounded structured entry
-> optional debug sink prints text
-> UI/diagnostics reads health snapshot
Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.
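A minimal sketch of that deduplication, keyed by a warning key, with counts and last-seen timestamps preserved (names and the clear-erases-record behavior are illustrative, not committed; "preserving history" on clear would need a separate history store):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Dedup sketch: repeated warnings with the same key update one record,
// bumping count and last-seen time instead of appending a new entry.
struct WarningRecord {
    std::string message;
    std::uint64_t count = 0;
    std::chrono::steady_clock::time_point lastSeen;
};

class WarningTable {
public:
    void Raise(const std::string& key, const std::string& message) {
        auto& rec = records_[key];
        rec.message = message;   // keep the latest text for this key
        rec.count += 1;
        rec.lastSeen = std::chrono::steady_clock::now();
    }
    void Clear(const std::string& key) { records_.erase(key); }
    std::uint64_t Count(const std::string& key) const {
        auto it = records_.find(key);
        return it == records_.end() ? 0 : it->second.count;
    }
private:
    std::map<std::string, WarningRecord> records_;
};
```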
Snapshot Contract
HealthSnapshot should answer:
- overall health
- subsystem health states
- active warnings
- recent important logs
- key counters
- key timing summaries
- degraded-state reasons
The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by ControlServices, but they should remain separate read models.
Migration Plan
Step 1. Expand Health Model Types
Add structured subsystem/severity/category types and snapshot models.
Initial target:
- keep existing health fields
- add structured warning/log/counter/gauge containers
- add tests for bounded retention and deduplication
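Bounded retention can be as simple as a capped deque that evicts the oldest entry on overflow; this is a sketch with illustrative names, not the required container choice:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Bounded retention sketch: keep only the newest maxEntries log lines.
class BoundedLog {
public:
    explicit BoundedLog(std::size_t maxEntries) : maxEntries_(maxEntries) {}
    void Append(std::string entry) {
        entries_.push_back(std::move(entry));
        if (entries_.size() > maxEntries_) entries_.pop_front();  // evict oldest
    }
    const std::deque<std::string>& Entries() const { return entries_; }
private:
    std::size_t maxEntries_;
    std::deque<std::string> entries_;
};
```

The retention test in Step 1 then reduces to: append past capacity and assert only the newest entries survive.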
Step 2. Wrap Direct Warning Paths
Route common direct logs through telemetry first.
Initial candidates:
- backend fallback warnings
- screenshot write failures
- OSC decode/dispatch failures
- render-thread request failures
Step 3. Add Subsystem Health States
Let subsystems report state transitions.
Initial target:
- RenderEngine: healthy/degraded on render-thread request failures
- VideoBackend: configured/running/degraded/no-input/dropping
- ControlServices: running/degraded/stopped
- Persistence: clean/pending/error
Step 4. Split Timing Into Named Metrics
Move from broad timing fields to named samples/gauges.
Initial target:
- render duration
- readback duration/fallback count
- output request latency
- playout completion interval
- event queue depth
- persistence write duration
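One cheap shape for named samples is a per-metric summary of count/total/max, so the read side can report averages without retaining raw samples (whether raw samples or rolling windows are also kept is an open question below; all names here are illustrative):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Named timing sketch: each metric keeps count/total/max so summaries
// can be derived on read without storing raw samples.
struct TimingSummary {
    std::uint64_t count = 0;
    std::chrono::nanoseconds total{0};
    std::chrono::nanoseconds max{0};
    double AverageMs() const {
        return count == 0 ? 0.0 : (total.count() / 1e6) / static_cast<double>(count);
    }
};

class TimingRegistry {
public:
    void Record(const std::string& name, std::chrono::nanoseconds sample) {
        auto& s = summaries_[name];
        s.count += 1;
        s.total += sample;
        if (sample > s.max) s.max = sample;
    }
    const TimingSummary& Get(const std::string& name) { return summaries_[name]; }
private:
    std::map<std::string, TimingSummary> summaries_;
};
```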
Step 5. Publish Health Snapshot
Expose HealthTelemetry snapshot through control/runtime presentation.
Initial target:
- UI can distinguish runtime state from operational health
- active warnings are visible
- recent degraded reasons are visible
Step 6. Add Operational Tests
Cover:
- warning raise/clear
- repeated warning coalescing
- counter/gauge updates
- health derivation
- bounded log retention
- snapshot stability
Testing Strategy
Recommended tests:
- warning raised appears in active warnings
- warning clear removes active warning but preserves history
- repeated warning increments count and updates last-seen time
- bounded log keeps newest entries
- subsystem Error makes overall health Error
- subsystem Degraded makes overall health Degraded if no error exists
- timing sample updates summary
- counter delta accumulates
- health snapshot is read-only/stable
Useful homes:
- HealthTelemetryTests
- RuntimeEventTypeTests for observation event payloads
- future integration tests for control-service health publication
Risks
Telemetry Becomes Behavior
Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.
Too Much Hot-Path Cost
Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.
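The cheapest write is a relaxed atomic bump against a pre-registered counter, with all formatting and aggregation deferred to the read side; a minimal sketch of that idea (names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hot-path sketch: producers bump a pre-registered atomic counter and
// return; string formatting happens only when a snapshot is built.
class HotCounter {
public:
    void Add(std::int64_t delta) {
        value_.fetch_add(delta, std::memory_order_relaxed);  // lock-free, no allocation
    }
    std::int64_t Load() const {
        return value_.load(std::memory_order_relaxed);
    }
private:
    std::atomic<std::int64_t> value_{0};
};
```

Relaxed ordering is sufficient here because counters carry no synchronization meaning; readers only need an eventually consistent total.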
String-Only Logging
Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.
Snapshot Bloat
Health snapshots should summarize operational state, not duplicate full runtime/project state.
Alert Noise
Without deduplication and severity discipline, operator-facing health can become noisy and ignored.
Phase 8 Exit Criteria
Phase 8 can be considered complete once the project can say:
- major subsystems publish structured health/telemetry observations
- active warnings and recent logs are structured and bounded
- subsystem health states roll up to an overall health state
- render/backend/control/persistence timing metrics are named and visible
- direct debug-string warning paths are wrapped or retired for major cases
- UI/control diagnostics can consume a stable health snapshot
- telemetry write paths are cheap enough for render/callback use
- telemetry behavior has focused tests
Open Questions
- Should debug output remain enabled by default as a telemetry sink?
- How many recent logs/warnings should be retained in memory?
- Should timing summaries store raw samples, rolling windows, or both?
- Should warning thresholds be declared centrally or owned by each subsystem?
- Should health snapshots be published with runtime state or on a separate endpoint/channel?
- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?
Short Version
Phase 8 should make the app diagnosable.
Subsystems report structured observations. HealthTelemetry records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.