Phase 8 Design: Health, Telemetry, And Operational Reporting
This document expands Phase 8 of ARCHITECTURE_RESILIENCE_REVIEW.md into a concrete design target.
Earlier phases clarify subsystem ownership, state layering, render-thread ownership, persistence, and backend lifecycle. Phase 8 should make operational visibility match that architecture: structured health state, timing, counters, warnings, and logs should flow through one telemetry subsystem instead of scattered debug strings and ad hoc status fields.
Status
- Phase 8 design package: proposed.
- Phase 8 implementation: not started.
- Current alignment:
  HealthTelemetry exists and already receives some render, signal, video IO, and pacing observations. Runtime events also carry some timing and backend observations. The remaining work is to make health/telemetry structured, comprehensive, bounded, and operator-facing.
Current telemetry footholds:
- HealthTelemetry owns basic signal, performance, frame pacing, and video IO status reporting.
- RuntimeEventDispatcher publishes typed observations such as timing samples and backend state changes.
- RuntimeStatePresenter includes some health/performance fields in runtime-state output.
- Render and backend paths already collect some timing and late/drop counts.
Why Phase 8 Exists
The app can detect many problems, but operational visibility is still fragmented:
- some failures show modal dialogs
- some warnings go only to OutputDebugStringA
- some timing lives in health telemetry
- some observations are runtime events
- UI-facing state combines operational state with runtime state
- repeated warnings are not uniformly deduplicated, classified, or summarized
Live software needs to answer:
- what is healthy right now?
- what is degraded but still running?
- what recently failed?
- which subsystem is under timing pressure?
- what should an operator see versus what should an engineer debug?
Goals
Phase 8 should establish:
- structured log entries with subsystem, severity, category, timestamp, and message
- subsystem-scoped health states
- bounded recent warning/error history
- timing samples, counters, and gauges for render/control/backend/persistence
- stable health snapshots for UI/diagnostics
- direct debug-output paths wrapped by structured telemetry
- low-overhead reporting from render and callback paths
- tests for severity, deduplication, counters, snapshots, and bounded retention
Non-Goals
Phase 8 should not require:
- a cloud telemetry service
- external metrics database
- a full UI redesign
- automatic recovery policy owned by telemetry
- unbounded logs or time-series storage
- replacing every MessageBoxA on day one
Telemetry observes and reports. It does not become the control plane.
Target Model
Suggested core model:
- TelemetrySubsystem
- TelemetrySeverity
- TelemetryLogEntry
- TelemetryWarningRecord
- TelemetryCounter
- TelemetryGauge
- TelemetryTimingSample
- SubsystemHealthState
- HealthSnapshot
Important distinction:
- raw observations are append/update operations
- health snapshots are derived read models
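Given the Win32 references elsewhere in this document, the codebase is presumably C++. Under that assumption, the core model above might sketch out as follows; the type names come from the list above, but every field is illustrative, not a committed schema:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <vector>

// Type names follow the model list above; fields are illustrative assumptions.
enum class TelemetrySubsystem { ApplicationShell, RenderEngine, VideoBackend,
                                ControlServices, Persistence };
enum class TelemetrySeverity { Info, Warning, Error };
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

struct TelemetryLogEntry {
    TelemetrySubsystem subsystem;
    TelemetrySeverity severity;
    std::string category;   // e.g. "readback", "osc-decode"
    std::string message;
    std::chrono::steady_clock::time_point timestamp;
};

struct TelemetryWarningRecord {
    std::string key;        // deduplication key
    std::uint64_t count = 0;
    std::chrono::steady_clock::time_point lastSeen;
};

// Derived read model: a snapshot built from the raw observations,
// never the observation store itself.
struct HealthSnapshot {
    SubsystemHealthState overall = SubsystemHealthState::Healthy;
    std::vector<TelemetryWarningRecord> activeWarnings;
    std::vector<TelemetryLogEntry> recentLogs;
};
```

The raw-observation types are append/update targets; HealthSnapshot is the only shape readers should ever see.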
Health Domains
At minimum:
- ApplicationShell
- RuntimeStore
- RuntimeCoordinator
- RuntimeSnapshotProvider
- ControlServices
- RenderEngine
- VideoBackend
- Persistence
Suggested states:
- Healthy
- Warning
- Degraded
- Error
- Unavailable
The overall app health should be derived from subsystem states.
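One plausible rollup rule is "worst subsystem state wins". The sketch below assumes the states are declared in best-to-worst order (including the assumption that Unavailable ranks worse than Error, which the document leaves open):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// States declared best-to-worst so relational comparison gives severity order.
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Worst-state-wins rollup: overall health is the worst reported subsystem state.
SubsystemHealthState DeriveOverallHealth(
        const std::vector<SubsystemHealthState>& states) {
    if (states.empty()) return SubsystemHealthState::Healthy;
    return *std::max_element(states.begin(), states.end());
}
```

A fancier rollup (e.g. two Warnings promote to Degraded) can replace this later without changing the read contract.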
Proposed Interfaces
Write Interface
Target operations:
- AppendLog(...)
- RaiseWarning(...)
- ClearWarning(...)
- RecordCounterDelta(...)
- RecordGauge(...)
- RecordTimingSample(...)
- ReportSubsystemState(...)
Hot-path producers should be able to record observations cheaply and return.
Read Interface
Target operations:
- BuildHealthSnapshot()
- GetSubsystemHealth(...)
- GetActiveWarnings()
- GetRecentLogs(...)
- GetTimingSummary(...)
UI/control services should consume snapshots, not scrape subsystem internals.
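One way to keep the write/read split honest is a pair of abstract interfaces, so hot-path producers never see the read side at all. The signatures below are assumptions matching the operation names listed above:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <type_traits>

enum class TelemetrySubsystem { RenderEngine, VideoBackend, ControlServices, Persistence };
enum class TelemetrySeverity { Info, Warning, Error };
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };
struct HealthSnapshot { SubsystemHealthState overall = SubsystemHealthState::Healthy; };

// Write side: cheap append/update operations for producers on hot paths.
class ITelemetryWriter {
public:
    virtual ~ITelemetryWriter() = default;
    virtual void AppendLog(TelemetrySubsystem, TelemetrySeverity, std::string message) = 0;
    virtual void RaiseWarning(TelemetrySubsystem, std::string key, std::string message) = 0;
    virtual void ClearWarning(TelemetrySubsystem, const std::string& key) = 0;
    virtual void RecordCounterDelta(const std::string& name, std::int64_t delta) = 0;
    virtual void RecordGauge(const std::string& name, double value) = 0;
    virtual void RecordTimingSample(const std::string& name, std::chrono::nanoseconds d) = 0;
    virtual void ReportSubsystemState(TelemetrySubsystem, SubsystemHealthState) = 0;
};

// Read side: derived snapshots for UI/diagnostics consumers only.
class ITelemetryReader {
public:
    virtual ~ITelemetryReader() = default;
    virtual HealthSnapshot BuildHealthSnapshot() const = 0;
    virtual SubsystemHealthState GetSubsystemHealth(TelemetrySubsystem) const = 0;
};
```

HealthTelemetry would implement both; producers receive only ITelemetryWriter, so scraping internals is impossible by construction.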
Producer Expectations
RenderEngine
Expected observations:
- render frame duration
- input upload duration/count/drop/coalescing
- output request latency
- readback duration
- synchronous readback fallback count
- preview present cost/skips
- wrong-thread diagnostics
VideoBackend
Expected observations:
- lifecycle state
- playout queue depth
- output underruns
- late/dropped/flushed/completed counts
- input signal state
- output model/mode status
- spare buffer depth
ControlServices
Expected observations:
- OSC decode errors
- control request failures
- websocket broadcast failures
- ingress queue depth
- file-watch/reload events
- service start/stop state
RuntimeCoordinator
Expected observations:
- rejected mutation count and reasons
- reload requests
- preset failures
- transient-state invalidations
- persistence request publication
RuntimeSnapshotProvider
Expected observations:
- snapshot publish duration
- snapshot version churn
- stale snapshot/fallback behavior
- publish failures
PersistenceWriter
Expected observations:
- pending write count
- coalesced write count
- write duration
- write failures
- unsaved durable changes
- shutdown flush result
Logging Policy
Direct string logging can remain as an output sink, but not as the source of truth.
Target flow:
subsystem reports structured warning/log
-> HealthTelemetry stores bounded structured entry
-> optional debug sink prints text
-> UI/diagnostics reads health snapshot
Repeated warnings should be deduplicated by key while preserving counts and last-seen timestamps.
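A minimal sketch of that deduplication, keyed by a warning key, with counts and last-seen timestamps preserved (names and the clear-erases-record behavior are illustrative, not committed; "preserving history" on clear would need a separate history store):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Dedup sketch: repeated warnings with the same key update one record,
// bumping count and last-seen time instead of appending a new entry.
struct WarningRecord {
    std::string message;
    std::uint64_t count = 0;
    std::chrono::steady_clock::time_point lastSeen;
};

class WarningTable {
public:
    void Raise(const std::string& key, const std::string& message) {
        auto& rec = records_[key];
        rec.message = message;   // keep the latest text for this key
        rec.count += 1;
        rec.lastSeen = std::chrono::steady_clock::now();
    }
    void Clear(const std::string& key) { records_.erase(key); }
    std::uint64_t Count(const std::string& key) const {
        auto it = records_.find(key);
        return it == records_.end() ? 0 : it->second.count;
    }
private:
    std::map<std::string, WarningRecord> records_;
};
```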
Snapshot Contract
HealthSnapshot should answer:
- overall health
- subsystem health states
- active warnings
- recent important logs
- key counters
- key timing summaries
- degraded-state reasons
The snapshot should avoid copying durable runtime truth. Runtime state and health state can be published together by ControlServices, but they should remain separate read models.
Migration Plan
Step 1. Expand Health Model Types
Add structured subsystem/severity/category types and snapshot models.
Initial target:
- keep existing health fields
- add structured warning/log/counter/gauge containers
- add tests for bounded retention and deduplication
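Bounded retention can be as simple as a capped deque that evicts the oldest entry on overflow; this is a sketch with illustrative names, not the required container choice:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Bounded retention sketch: keep only the newest maxEntries log lines.
class BoundedLog {
public:
    explicit BoundedLog(std::size_t maxEntries) : maxEntries_(maxEntries) {}
    void Append(std::string entry) {
        entries_.push_back(std::move(entry));
        if (entries_.size() > maxEntries_) entries_.pop_front();  // evict oldest
    }
    const std::deque<std::string>& Entries() const { return entries_; }
private:
    std::size_t maxEntries_;
    std::deque<std::string> entries_;
};
```

The retention test in Step 1 then reduces to: append past capacity and assert only the newest entries survive.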
Step 2. Wrap Direct Warning Paths
Route common direct logs through telemetry first.
Initial candidates:
- backend fallback warnings
- screenshot write failures
- OSC decode/dispatch failures
- render-thread request failures
Step 3. Add Subsystem Health States
Let subsystems report state transitions.
Initial target:
- RenderEngine: healthy/degraded on render-thread request failures
- VideoBackend: configured/running/degraded/no-input/dropping
- ControlServices: running/degraded/stopped
- Persistence: clean/pending/error
Step 4. Split Timing Into Named Metrics
Move from broad timing fields to named samples/gauges.
Initial target:
- render duration
- readback duration/fallback count
- output request latency
- playout completion interval
- event queue depth
- persistence write duration
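One cheap shape for named samples is a per-metric summary of count/total/max, so the read side can report averages without retaining raw samples (whether raw samples or rolling windows are also kept is an open question below; all names here are illustrative):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Named timing sketch: each metric keeps count/total/max so summaries
// can be derived on read without storing raw samples.
struct TimingSummary {
    std::uint64_t count = 0;
    std::chrono::nanoseconds total{0};
    std::chrono::nanoseconds max{0};
    double AverageMs() const {
        return count == 0 ? 0.0 : (total.count() / 1e6) / static_cast<double>(count);
    }
};

class TimingRegistry {
public:
    void Record(const std::string& name, std::chrono::nanoseconds sample) {
        auto& s = summaries_[name];
        s.count += 1;
        s.total += sample;
        if (sample > s.max) s.max = sample;
    }
    const TimingSummary& Get(const std::string& name) { return summaries_[name]; }
private:
    std::map<std::string, TimingSummary> summaries_;
};
```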
Step 5. Publish Health Snapshot
Expose HealthTelemetry snapshot through control/runtime presentation.
Initial target:
- UI can distinguish runtime state from operational health
- active warnings are visible
- recent degraded reasons are visible
Step 6. Add Operational Tests
Cover:
- warning raise/clear
- repeated warning coalescing
- counter/gauge updates
- health derivation
- bounded log retention
- snapshot stability
Testing Strategy
Recommended tests:
- warning raised appears in active warnings
- warning clear removes active warning but preserves history
- repeated warning increments count and updates last-seen time
- bounded log keeps newest entries
- subsystem Error makes overall health Error
- subsystem Degraded makes overall health Degraded if no error exists
- timing sample updates summary
- counter delta accumulates
- health snapshot is read-only/stable
Useful homes:
- HealthTelemetryTests
- RuntimeEventTypeTests for observation event payloads
- future integration tests for control-service health publication
Risks
Telemetry Becomes Behavior
Telemetry must not become the hidden way subsystems command each other. It reports. Subsystems own mitigation.
Too Much Hot-Path Cost
Render and callback paths need cheap writes. Use bounded structures and avoid expensive formatting on hot paths.
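The cheapest write is a relaxed atomic bump against a pre-registered counter, with all formatting and aggregation deferred to the read side; a minimal sketch of that idea (names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hot-path sketch: producers bump a pre-registered atomic counter and
// return; string formatting happens only when a snapshot is built.
class HotCounter {
public:
    void Add(std::int64_t delta) {
        value_.fetch_add(delta, std::memory_order_relaxed);  // lock-free, no allocation
    }
    std::int64_t Load() const {
        return value_.load(std::memory_order_relaxed);
    }
private:
    std::atomic<std::int64_t> value_{0};
};
```

Relaxed ordering is sufficient here because counters carry no synchronization meaning; readers only need an eventually consistent total.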
String-Only Logging
Centralizing strings is not enough. Severity, subsystem, category, and structured fields should be first-class.
Snapshot Bloat
Health snapshots should summarize operational state, not duplicate full runtime/project state.
Alert Noise
Without deduplication and severity discipline, operator-facing health can become noisy and ignored.
Phase 8 Exit Criteria
Phase 8 can be considered complete once the project can say:
- major subsystems publish structured health/telemetry observations
- active warnings and recent logs are structured and bounded
- subsystem health states roll up to an overall health state
- render/backend/control/persistence timing metrics are named and visible
- direct debug-string warning paths are wrapped or retired for major cases
- UI/control diagnostics can consume a stable health snapshot
- telemetry write paths are cheap enough for render/callback use
- telemetry behavior has focused tests
Open Questions
- Should debug output remain enabled by default as a telemetry sink?
- How many recent logs/warnings should be retained in memory?
- Should timing summaries store raw samples, rolling windows, or both?
- Should warning thresholds be declared centrally or owned by each subsystem?
- Should health snapshots be published with runtime state or on a separate endpoint/channel?
- Should logs eventually be written to disk, and if so, through Phase 6 persistence infrastructure or a separate log sink?
Short Version
Phase 8 should make the app diagnosable.
Subsystems report structured observations. HealthTelemetry records bounded logs, warnings, counters, gauges, timing, and subsystem states. UI and diagnostics consume stable health snapshots. Debug strings become a sink, not the source of truth.