# HealthTelemetry Subsystem Design
This document expands the `HealthTelemetry` subsystem introduced in [PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md).
`HealthTelemetry` is the subsystem that owns operational visibility for the app. Its purpose is to gather health state, warnings, counters, logs, and timing observations from the other subsystems and publish them in a structured way without becoming a second control plane.
Before the Phase 1 runtime split, those responsibilities were fragmented across `RuntimeHost` status setters, ad hoc `OutputDebugStringA` calls, callback-local warnings, and UI-facing runtime-state payloads. The result was that the app could often detect problems, but did not yet have one clear place that answered:
- what is healthy right now
- what is degraded right now
- what has recently gone wrong
- which subsystem is under pressure
- how timing behavior is trending over time
`HealthTelemetry` is the target boundary that should answer those questions.
## Why This Subsystem Exists
The codebase already contains meaningful health and timing signals, but some are still spread through unrelated ownership domains:
- previous `RuntimeHost` status fields stored signal and timing status:
- `RuntimeHost.h`
- `RuntimeHost.cpp`
- render and bridge code historically reported timing by writing back into `RuntimeHost`:
- [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:50)
- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:49)
- backend warning paths still log directly:
- [DeckLinkFrameTransfer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkFrameTransfer.cpp:84)
- [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:305)
- control ingress failures still log directly:
- [OscServer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/OscServer.cpp:142)
- [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:100)
This creates several recurring problems:
- health information shares storage and lock scope with runtime state
- warnings are not consistently classified by subsystem or severity
- timing data is hard to compare across render, control, and backend paths
- UI connection state and operational state are too closely coupled
- logging is mostly text-first instead of structured-first
- recovery behavior is hard to audit because the app does not retain a coherent health snapshot
`HealthTelemetry` exists so timing and health concerns have one subsystem whose only job is observation and reporting, instead of drifting back into runtime storage, callback-local logging, or UI payload assembly.
## Design Goals
`HealthTelemetry` should optimize for:
- one authoritative home for operational visibility
- structured health state per subsystem
- timing and counter recording that does not require a UI to be connected
- low-friction reporting from render, backend, coordinator, and services
- explicit degraded-mode reporting instead of only raw text logs
- support for live operator summaries and deeper engineering diagnostics
- minimal risk of telemetry writes becoming a render or callback bottleneck
## Responsibilities
`HealthTelemetry` owns structured operational visibility.
Primary responsibilities:
- accept timing samples from major subsystems
- accept counter deltas and point-in-time gauges
- accept warning, error, and degraded-state transitions
- collect subsystem-scoped health state
- collect operator-visible summary state
- collect structured log entries
- build stable health snapshots for UI, diagnostics, and later persistence/export if desired
- retain recent history needed for short-term troubleshooting
- classify observations by subsystem, severity, and category
Secondary responsibilities that still fit here:
- smoothing or rolling-window summaries for timing metrics
- mapping raw subsystem observations into operator-facing health summaries
- deduplicating repeated warnings
- tracking warning open/clear lifecycles
- providing bounded in-memory history for recent logs and warning transitions
## Explicit Non-Responsibilities
`HealthTelemetry` should not become a behavior owner.
It does not own:
- layer stack truth
- persistence policy
- render scheduling
- DeckLink scheduling
- OSC buffering or routing
- reload coordination
- shader compilation
- recovery actions themselves
It also should not decide:
- whether render should skip a frame
- whether VideoBackend should increase queue depth
- whether RuntimeCoordinator should reject a mutation
- whether ControlServices should drop or coalesce ingress traffic
Those decisions belong to the subsystem being observed. `HealthTelemetry` may describe that a subsystem is degraded, but it must not quietly become the mechanism that tells the app how to react.
## Ownership Boundaries
`HealthTelemetry` owns the following state categories.
### Structured Log State
Examples:
- subsystem name
- severity
- category
- timestamp
- message
- optional structured fields such as layer id, preset name, queue depth, or shader id
This replaces the idea that `OutputDebugStringA` text is itself the main diagnostic product.
### Warning And Error State
Examples:
- active warning set
- warning occurrence counts
- first-seen and last-seen timestamps
- clear timestamps
- subsystem-scoped degraded flags
This is the durable in-memory operational state that should answer "what is currently wrong?" even if no UI was connected when the warning was raised.
### Timing State
Examples:
- render duration
- frame budget
- playout completion interval
- smoothed completion interval
- queue depth
- input upload skip count
- async readback fallback count
- control ingress lag or queue depth
- snapshot publication cost
This state should be organized as time-series-like rolling telemetry, not as a grab bag of unrelated `double` fields mixed into the runtime store.
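To make "time-series-like rolling telemetry" concrete, here is a minimal sketch of a bounded timing series, under the assumption that a fixed-size recent window plus a smoothed value is enough for Phase 1; the type and field names are hypothetical, not existing code.
```cpp
#include <cstddef>
#include <deque>

// Hypothetical rolling series for one timing metric (e.g. render duration):
// a bounded window of recent samples plus a smoothed value, instead of a
// single double field mixed into the runtime store.
struct TelemetryTimingSeries
{
    static constexpr std::size_t kMaxSamples = 256;

    std::deque<double> recentMilliseconds;  // bounded recent history
    double smoothedMilliseconds = 0.0;      // exponential moving average
    double smoothingFactor = 0.1;

    void AddSample(double milliseconds)
    {
        recentMilliseconds.push_back(milliseconds);
        if (recentMilliseconds.size() > kMaxSamples)
        {
            recentMilliseconds.pop_front();
        }
        smoothedMilliseconds = smoothingFactor * milliseconds +
                               (1.0 - smoothingFactor) * smoothedMilliseconds;
    }
};
```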
### Health Snapshot State
Examples:
- current subsystem health summaries
- current operator-facing overall health summary
- most recent warning list
- recent counters and timing summaries
- "degraded but still running" status
This is the material that `ControlServices` or a diagnostics endpoint may later publish.
## State Model
The subsystem should model health and telemetry in a way that supports both machine-friendly and operator-friendly views.
Suggested conceptual model:
- `TelemetryLogEntry`
- `TelemetryWarningRecord`
- `TelemetryCounterState`
- `TelemetryGaugeState`
- `TelemetryTimingSeries`
- `SubsystemHealthState`
- `HealthSnapshot`
Important distinction:
- raw observations are append/update operations
- health snapshots are derived read models
That distinction matters because the system should be able to retain richer recent telemetry internally than what is necessarily sent to the UI on every refresh.
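To illustrate that distinction, a minimal sketch follows: a raw counter delta that producers append, and a derived counter state that only exists in read models. The field names are illustrative, not existing code.
```cpp
#include <cstdint>
#include <string>

// Hypothetical raw observation: producers append deltas as events happen.
struct TelemetryCounterDelta
{
    std::string counterName;  // e.g. "videobackend.dropped_frames"
    std::int64_t delta = 1;
};

// Hypothetical derived read model: totals accumulated from raw deltas and
// exposed through snapshots, never written to directly by producers.
struct TelemetryCounterState
{
    std::string counterName;
    std::int64_t total = 0;
    std::int64_t sinceLastSnapshot = 0;
};
```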
## Subsystem Health Domains
`HealthTelemetry` should track health by subsystem rather than as one flat status blob.
At minimum, Phase 1 should assume domains for:
- `RuntimeStore`
- `RuntimeCoordinator`
- `RuntimeSnapshotProvider`
- `ControlServices`
- `RenderEngine`
- `VideoBackend`
Optional cross-cutting domain:
- `ApplicationShell`
Each domain should be able to express states such as:
- `Healthy`
- `Warning`
- `Degraded`
- `Error`
- `Unavailable`
The exact enum can change, but the design should preserve the idea that each subsystem reports into its own health lane first, and only then is an overall status derived.
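A minimal sketch of the two enumerations implied above, using the domain and state names from the lists; the exact identifiers remain open to change.
```cpp
// Hypothetical vocabulary mirroring the domain and state lists above.
enum class TelemetrySubsystem
{
    RuntimeStore,
    RuntimeCoordinator,
    RuntimeSnapshotProvider,
    ControlServices,
    RenderEngine,
    VideoBackend,
    ApplicationShell,  // optional cross-cutting domain
};

enum class SubsystemHealthState
{
    Healthy,
    Warning,
    Degraded,
    Error,
    Unavailable,
};
```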
## Logging Boundaries
Logging belongs here, but logging should be structured-first.
Expected inputs:
- subsystem-scoped debug information
- warning and error messages
- recovery events
- notable state transitions
- significant operator actions that matter for diagnostics
Expected design rules:
- textual messages are still useful, but they should be wrapped in a structured log entry
- repeated transient failures should be rate-limited or deduplicated at the telemetry layer where possible
- log storage should be bounded in memory
- UI publication should read from health/log snapshots, not scrape stdout/debug output
Examples of current direct log paths that should eventually move behind `HealthTelemetry`:
- OSC decode/dispatch failures
- screenshot write failures
- DeckLink fallback warnings
- late/dropped frame warnings
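As a hypothetical illustration of the structured-first rule, the sketch below shows how an OSC decode failure might be captured as a structured entry rather than a raw debug string; the field and helper names are assumptions, not code that exists today.
```cpp
#include <chrono>
#include <string>

// Trimmed stand-ins for the shared telemetry vocabulary.
enum class TelemetrySeverity { Debug, Info, Warning, Error };
enum class TelemetrySubsystem { ControlServices /* , ... */ };

// Hypothetical structured log entry: the text message is one field among
// several, not the whole diagnostic product.
struct TelemetryLogEntry
{
    TelemetrySubsystem subsystem;
    TelemetrySeverity severity;
    std::string category;   // e.g. "osc.decode"
    std::string message;    // human-readable text
    std::string detail;     // optional structured detail, e.g. the offending address
    std::chrono::system_clock::time_point timestamp;
};

// Hypothetical producer-side construction for an OSC decode failure.
TelemetryLogEntry MakeOscDecodeFailureEntry(const std::string& address)
{
    return TelemetryLogEntry{
        TelemetrySubsystem::ControlServices,
        TelemetrySeverity::Warning,
        "osc.decode",
        "Failed to decode OSC message",
        address,
        std::chrono::system_clock::now(),
    };
}
```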
## Metrics And Timing Boundaries
Timing and metrics should also move here, but their ownership line matters.
`HealthTelemetry` should own:
- metric collection interfaces
- rolling summaries
- recent history buffers
- warning thresholds if the app later chooses to define them declaratively
- operator-facing derived summaries
The producing subsystem should still own:
- the meaning of the measurement
- when it is sampled
- whether it triggers local mitigation
Examples:
- `RenderEngine` owns when render duration is sampled
- `VideoBackend` owns when queue depth or playout lateness is sampled
- `ControlServices` owns when ingress backlog is sampled
- `RuntimeSnapshotProvider` owns when snapshot publish/build timing is sampled
`HealthTelemetry` should not invent those timings by inference. It records them when producers report them.
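A hypothetical producer-side sketch of that split: `RenderEngine` decides when and what to sample and simply reports the observation. The sample type and interface here are trimmed stand-ins for the shapes proposed in the next section.
```cpp
#include <chrono>

// Trimmed stand-ins for the write-side shapes proposed in the next section.
struct TelemetryTimingSample
{
    const char* metricName;
    double valueMilliseconds;
};

class IHealthTelemetry
{
public:
    virtual ~IHealthTelemetry() = default;
    virtual void RecordTimingSample(const TelemetryTimingSample& sample) = 0;
};

// Hypothetical render-loop excerpt: the producer owns the measurement and when
// it is taken; telemetry only records what it is told.
void RenderOneFrame(IHealthTelemetry& telemetry)
{
    const auto start = std::chrono::steady_clock::now();

    // ... actual render work would happen here ...

    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    telemetry.RecordTimingSample({"render.frame_ms", elapsed.count()});
}
```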
## Proposed Interfaces
These are target-shape interfaces, not final signatures.
### Write/Record Interface
Core write-side operations could look like:
```cpp
#include <string_view>

enum class TelemetrySeverity;
enum class TelemetrySubsystem;
enum class SubsystemHealthState;

struct TelemetryLogEntry;
struct TelemetryWarning;
struct TelemetryTimingSample;
struct TelemetryCounterDelta;
struct TelemetryGaugeUpdate;

class IHealthTelemetry
{
public:
    virtual ~IHealthTelemetry() = default;

    virtual void AppendLogEntry(const TelemetryLogEntry& entry) = 0;
    virtual void RaiseWarning(const TelemetryWarning& warning) = 0;
    virtual void ClearWarning(std::string_view warningKey) = 0;
    virtual void RecordTimingSample(const TelemetryTimingSample& sample) = 0;
    virtual void RecordCounterDelta(const TelemetryCounterDelta& delta) = 0;
    virtual void RecordGauge(const TelemetryGaugeUpdate& gauge) = 0;
    virtual void ReportSubsystemState(TelemetrySubsystem subsystem,
                                      SubsystemHealthState state) = 0;
};
```
The key is that every subsystem should be able to publish observations without also needing to know how UI payloads, rolling summaries, or log retention are implemented.
### Read Interface
Expected read-side operations:
- `BuildHealthSnapshot()`
- `GetSubsystemHealth(...)`
- `GetRecentLogs(...)`
- `GetActiveWarnings()`
- `GetRecentTimingSummary(...)`
Design notes:
- the read interface should return stable snapshots or read models
- UI/websocket publication should consume those snapshots through `ControlServices`
- read-side access should not require direct knowledge of internal ring buffers or lock layout
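A minimal read-side sketch matching those operations, with placeholder read-model types standing in for the fuller state model described earlier; none of these signatures are final.
```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Placeholder read models; real definitions would live with the state model.
enum class TelemetrySubsystem { RenderEngine, VideoBackend /* , ... */ };
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };
struct TelemetryLogEntry { std::string message; };
struct TelemetryWarningRecord { std::string key; };
struct TelemetryTimingSummary { double averageMilliseconds = 0.0; };
struct HealthSnapshot { SubsystemHealthState overall = SubsystemHealthState::Healthy; };

// Hypothetical read-side interface: every call returns a stable, self-contained
// read model, so callers never depend on internal ring buffers or lock layout.
class IHealthTelemetryReader
{
public:
    virtual ~IHealthTelemetryReader() = default;

    virtual HealthSnapshot BuildHealthSnapshot() const = 0;
    virtual SubsystemHealthState GetSubsystemHealth(TelemetrySubsystem subsystem) const = 0;
    virtual std::vector<TelemetryLogEntry> GetRecentLogs(std::size_t maxEntries) const = 0;
    virtual std::vector<TelemetryWarningRecord> GetActiveWarnings() const = 0;
    virtual TelemetryTimingSummary GetRecentTimingSummary(std::string_view metricName) const = 0;
};
```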
## Producer Expectations By Subsystem
The parent Phase 1 design already allows multiple subsystems to publish into telemetry. This section makes that concrete.
### From `RuntimeCoordinator`
Expected observations:
- mutation rejected
- reload requested
- preset apply failed
- transient state cleared due to compatibility rules
- policy-driven degraded notices such as repeated invalid external control input
### From `RuntimeSnapshotProvider`
Expected observations:
- snapshot publication duration
- snapshot build failure
- snapshot version churn metrics
- repeated publish retries or stale-snapshot conditions
### From `ControlServices`
Expected observations:
- OSC decode failures
- websocket broadcast failures
- REST/control transport errors
- ingress queue depth
- coalescing/drop counts
- file-watch reload request activity
### From `RenderEngine`
Expected observations:
- frame render duration
- upload duration
- readback duration
- fallback to synchronous readback
- preview present timing
- render-local state resets caused by reload or incompatibility
### From `VideoBackend`
Expected observations:
- current playout queue depth
- system-memory playout frame counts by state: free, ready, and scheduled
- system-memory playout underrun, repeat, and drop counters
- system-memory frame age at schedule and completion time
- input signal state
- late frames
- dropped frames
- backend mode changes
- fallback from 10-bit to 8-bit input
- output-only black-frame mode
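As one hedged example of how a few of these backend observations might reach the write interface, the sketch below uses counters for discrete events and gauges for point-in-time values; the callback name and metric keys are hypothetical.
```cpp
#include <cstdint>

// Trimmed stand-ins for the write-side shapes proposed earlier.
struct TelemetryCounterDelta { const char* counterName; std::int64_t delta; };
struct TelemetryGaugeUpdate  { const char* gaugeName;   double value; };

class IHealthTelemetry
{
public:
    virtual ~IHealthTelemetry() = default;
    virtual void RecordCounterDelta(const TelemetryCounterDelta& delta) = 0;
    virtual void RecordGauge(const TelemetryGaugeUpdate& gauge) = 0;
};

// Hypothetical backend callback excerpt: counters for discrete events such as
// late frames, gauges for point-in-time values such as queue depth.
void OnPlayoutFrameCompleted(IHealthTelemetry& telemetry, bool late, int scheduledDepth)
{
    if (late)
    {
        telemetry.RecordCounterDelta({"videobackend.late_frames", 1});
    }
    telemetry.RecordGauge({"videobackend.playout_queue_depth",
                           static_cast<double>(scheduledDepth)});
}
```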
## Current Code Mapping
The current codebase already contains several telemetry responsibilities that should migrate here.
### Previous `RuntimeHost` Status Setters
These were the clearest initial migration candidates:
- `SetSignalStatus(...)`
- `TrySetSignalStatus(...)`
- `SetPerformanceStats(...)`
- `TrySetPerformanceStats(...)`
- `SetFramePacingStats(...)`
- `TrySetFramePacingStats(...)`
See:
- `RuntimeHost.h`
- `RuntimeHost.cpp`
In the target architecture, this kind of state should not sit on the same object that owns persistent layer truth.
### Render Timing Production
Current render timing is produced in:
- [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:50)
That timing sample should conceptually become:
- `RenderEngine -> HealthTelemetry::RecordTimingSample(...)`
not the old pattern:
- `RenderEngine -> RuntimeHost::TrySetPerformanceStats(...)`
### Playout And Signal Status Production
Current signal and frame pacing updates are produced in:
- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:49)
- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:61)
These should eventually become structured `VideoBackend` observations instead of bridge-to-host status writes.
### Direct Warning And Log Paths
Current examples:
- late/dropped frame warnings:
- [DeckLinkFrameTransfer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkFrameTransfer.cpp:84)
- backend fallback warnings:
- [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:305)
- [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:320)
- OSC errors:
- [OscServer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/OscServer.cpp:142)
- [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:100)
All of these are clear migration candidates for `AppendLogEntry(...)`, `RaiseWarning(...)`, or counter/timing updates.
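For one of these cases, a hypothetical sketch of the warning lifecycle is shown below: the same stable key is used to raise and later clear the condition, so occurrence counts and open/clear timestamps can be tracked centrally. The key and helper names are illustrative only.
```cpp
#include <string>
#include <string_view>

// Trimmed stand-ins for the write-side shapes proposed earlier.
struct TelemetryWarning
{
    std::string key;      // stable identity, e.g. "videobackend.late_frames"
    std::string message;  // operator-readable description
};

class IHealthTelemetry
{
public:
    virtual ~IHealthTelemetry() = default;
    virtual void RaiseWarning(const TelemetryWarning& warning) = 0;
    virtual void ClearWarning(std::string_view warningKey) = 0;
};

// Hypothetical pacing check: raising and clearing with the same key lets the
// telemetry layer own the open/clear lifecycle and deduplicate repeats.
void ReportPacingHealth(IHealthTelemetry& telemetry, bool framesAreLate)
{
    constexpr std::string_view kKey = "videobackend.late_frames";
    if (framesAreLate)
    {
        telemetry.RaiseWarning({std::string(kKey), "Playout frames are completing late"});
    }
    else
    {
        telemetry.ClearWarning(kKey);
    }
}
```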
## Health Snapshot Contract
`HealthTelemetry` should expose one coherent health snapshot that other publication layers can consume.
That snapshot should be able to answer, at minimum:
- what the overall app health is
- whether input signal is present
- whether playout is healthy, degraded, or underrunning
- whether render timing is within budget
- what active warnings exist
- what recent notable events occurred
- what the current subsystem-specific states are
The important boundary is:
- `HealthTelemetry` builds the health snapshot
- `ControlServices` may publish it
- UI consumes it
That avoids rebuilding health summaries ad hoc in UI-facing runtime state serializers.
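A hypothetical shape for that snapshot, mirroring the questions listed above; the fields are deliberately summary-level rather than copies of runtime state, and none of the names are final.
```cpp
#include <string>
#include <unordered_map>
#include <vector>

enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Hypothetical derived snapshot: summary-level operational state only,
// built by HealthTelemetry and published (if at all) via ControlServices.
struct HealthSnapshot
{
    SubsystemHealthState overall = SubsystemHealthState::Healthy;

    bool inputSignalPresent = false;
    bool playoutDegraded = false;
    bool renderWithinBudget = true;

    std::unordered_map<std::string, SubsystemHealthState> subsystemStates;
    std::vector<std::string> activeWarnings;       // warning keys or short summaries
    std::vector<std::string> recentNotableEvents;  // bounded, most recent first
};
```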
## Concurrency Expectations
This subsystem will likely receive updates from multiple threads:
- control ingress threads
- render thread
- backend callback threads
- coordinator/service threads
So the design should assume:
- low-contention write paths
- bounded memory
- no long-held global mutex that callbacks and render both depend on
Phase 1 does not require a lock-free implementation, but it does require the architecture to avoid recreating the old problem where health writes share the same lock as durable state and render-facing concerns.
Practical expectations:
- per-domain aggregation or lightweight internal locking is acceptable
- read snapshots should be cheap and stable
- callback paths should record telemetry cheaply and return
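One possible low-contention shape, sketched under the assumption that a fixed set of known counters and relaxed atomics are acceptable for the hot path; everything here (class name, counter keys) is hypothetical.
```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical hot-path counter store: writers touch a single relaxed atomic,
// readers copy the current values into a plain map as a stable snapshot.
class HealthTelemetryCounters
{
public:
    enum Counter : std::size_t
    {
        DroppedFrames = 0,
        OscDecodeFailures,
        CounterCount,
    };

    // Cheap hot-path write: no shared lock, safe to call from callbacks and render.
    void RecordDelta(Counter counter, std::int64_t delta)
    {
        counters_[counter].fetch_add(delta, std::memory_order_relaxed);
    }

    // Read-side aggregation: callers receive values, not views of live atomics.
    std::unordered_map<std::string, std::int64_t> SnapshotTotals() const
    {
        std::unordered_map<std::string, std::int64_t> totals;
        for (std::size_t i = 0; i < CounterCount; ++i)
        {
            totals.emplace(kCounterNames[i], counters_[i].load(std::memory_order_relaxed));
        }
        return totals;
    }

private:
    static constexpr const char* kCounterNames[CounterCount] = {
        "videobackend.dropped_frames",
        "controlservices.osc_decode_failures",
    };
    std::array<std::atomic<std::int64_t>, CounterCount> counters_{};
};
```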
## Migration Plan From Current Code
The safest migration path is to peel telemetry responsibilities away from the existing classes incrementally.
### Step 1: Introduce The `HealthTelemetry` Interface
Create a small interface and health model types first.
Initial responsibilities:
- append structured logs
- record timing samples
- record counter deltas
- raise and clear warnings
- build a read-only health snapshot
The first implementation can still be backed by simple in-memory structures.
### Step 2: Keep New Observations Off Runtime Storage
Route new health-style work into `HealthTelemetry` instead of adding more status fields to runtime storage.
This prevents the old status surface from growing during migration.
### Step 3: Replace Legacy Status Setters With Telemetry Producers
Refactor:
- render timing writes
- signal status writes
- playout pacing writes
so they publish structured observations instead of mutating store-adjacent fields.
### Step 4: Replace Direct `OutputDebugStringA` Warning Paths
Wrap common warning/error cases in telemetry producers.
This includes:
- OSC decode/dispatch failures
- DeckLink late/dropped frame notifications
- backend fallback notices
- screenshot write failures
Direct debug output can remain as a sink of telemetry if desired, but not as the primary source of truth.
### Step 5: Publish Health Snapshot Through UI/Diagnostics Paths
Once the snapshot format exists, let `ControlServices` publish health summaries and recent warnings explicitly rather than depending on the runtime-state payload alone.
## Risks
### 1. Telemetry becomes a hidden behavior controller
If warning states start being used as the indirect way subsystems tell each other what to do, the subsystem boundary will fail.
Guardrail:
- telemetry observes and reports
- it does not coordinate or command
### 2. Logging stays string-only
If the subsystem only centralizes text logging without structure, later diagnostics will still be difficult.
Guardrail:
- severity, subsystem, category, and optional fields should be first-class
### 3. Timing writes become too expensive
If every sample requires heavy locking or snapshot rebuilds, render and callback timing could regress.
Guardrail:
- cheap recording path
- derived summaries built separately from hot-path writes
### 4. Health snapshot duplicates runtime truth
If health snapshots start storing copies of durable runtime state, the subsystem boundary will blur again.
Guardrail:
- health snapshots summarize operational state
- they do not become a second runtime store
### 5. Warning severity semantics drift by subsystem
If each subsystem invents its own meaning for warning/degraded/error, operator visibility becomes noisy and inconsistent.
Guardrail:
- define shared severity and health-state vocabulary early
## Open Questions
### 1. Should debug-output sinks remain enabled by default?
Current recommendation:
- yes, as a sink fed by structured telemetry entries, not as the source of truth
### 2. How much timing history should be retained in memory?
Current recommendation:
- enough for short-term live troubleshooting and UI summaries
- not an unbounded time-series archive
### 3. Should operator-facing health and engineering diagnostics use the same snapshot?
Current recommendation:
- share one core telemetry model
- allow separate derived views for concise operator summaries versus deeper engineering detail
### 4. Where should threshold policy live if the app later formalizes warnings like "render over budget"?
Current recommendation:
- telemetry may evaluate declared thresholds
- subsystem owners still own mitigation behavior
### 5. Should input signal presence remain part of runtime state or move fully into telemetry?
Current recommendation:
- treat it as operational health state under `VideoBackend` reporting into telemetry
- avoid keeping it as a core durable runtime-store concern
## Success Criteria For This Subsystem
`HealthTelemetry` can be considered well-defined once the codebase can say, without ambiguity:
- all major subsystems have one place to publish timing, warnings, and counters
- health and timing state no longer share ownership with durable runtime state
- the UI can consume a stable health snapshot without scraping unrelated runtime fields
- direct debug-string warning paths are being retired or wrapped behind structured telemetry
- degraded-but-running conditions are visible as first-class state
## Short Version
`HealthTelemetry` is the subsystem that should answer:
- what is healthy right now
- what is degraded right now
- what recent warnings and errors occurred
- how render, control, and playout timing are behaving
It should:
- collect structured logs
- collect warnings and counters
- collect timing samples and gauges
- build stable health snapshots for publication
It should not:
- own core runtime truth
- decide app behavior
- coordinate recovery actions
- become a replacement for the render or backend policy layers
If this boundary holds, later phases can keep moving toward a much more diagnosable live system without putting timing and warning state back into runtime storage.