HealthTelemetry Subsystem Design
This document expands the HealthTelemetry subsystem introduced in PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md.
HealthTelemetry is the subsystem that owns operational visibility for the app. Its purpose is to gather health state, warnings, counters, logs, and timing observations from the other subsystems and publish them in a structured way without becoming a second control plane.
Today, those responsibilities are fragmented across RuntimeHost status setters, ad hoc OutputDebugStringA calls, callback-local warnings, and UI-facing runtime-state payloads. The result is that the app can often detect problems, but it does not yet have one clear place that answers:
- what is healthy right now
- what is degraded right now
- what has recently gone wrong
- which subsystem is under pressure
- how timing behavior is trending over time
HealthTelemetry is the target boundary that should answer those questions.
Why This Subsystem Exists
The current code already contains meaningful health and timing signals, but they are spread through unrelated ownership domains:
- RuntimeHost stores signal and timing status
- render and bridge code report timing by writing back into RuntimeHost
- backend warning paths still log directly
- control ingress failures still log directly
This creates several recurring problems:
- health information shares storage and lock scope with runtime state
- warnings are not consistently classified by subsystem or severity
- timing data is hard to compare across render, control, and backend paths
- UI connection state and operational state are too closely coupled
- logging is mostly text-first instead of structured-first
- recovery behavior is hard to audit because the app does not retain a coherent health snapshot
HealthTelemetry exists so later phases can move timing and health concerns out of RuntimeHost, out of callback-local logging, and into one subsystem whose only job is observation and reporting.
Design Goals
HealthTelemetry should optimize for:
- one authoritative home for operational visibility
- structured health state per subsystem
- timing and counter recording that does not require a UI to be connected
- low-friction reporting from render, backend, coordinator, and services
- explicit degraded-mode reporting instead of only raw text logs
- support for live operator summaries and deeper engineering diagnostics
- minimal risk of telemetry writes becoming a render or callback bottleneck
Responsibilities
HealthTelemetry owns structured operational visibility.
Primary responsibilities:
- accept timing samples from major subsystems
- accept counter deltas and point-in-time gauges
- accept warning, error, and degraded-state transitions
- collect subsystem-scoped health state
- collect operator-visible summary state
- collect structured log entries
- build stable health snapshots for UI, diagnostics, and later persistence/export if desired
- retain recent history needed for short-term troubleshooting
- classify observations by subsystem, severity, and category
Secondary responsibilities that still fit here:
- smoothing or rolling-window summaries for timing metrics
- mapping raw subsystem observations into operator-facing health summaries
- deduplicating repeated warnings
- tracking warning open/clear lifecycles
- providing bounded in-memory history for recent logs and warning transitions
Explicit Non-Responsibilities
HealthTelemetry should not become a behavior owner.
It does not own:
- layer stack truth
- persistence policy
- render scheduling
- DeckLink scheduling
- OSC buffering or routing
- reload coordination
- shader compilation
- recovery actions themselves
It also should not decide:
- whether render should skip a frame
- whether VideoBackend should increase queue depth
- whether RuntimeCoordinator should reject a mutation
- whether ControlServices should drop or coalesce ingress traffic
Those decisions belong to the subsystem being observed. HealthTelemetry may describe that a subsystem is degraded, but it must not quietly become the mechanism that tells the app how to react.
Ownership Boundaries
HealthTelemetry owns the following state categories.
Structured Log State
Examples:
- subsystem name
- severity
- category
- timestamp
- message
- optional structured fields such as layer id, preset name, queue depth, or shader id
This replaces the idea that OutputDebugStringA text is itself the main diagnostic product.
Warning And Error State
Examples:
- active warning set
- warning occurrence counts
- first-seen and last-seen timestamps
- clear timestamps
- subsystem-scoped degraded flags
This is the durable in-memory operational state that should answer "what is currently wrong?" even if no UI was connected when the warning was raised.
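The warning-state examples above can be sketched as a single record type. This is an illustrative shape, not existing code: the field names and the `Raise` helper are assumptions, but they show how occurrence counts and first/last-seen timestamps let the same record answer "what is currently wrong?" without duplicating entries.

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Hypothetical shape for a durable in-memory warning record.
struct TelemetryWarningRecord {
    std::string key;              // stable identity, e.g. "decklink.late_frames"
    std::string subsystem;        // reporting domain
    std::uint64_t occurrences = 0;
    std::chrono::steady_clock::time_point firstSeen{};
    std::chrono::steady_clock::time_point lastSeen{};
    bool active = false;          // true between raise and clear
};

// Raising the same key again bumps the count instead of adding a new record.
inline void Raise(TelemetryWarningRecord& w,
                  std::chrono::steady_clock::time_point now) {
    if (w.occurrences == 0) w.firstSeen = now;
    ++w.occurrences;
    w.lastSeen = now;
    w.active = true;
}
```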
Timing State
Examples:
- render duration
- frame budget
- playout completion interval
- smoothed completion interval
- queue depth
- input upload skip count
- async readback fallback count
- control ingress lag or queue depth
- snapshot publication cost
This state should be organized as time-series-like rolling telemetry, not as a grab bag of unrelated double fields mixed into the runtime store.
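A minimal sketch of what "time-series-like rolling telemetry" could mean for one metric, assuming a fixed-capacity window (the capacity of 120 samples is an arbitrary illustration, not a decided value):

```cpp
#include <array>
#include <cstddef>

// Illustrative fixed-capacity rolling window for one timing metric.
class RollingTimingSeries {
public:
    void Record(double milliseconds) {
        samples_[next_] = milliseconds;
        next_ = (next_ + 1) % samples_.size();
        if (count_ < samples_.size()) ++count_;
    }

    double Average() const {
        if (count_ == 0) return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i < count_; ++i) sum += samples_[i];
        return sum / static_cast<double>(count_);
    }

private:
    std::array<double, 120> samples_{};  // e.g. ~2 s of frames at 60 fps
    std::size_t next_ = 0;
    std::size_t count_ = 0;
};
```

One such series per metric keeps render duration, queue depth, and ingress lag comparable without mixing unrelated double fields into the runtime store.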
Health Snapshot State
Examples:
- current subsystem health summaries
- current operator-facing overall health summary
- most recent warning list
- recent counters and timing summaries
- "degraded but still running" status
This is the material that ControlServices or a diagnostics endpoint may later publish.
State Model
The subsystem should model health and telemetry in a way that supports both machine-friendly and operator-friendly views.
Suggested conceptual model:
- TelemetryLogEntry
- TelemetryWarningRecord
- TelemetryCounterState
- TelemetryGaugeState
- TelemetryTimingSeries
- SubsystemHealthState
- HealthSnapshot
Important distinction:
- raw observations are append/update operations
- health snapshots are derived read models
That distinction matters because the system should be able to retain richer recent telemetry internally than what is necessarily sent to the UI on every refresh.
Subsystem Health Domains
HealthTelemetry should track health by subsystem rather than as one flat status blob.
At minimum, Phase 1 should assume domains for:
- RuntimeStore
- RuntimeCoordinator
- RuntimeSnapshotProvider
- ControlServices
- RenderEngine
- VideoBackend
Optional cross-cutting domain:
ApplicationShell
Each domain should be able to express states such as:
- Healthy
- Warning
- Degraded
- Error
- Unavailable
The exact enum can change, but the design should preserve the idea that each subsystem reports into its own health lane first, and only then is an overall status derived.
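The "per-subsystem lane first, overall status derived second" idea can be sketched with a severity-ordered enum, so the overall state falls out of a worst-state-wins fold. The enum ordering and the helper are assumptions for illustration:

```cpp
#include <algorithm>
#include <initializer_list>

// Ordered from best to worst so "worst state wins" is a simple max.
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Derives the overall status from the per-subsystem health lanes.
inline SubsystemHealthState DeriveOverall(
        std::initializer_list<SubsystemHealthState> lanes) {
    SubsystemHealthState worst = SubsystemHealthState::Healthy;
    for (auto s : lanes) worst = std::max(worst, s);
    return worst;
}
```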
Logging Boundaries
Logging belongs here, but logging should be structured-first.
Expected inputs:
- subsystem-scoped debug information
- warning and error messages
- recovery events
- notable state transitions
- significant operator actions that matter for diagnostics
Expected design rules:
- textual messages are still useful, but they should be wrapped in a structured log entry
- repeated transient failures should be rate-limited or deduplicated at the telemetry layer where possible
- log storage should be bounded in memory
- UI publication should read from health/log snapshots, not scrape stdout/debug output
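The rate-limiting rule above could be sketched as a small keyed limiter at the telemetry layer. The class name, key scheme, and interval are illustrative assumptions:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Illustrative rate limiter for repeated transient failures: a message key is
// admitted at most once per interval, so floods collapse to periodic entries.
class LogRateLimiter {
public:
    explicit LogRateLimiter(std::chrono::milliseconds interval)
        : interval_(interval) {}

    bool ShouldEmit(const std::string& key,
                    std::chrono::steady_clock::time_point now) {
        auto it = lastEmit_.find(key);
        if (it != lastEmit_.end() && now - it->second < interval_) return false;
        lastEmit_[key] = now;
        return true;
    }

private:
    std::chrono::milliseconds interval_;
    std::unordered_map<std::string,
                       std::chrono::steady_clock::time_point> lastEmit_;
};
```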
Examples of current direct log paths that should eventually move behind HealthTelemetry:
- OSC decode/dispatch failures
- screenshot write failures
- DeckLink fallback warnings
- late/dropped frame warnings
Metrics And Timing Boundaries
Timing and metrics should also move here, but their ownership line matters.
HealthTelemetry should own:
- metric collection interfaces
- rolling summaries
- recent history buffers
- warning thresholds if the app later chooses to define them declaratively
- operator-facing derived summaries
The producing subsystem should still own:
- the meaning of the measurement
- when it is sampled
- whether it triggers local mitigation
Examples:
- RenderEngine owns when render duration is sampled
- VideoBackend owns when queue depth or playout lateness is sampled
- ControlServices owns when ingress backlog is sampled
- RuntimeSnapshotProvider owns when snapshot publish/build timing is sampled
HealthTelemetry should not invent those timings by inference. It records them when producers report them.
Proposed Interfaces
These are target-shape interfaces, not final signatures.
Write/Record Interface
Core write-side operations could look like:
```cpp
enum class TelemetrySeverity;
enum class TelemetrySubsystem;

struct TelemetryLogEntry;
struct TelemetryWarning;
struct TelemetryTimingSample;
struct TelemetryCounterDelta;
struct TelemetryGaugeUpdate;

class IHealthTelemetry
{
public:
    virtual ~IHealthTelemetry() = default;

    virtual void AppendLogEntry(const TelemetryLogEntry& entry) = 0;
    virtual void RaiseWarning(const TelemetryWarning& warning) = 0;
    virtual void ClearWarning(std::string_view warningKey) = 0;

    virtual void RecordTimingSample(const TelemetryTimingSample& sample) = 0;
    virtual void RecordCounterDelta(const TelemetryCounterDelta& delta) = 0;
    virtual void RecordGauge(const TelemetryGaugeUpdate& gauge) = 0;

    virtual void ReportSubsystemState(TelemetrySubsystem subsystem,
                                      SubsystemHealthState state) = 0;
};
```
The key is that every subsystem should be able to publish observations without also needing to know how UI payloads, rolling summaries, or log retention are implemented.
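A compilable sketch of that producer pattern, with minimal stand-in types since the declarations above are target shapes only (the field names and the `InMemoryTelemetry` stub are assumptions):

```cpp
#include <string>

// Minimal stand-in for one of the forward-declared types above.
struct TelemetryTimingSample {
    std::string subsystem;
    std::string metric;
    double milliseconds = 0.0;
};

// Reduced write interface, enough to show the producer side.
class IHealthTelemetry {
public:
    virtual ~IHealthTelemetry() = default;
    virtual void RecordTimingSample(const TelemetryTimingSample& sample) = 0;
};

// A producer depends only on the write interface: it reports and returns,
// knowing nothing about retention, rolling summaries, or UI payloads.
void ReportRenderDuration(IHealthTelemetry& telemetry, double renderMs) {
    telemetry.RecordTimingSample({"RenderEngine", "frame_render_ms", renderMs});
}

// Trivial in-memory sink, e.g. for unit-testing producers.
struct InMemoryTelemetry : IHealthTelemetry {
    TelemetryTimingSample last;
    void RecordTimingSample(const TelemetryTimingSample& s) override { last = s; }
};
```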
Read Interface
Expected read-side operations:
- BuildHealthSnapshot()
- GetSubsystemHealth(...)
- GetRecentLogs(...)
- GetActiveWarnings()
- GetRecentTimingSummary(...)
Design notes:
- the read interface should return stable snapshots or read models
- UI/websocket publication should consume those snapshots through ControlServices
- read-side access should not require direct knowledge of internal ring buffers or lock layout
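A sketch of what a "stable snapshot" could look like as a read model. Returning a plain value type by copy is one way to meet the design notes above; the field names and the string-typed overall state are illustrative assumptions:

```cpp
#include <string>
#include <vector>

// Illustrative derived read model: a value-type snapshot callers can hold
// without touching telemetry internals or locks.
struct HealthSnapshot {
    std::string overallState;                 // e.g. "Healthy", "Warning"
    std::vector<std::string> activeWarnings;  // warning keys currently open
    bool inputSignalPresent = false;
};

// Returning by value means the caller owns an immutable copy; later telemetry
// writes cannot invalidate a snapshot already handed out.
HealthSnapshot BuildHealthSnapshot(const std::vector<std::string>& warnings,
                                   bool signalPresent) {
    HealthSnapshot snap;
    snap.overallState = warnings.empty() ? "Healthy" : "Warning";
    snap.activeWarnings = warnings;
    snap.inputSignalPresent = signalPresent;
    return snap;
}
```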
Producer Expectations By Subsystem
The parent Phase 1 design already allows multiple subsystems to publish into telemetry. This section makes that concrete.
From RuntimeCoordinator
Expected observations:
- mutation rejected
- reload requested
- preset apply failed
- transient state cleared due to compatibility rules
- policy-driven degraded notices such as repeated invalid external control input
From RuntimeSnapshotProvider
Expected observations:
- snapshot publication duration
- snapshot build failure
- snapshot version churn metrics
- repeated publish retries or stale-snapshot conditions
From ControlServices
Expected observations:
- OSC decode failures
- websocket broadcast failures
- REST/control transport errors
- ingress queue depth
- coalescing/drop counts
- file-watch reload request activity
From RenderEngine
Expected observations:
- frame render duration
- upload duration
- readback duration
- fallback to synchronous readback
- preview present timing
- render-local state resets caused by reload or incompatibility
From VideoBackend
Expected observations:
- current playout queue depth
- input signal state
- late frames
- dropped frames
- backend mode changes
- fallback from 10-bit to 8-bit input
- output-only black-frame mode
Current Code Mapping
The current codebase already contains several telemetry responsibilities that should migrate here.
RuntimeHost Status Setters
These are the clearest existing candidates:
- SetSignalStatus(...) / TrySetSignalStatus(...)
- SetPerformanceStats(...) / TrySetPerformanceStats(...)
- SetFramePacingStats(...) / TrySetFramePacingStats(...)
See:
In the target architecture, this kind of state should no longer sit on the same object that owns persistent layer truth.
Render Timing Production
Current render timing is produced in:
That timing sample should conceptually become:
RenderEngine -> HealthTelemetry::RecordTimingSample(...)
not:
RenderEngine -> RuntimeHost::TrySetPerformanceStats(...)
Playout And Signal Status Production
Current signal and frame pacing updates are produced in:
These should eventually become structured VideoBackend observations instead of bridge-to-host status writes.
Direct Warning And Log Paths
Current examples:
- late/dropped frame warnings:
- backend fallback warnings:
- OSC errors:
All of these are clear migration candidates for AppendLogEntry(...), RaiseWarning(...), or counter/timing updates.
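A hypothetical before/after for one such migration. The entry fields and the helper are assumptions; the point is that the text survives, but classification becomes first-class:

```cpp
#include <string>

// Structured replacement for a direct debug-string warning path.
struct TelemetryLogEntry {
    std::string subsystem;
    std::string severity;
    std::string category;
    std::string message;
};

// Before: OutputDebugStringA("DeckLink: late frame\n");
// After: build a classified entry that a telemetry sink can also forward
// to debug output if that sink stays enabled.
TelemetryLogEntry MakeLateFrameWarning(int lateCount) {
    return {"VideoBackend", "Warning", "playout",
            "late frames observed: " + std::to_string(lateCount)};
}
```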
Health Snapshot Contract
HealthTelemetry should expose one coherent health snapshot that other publication layers can consume.
That snapshot should be able to answer, at minimum:
- what the overall app health is
- whether input signal is present
- whether playout is healthy, degraded, or underrunning
- whether render timing is within budget
- what active warnings exist
- what recent notable events occurred
- what the current subsystem-specific states are
The important boundary is:
- HealthTelemetry builds the health snapshot
- ControlServices may publish it
- UI consumes it
That avoids rebuilding health summaries ad hoc in UI-facing runtime state serializers.
Concurrency Expectations
This subsystem will likely receive updates from multiple threads:
- control ingress threads
- render thread
- backend callback threads
- coordinator/service threads
So the design should assume:
- low-contention write paths
- bounded memory
- no long-held global mutex that callbacks and render both depend on
Phase 1 does not require lock-free implementation, but it does require the architecture to avoid recreating the RuntimeHost problem where health writes share the same lock as durable state and render-facing concerns.
Practical expectations:
- per-domain aggregation or lightweight internal locking is acceptable
- read snapshots should be cheap and stable
- callback paths should record telemetry cheaply and return
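One way to keep the hot-path recording cheap, as a sketch: counters that producers bump with a relaxed atomic and return, so no mutex is shared between callbacks, render, and the snapshot builder. Whether relaxed ordering is sufficient everywhere is a design decision, not something this document mandates:

```cpp
#include <atomic>
#include <cstdint>

// Illustrative hot-path counter: producers bump an atomic and return; the
// snapshot builder reads it later on its own schedule.
class TelemetryCounter {
public:
    void Add(std::uint64_t delta) {
        // Relaxed is enough for a standalone statistic: no other memory
        // needs to be ordered against this increment.
        value_.fetch_add(delta, std::memory_order_relaxed);
    }

    std::uint64_t Load() const {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};
```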
Migration Plan From Current Code
The safest migration path is to peel telemetry responsibilities away from the existing classes incrementally.
Step 1: Introduce The HealthTelemetry Interface
Create a small interface and health model types first.
Initial responsibilities:
- append structured logs
- record timing samples
- record counter deltas
- raise and clear warnings
- build a read-only health snapshot
The first implementation can still be backed by simple in-memory structures.
Step 2: Move New Observations Off RuntimeHost
Before removing old setters, route new health-style work into HealthTelemetry instead of adding more RuntimeHost status fields.
This prevents the old status surface from growing during migration.
Step 3: Replace RuntimeHost Status Setters With Telemetry Producers
Refactor:
- render timing writes
- signal status writes
- playout pacing writes
so they publish structured observations instead of mutating store-adjacent fields.
Step 4: Replace Direct OutputDebugStringA Warning Paths
Wrap common warning/error cases in telemetry producers.
This includes:
- OSC decode/dispatch failures
- DeckLink late/dropped frame notifications
- backend fallback notices
- screenshot write failures
Direct debug output can remain as a sink of telemetry if desired, but not as the primary source of truth.
Step 5: Publish Health Snapshot Through UI/Diagnostics Paths
Once the snapshot format exists, let ControlServices publish health summaries and recent warnings explicitly rather than depending on the runtime-state payload alone.
Risks
1. Telemetry becomes a hidden behavior controller
If warning states start being used as the indirect way subsystems tell each other what to do, the subsystem boundary will fail.
Guardrail:
- telemetry observes and reports
- it does not coordinate or command
2. Logging stays string-only
If the subsystem only centralizes text logging without structure, later diagnostics will still be difficult.
Guardrail:
- severity, subsystem, category, and optional fields should be first-class
3. Timing writes become too expensive
If every sample requires heavy locking or snapshot rebuilds, render and callback timing could regress.
Guardrail:
- cheap recording path
- derived summaries built separately from hot-path writes
4. Health snapshot duplicates runtime truth
If health snapshots start storing copies of durable runtime state, the subsystem boundary will blur again.
Guardrail:
- health snapshots summarize operational state
- they do not become a second runtime store
5. Warning severity semantics drift by subsystem
If each subsystem invents its own meaning for warning/degraded/error, operator visibility becomes noisy and inconsistent.
Guardrail:
- define shared severity and health-state vocabulary early
Open Questions
1. Should debug-output sinks remain enabled by default?
Current recommendation:
- yes, as a sink fed by structured telemetry entries, not as the source of truth
2. How much timing history should be retained in memory?
Current recommendation:
- enough for short-term live troubleshooting and UI summaries
- not an unbounded time-series archive
3. Should operator-facing health and engineering diagnostics use the same snapshot?
Current recommendation:
- share one core telemetry model
- allow separate derived views for concise operator summaries versus deeper engineering detail
4. Where should threshold policy live if the app later formalizes warnings like "render over budget"?
Current recommendation:
- telemetry may evaluate declared thresholds
- subsystem owners still own mitigation behavior
5. Should input signal presence remain part of runtime state or move fully into telemetry?
Current recommendation:
- treat it as operational health state under VideoBackend reporting into telemetry
- avoid keeping it as a core durable runtime-store concern
Success Criteria For This Subsystem
HealthTelemetry can be considered well-defined once the codebase can say, without ambiguity:
- all major subsystems have one place to publish timing, warnings, and counters
- health and timing state no longer share ownership with durable runtime state
- the UI can consume a stable health snapshot without scraping unrelated runtime fields
- direct debug-string warning paths are being retired or wrapped behind structured telemetry
- degraded-but-running conditions are visible as first-class state
Short Version
HealthTelemetry is the subsystem that should answer:
- what is healthy right now
- what is degraded right now
- what recent warnings and errors occurred
- how render, control, and playout timing are behaving
It should:
- collect structured logs
- collect warnings and counters
- collect timing samples and gauges
- build stable health snapshots for publication
It should not:
- own core runtime truth
- decide app behavior
- coordinate recovery actions
- become a replacement for the render or backend policy layers
If this boundary holds, later phases can remove timing and warning state from RuntimeHost and move toward a much more diagnosable live system.