HealthTelemetry Subsystem Design
This document expands the HealthTelemetry subsystem introduced in PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md.
HealthTelemetry is the subsystem that owns operational visibility for the app. Its purpose is to gather health state, warnings, counters, logs, and timing observations from the other subsystems and publish them in a structured way without becoming a second control plane.
Today, those responsibilities are fragmented across RuntimeHost status setters, ad hoc OutputDebugStringA calls, callback-local warnings, and UI-facing runtime-state payloads. The result is that the app can often detect problems, but it does not yet have one clear place that answers:
- what is healthy right now
- what is degraded right now
- what has recently gone wrong
- which subsystem is under pressure
- how timing behavior is trending over time
HealthTelemetry is the target boundary that should answer those questions.
Why This Subsystem Exists
The current code already contains meaningful health and timing signals, but they are spread through unrelated ownership domains:
- RuntimeHost stores signal and timing status
- render and bridge code report timing by writing back into RuntimeHost
- backend warning paths still log directly
- control ingress failures still log directly
This creates several recurring problems:
- health information shares storage and lock scope with runtime state
- warnings are not consistently classified by subsystem or severity
- timing data is hard to compare across render, control, and backend paths
- UI connection state and operational state are too closely coupled
- logging is mostly text-first instead of structured-first
- recovery behavior is hard to audit because the app does not retain a coherent health snapshot
HealthTelemetry exists so later phases can move timing and health concerns out of RuntimeHost, out of callback-local logging, and into one subsystem whose only job is observation and reporting.
Design Goals
HealthTelemetry should optimize for:
- one authoritative home for operational visibility
- structured health state per subsystem
- timing and counter recording that does not require a UI to be connected
- low-friction reporting from render, backend, coordinator, and services
- explicit degraded-mode reporting instead of only raw text logs
- support for live operator summaries and deeper engineering diagnostics
- minimal risk of telemetry writes becoming a render or callback bottleneck
Responsibilities
HealthTelemetry owns structured operational visibility.
Primary responsibilities:
- accept timing samples from major subsystems
- accept counter deltas and point-in-time gauges
- accept warning, error, and degraded-state transitions
- collect subsystem-scoped health state
- collect operator-visible summary state
- collect structured log entries
- build stable health snapshots for UI, diagnostics, and later persistence/export if desired
- retain recent history needed for short-term troubleshooting
- classify observations by subsystem, severity, and category
Secondary responsibilities that still fit here:
- smoothing or rolling-window summaries for timing metrics
- mapping raw subsystem observations into operator-facing health summaries
- deduplicating repeated warnings
- tracking warning open/clear lifecycles
- providing bounded in-memory history for recent logs and warning transitions
Explicit Non-Responsibilities
HealthTelemetry should not become a behavior owner.
It does not own:
- layer stack truth
- persistence policy
- render scheduling
- DeckLink scheduling
- OSC buffering or routing
- reload coordination
- shader compilation
- recovery actions themselves
It also should not decide:
- whether render should skip a frame
- whether VideoBackend should increase queue depth
- whether RuntimeCoordinator should reject a mutation
- whether ControlServices should drop or coalesce ingress traffic
Those decisions belong to the subsystem being observed. HealthTelemetry may describe that a subsystem is degraded, but it must not quietly become the mechanism that tells the app how to react.
Ownership Boundaries
HealthTelemetry owns the following state categories.
Structured Log State
Examples:
- subsystem name
- severity
- category
- timestamp
- message
- optional structured fields such as layer id, preset name, queue depth, or shader id
This replaces the idea that OutputDebugStringA text is itself the main diagnostic product.
Warning And Error State
Examples:
- active warning set
- warning occurrence counts
- first-seen and last-seen timestamps
- clear timestamps
- subsystem-scoped degraded flags
This is the durable in-memory operational state that should answer "what is currently wrong?" even if no UI was connected when the warning was raised.
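The warning-state examples above can be sketched as a single record type. This is an illustrative shape, not existing code: the field names and the `Raise` helper are assumptions, but they show how occurrence counts and first/last-seen timestamps let the same record answer "what is currently wrong?" without duplicating entries.

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Hypothetical shape for a durable in-memory warning record.
struct TelemetryWarningRecord {
    std::string key;              // stable identity, e.g. "decklink.late_frames"
    std::string subsystem;        // reporting domain
    std::uint64_t occurrences = 0;
    std::chrono::steady_clock::time_point firstSeen{};
    std::chrono::steady_clock::time_point lastSeen{};
    bool active = false;          // true between raise and clear
};

// Raising the same key again bumps the count instead of adding a new record.
inline void Raise(TelemetryWarningRecord& w,
                  std::chrono::steady_clock::time_point now) {
    if (w.occurrences == 0) w.firstSeen = now;
    ++w.occurrences;
    w.lastSeen = now;
    w.active = true;
}
```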
Timing State
Examples:
- render duration
- frame budget
- playout completion interval
- smoothed completion interval
- queue depth
- input upload skip count
- async readback fallback count
- control ingress lag or queue depth
- snapshot publication cost
This state should be organized as time-series-like rolling telemetry, not as a grab bag of unrelated double fields mixed into the runtime store.
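A minimal sketch of what "time-series-like rolling telemetry" could mean for one metric, assuming a fixed-capacity window (the capacity of 120 samples is an arbitrary illustration, not a decided value):

```cpp
#include <array>
#include <cstddef>

// Illustrative fixed-capacity rolling window for one timing metric.
class RollingTimingSeries {
public:
    void Record(double milliseconds) {
        samples_[next_] = milliseconds;
        next_ = (next_ + 1) % samples_.size();
        if (count_ < samples_.size()) ++count_;
    }

    double Average() const {
        if (count_ == 0) return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i < count_; ++i) sum += samples_[i];
        return sum / static_cast<double>(count_);
    }

private:
    std::array<double, 120> samples_{};  // e.g. ~2 s of frames at 60 fps
    std::size_t next_ = 0;
    std::size_t count_ = 0;
};
```

One such series per metric keeps render duration, queue depth, and ingress lag comparable without mixing unrelated double fields into the runtime store.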
Health Snapshot State
Examples:
- current subsystem health summaries
- current operator-facing overall health summary
- most recent warning list
- recent counters and timing summaries
- "degraded but still running" status
This is the material that ControlServices or a diagnostics endpoint may later publish.
State Model
The subsystem should model health and telemetry in a way that supports both machine-friendly and operator-friendly views.
Suggested conceptual model:
- TelemetryLogEntry
- TelemetryWarningRecord
- TelemetryCounterState
- TelemetryGaugeState
- TelemetryTimingSeries
- SubsystemHealthState
- HealthSnapshot
Important distinction:
- raw observations are append/update operations
- health snapshots are derived read models
That distinction matters because the system should be able to retain richer recent telemetry internally than what is necessarily sent to the UI on every refresh.
Subsystem Health Domains
HealthTelemetry should track health by subsystem rather than as one flat status blob.
At minimum, Phase 1 should assume domains for:
- RuntimeStore
- RuntimeCoordinator
- RuntimeSnapshotProvider
- ControlServices
- RenderEngine
- VideoBackend
Optional cross-cutting domain:
ApplicationShell
Each domain should be able to express states such as:
- Healthy
- Warning
- Degraded
- Error
- Unavailable
The exact enum can change, but the design should preserve the idea that each subsystem reports into its own health lane first, and only then is an overall status derived.
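The "per-subsystem lane first, overall status derived second" idea can be sketched with a severity-ordered enum, so the overall state falls out of a worst-state-wins fold. The enum ordering and the helper are assumptions for illustration:

```cpp
#include <algorithm>
#include <initializer_list>

// Ordered from best to worst so "worst state wins" is a simple max.
enum class SubsystemHealthState { Healthy, Warning, Degraded, Error, Unavailable };

// Derives the overall status from the per-subsystem health lanes.
inline SubsystemHealthState DeriveOverall(
        std::initializer_list<SubsystemHealthState> lanes) {
    SubsystemHealthState worst = SubsystemHealthState::Healthy;
    for (auto s : lanes) worst = std::max(worst, s);
    return worst;
}
```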
Logging Boundaries
Logging belongs here, but logging should be structured-first.
Expected inputs:
- subsystem-scoped debug information
- warning and error messages
- recovery events
- notable state transitions
- significant operator actions that matter for diagnostics
Expected design rules:
- textual messages are still useful, but they should be wrapped in a structured log entry
- repeated transient failures should be rate-limited or deduplicated at the telemetry layer where possible
- log storage should be bounded in memory
- UI publication should read from health/log snapshots, not scrape stdout/debug output
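The rate-limiting rule above could be sketched as a small keyed limiter at the telemetry layer. The class name, key scheme, and interval are illustrative assumptions:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Illustrative rate limiter for repeated transient failures: a message key is
// admitted at most once per interval, so floods collapse to periodic entries.
class LogRateLimiter {
public:
    explicit LogRateLimiter(std::chrono::milliseconds interval)
        : interval_(interval) {}

    bool ShouldEmit(const std::string& key,
                    std::chrono::steady_clock::time_point now) {
        auto it = lastEmit_.find(key);
        if (it != lastEmit_.end() && now - it->second < interval_) return false;
        lastEmit_[key] = now;
        return true;
    }

private:
    std::chrono::milliseconds interval_;
    std::unordered_map<std::string,
                       std::chrono::steady_clock::time_point> lastEmit_;
};
```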
Examples of current direct log paths that should eventually move behind HealthTelemetry:
- OSC decode/dispatch failures
- screenshot write failures
- DeckLink fallback warnings
- late/dropped frame warnings
Metrics And Timing Boundaries
Timing and metrics should also move here, but their ownership line matters.
HealthTelemetry should own:
- metric collection interfaces
- rolling summaries
- recent history buffers
- warning thresholds if the app later chooses to define them declaratively
- operator-facing derived summaries
The producing subsystem should still own:
- the meaning of the measurement
- when it is sampled
- whether it triggers local mitigation
Examples:
- RenderEngine owns when render duration is sampled
- VideoBackend owns when queue depth or playout lateness is sampled
- ControlServices owns when ingress backlog is sampled
- RuntimeSnapshotProvider owns when snapshot publish/build timing is sampled
HealthTelemetry should not invent those timings by inference. It records them when producers report them.
Proposed Interfaces
These are target-shape interfaces, not final signatures.
Write/Record Interface
Core write-side operations could look like:
```cpp
enum class TelemetrySeverity;
enum class TelemetrySubsystem;

struct TelemetryLogEntry;
struct TelemetryWarning;
struct TelemetryTimingSample;
struct TelemetryCounterDelta;
struct TelemetryGaugeUpdate;

class IHealthTelemetry
{
public:
    virtual ~IHealthTelemetry() = default;

    virtual void AppendLogEntry(const TelemetryLogEntry& entry) = 0;
    virtual void RaiseWarning(const TelemetryWarning& warning) = 0;
    virtual void ClearWarning(std::string_view warningKey) = 0;

    virtual void RecordTimingSample(const TelemetryTimingSample& sample) = 0;
    virtual void RecordCounterDelta(const TelemetryCounterDelta& delta) = 0;
    virtual void RecordGauge(const TelemetryGaugeUpdate& gauge) = 0;

    virtual void ReportSubsystemState(TelemetrySubsystem subsystem,
                                      SubsystemHealthState state) = 0;
};
```
The key is that every subsystem should be able to publish observations without also needing to know how UI payloads, rolling summaries, or log retention are implemented.
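A compilable sketch of that producer pattern, with minimal stand-in types since the declarations above are target shapes only (the field names and the `InMemoryTelemetry` stub are assumptions):

```cpp
#include <string>

// Minimal stand-in for one of the forward-declared types above.
struct TelemetryTimingSample {
    std::string subsystem;
    std::string metric;
    double milliseconds = 0.0;
};

// Reduced write interface, enough to show the producer side.
class IHealthTelemetry {
public:
    virtual ~IHealthTelemetry() = default;
    virtual void RecordTimingSample(const TelemetryTimingSample& sample) = 0;
};

// A producer depends only on the write interface: it reports and returns,
// knowing nothing about retention, rolling summaries, or UI payloads.
void ReportRenderDuration(IHealthTelemetry& telemetry, double renderMs) {
    telemetry.RecordTimingSample({"RenderEngine", "frame_render_ms", renderMs});
}

// Trivial in-memory sink, e.g. for unit-testing producers.
struct InMemoryTelemetry : IHealthTelemetry {
    TelemetryTimingSample last;
    void RecordTimingSample(const TelemetryTimingSample& s) override { last = s; }
};
```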
Read Interface
Expected read-side operations:
- BuildHealthSnapshot()
- GetSubsystemHealth(...)
- GetRecentLogs(...)
- GetActiveWarnings()
- GetRecentTimingSummary(...)
Design notes:
- the read interface should return stable snapshots or read models
- UI/websocket publication should consume those snapshots through ControlServices
- read-side access should not require direct knowledge of internal ring buffers or lock layout
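A sketch of what a "stable snapshot" could look like as a read model. Returning a plain value type by copy is one way to meet the design notes above; the field names and the string-typed overall state are illustrative assumptions:

```cpp
#include <string>
#include <vector>

// Illustrative derived read model: a value-type snapshot callers can hold
// without touching telemetry internals or locks.
struct HealthSnapshot {
    std::string overallState;                 // e.g. "Healthy", "Warning"
    std::vector<std::string> activeWarnings;  // warning keys currently open
    bool inputSignalPresent = false;
};

// Returning by value means the caller owns an immutable copy; later telemetry
// writes cannot invalidate a snapshot already handed out.
HealthSnapshot BuildHealthSnapshot(const std::vector<std::string>& warnings,
                                   bool signalPresent) {
    HealthSnapshot snap;
    snap.overallState = warnings.empty() ? "Healthy" : "Warning";
    snap.activeWarnings = warnings;
    snap.inputSignalPresent = signalPresent;
    return snap;
}
```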
Producer Expectations By Subsystem
The parent Phase 1 design already allows multiple subsystems to publish into telemetry. This section makes that concrete.
From RuntimeCoordinator
Expected observations:
- mutation rejected
- reload requested
- preset apply failed
- transient state cleared due to compatibility rules
- policy-driven degraded notices such as repeated invalid external control input
From RuntimeSnapshotProvider
Expected observations:
- snapshot publication duration
- snapshot build failure
- snapshot version churn metrics
- repeated publish retries or stale-snapshot conditions
From ControlServices
Expected observations:
- OSC decode failures
- websocket broadcast failures
- REST/control transport errors
- ingress queue depth
- coalescing/drop counts
- file-watch reload request activity
From RenderEngine
Expected observations:
- frame render duration
- upload duration
- readback duration
- fallback to synchronous readback
- preview present timing
- render-local state resets caused by reload or incompatibility
From VideoBackend
Expected observations:
- current playout queue depth
- input signal state
- late frames
- dropped frames
- backend mode changes
- fallback from 10-bit to 8-bit input
- output-only black-frame mode
Current Code Mapping
The current codebase already contains several telemetry responsibilities that should migrate here.
RuntimeHost Status Setters
These are the clearest existing candidates:
- SetSignalStatus(...) / TrySetSignalStatus(...)
- SetPerformanceStats(...) / TrySetPerformanceStats(...)
- SetFramePacingStats(...) / TrySetFramePacingStats(...)
See:
In the target architecture, this kind of state should no longer sit on the same object that owns persistent layer truth.
Render Timing Production
Current render timing is produced in:
That timing sample should conceptually become:
RenderEngine -> HealthTelemetry::RecordTimingSample(...)
not:
RenderEngine -> RuntimeHost::TrySetPerformanceStats(...)
Playout And Signal Status Production
Current signal and frame pacing updates are produced in:
These should eventually become structured VideoBackend observations instead of bridge-to-host status writes.
Direct Warning And Log Paths
Current examples:
- late/dropped frame warnings:
- backend fallback warnings:
- OSC errors:
All of these are clear migration candidates for AppendLogEntry(...), RaiseWarning(...), or counter/timing updates.
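A hypothetical before/after for one such migration. The entry fields and the helper are assumptions; the point is that the text survives, but classification becomes first-class:

```cpp
#include <string>

// Structured replacement for a direct debug-string warning path.
struct TelemetryLogEntry {
    std::string subsystem;
    std::string severity;
    std::string category;
    std::string message;
};

// Before: OutputDebugStringA("DeckLink: late frame\n");
// After: build a classified entry that a telemetry sink can also forward
// to debug output if that sink stays enabled.
TelemetryLogEntry MakeLateFrameWarning(int lateCount) {
    return {"VideoBackend", "Warning", "playout",
            "late frames observed: " + std::to_string(lateCount)};
}
```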
Health Snapshot Contract
HealthTelemetry should expose one coherent health snapshot that other publication layers can consume.
That snapshot should be able to answer, at minimum:
- what the overall app health is
- whether input signal is present
- whether playout is healthy, degraded, or underrunning
- whether render timing is within budget
- what active warnings exist
- what recent notable events occurred
- what the current subsystem-specific states are
The important boundary is:
- HealthTelemetry builds the health snapshot
- ControlServices may publish it
- UI consumes it
That avoids rebuilding health summaries ad hoc in UI-facing runtime state serializers.
Concurrency Expectations
This subsystem will likely receive updates from multiple threads:
- control ingress threads
- render thread
- backend callback threads
- coordinator/service threads
So the design should assume:
- low-contention write paths
- bounded memory
- no long-held global mutex that callbacks and render both depend on
Phase 1 does not require lock-free implementation, but it does require the architecture to avoid recreating the RuntimeHost problem where health writes share the same lock as durable state and render-facing concerns.
Practical expectations:
- per-domain aggregation or lightweight internal locking is acceptable
- read snapshots should be cheap and stable
- callback paths should record telemetry cheaply and return
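One way to keep the hot-path recording cheap, as a sketch: counters that producers bump with a relaxed atomic and return, so no mutex is shared between callbacks, render, and the snapshot builder. Whether relaxed ordering is sufficient everywhere is a design decision, not something this document mandates:

```cpp
#include <atomic>
#include <cstdint>

// Illustrative hot-path counter: producers bump an atomic and return; the
// snapshot builder reads it later on its own schedule.
class TelemetryCounter {
public:
    void Add(std::uint64_t delta) {
        // Relaxed is enough for a standalone statistic: no other memory
        // needs to be ordered against this increment.
        value_.fetch_add(delta, std::memory_order_relaxed);
    }

    std::uint64_t Load() const {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};
```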
Migration Plan From Current Code
The safest migration path is to peel telemetry responsibilities away from the existing classes incrementally.
Step 1: Introduce The HealthTelemetry Interface
Create a small interface and health model types first.
Initial responsibilities:
- append structured logs
- record timing samples
- record counter deltas
- raise and clear warnings
- build a read-only health snapshot
The first implementation can still be backed by simple in-memory structures.
Step 2: Move New Observations Off RuntimeHost
Before removing old setters, route new health-style work into HealthTelemetry instead of adding more RuntimeHost status fields.
This prevents the old status surface from growing during migration.
Step 3: Replace RuntimeHost Status Setters With Telemetry Producers
Refactor:
- render timing writes
- signal status writes
- playout pacing writes
so they publish structured observations instead of mutating store-adjacent fields.
Step 4: Replace Direct OutputDebugStringA Warning Paths
Wrap common warning/error cases in telemetry producers.
This includes:
- OSC decode/dispatch failures
- DeckLink late/dropped frame notifications
- backend fallback notices
- screenshot write failures
Direct debug output can remain as a sink of telemetry if desired, but not as the primary source of truth.
Step 5: Publish Health Snapshot Through UI/Diagnostics Paths
Once the snapshot format exists, let ControlServices publish health summaries and recent warnings explicitly rather than depending on the runtime-state payload alone.
Risks
1. Telemetry becomes a hidden behavior controller
If warning states start being used as the indirect way subsystems tell each other what to do, the subsystem boundary will fail.
Guardrail:
- telemetry observes and reports
- it does not coordinate or command
2. Logging stays string-only
If the subsystem only centralizes text logging without structure, later diagnostics will still be difficult.
Guardrail:
- severity, subsystem, category, and optional fields should be first-class
3. Timing writes become too expensive
If every sample requires heavy locking or snapshot rebuilds, render and callback timing could regress.
Guardrail:
- cheap recording path
- derived summaries built separately from hot-path writes
4. Health snapshot duplicates runtime truth
If health snapshots start storing copies of durable runtime state, the subsystem boundary will blur again.
Guardrail:
- health snapshots summarize operational state
- they do not become a second runtime store
5. Warning severity semantics drift by subsystem
If each subsystem invents its own meaning for warning/degraded/error, operator visibility becomes noisy and inconsistent.
Guardrail:
- define shared severity and health-state vocabulary early
Open Questions
1. Should debug-output sinks remain enabled by default?
Current recommendation:
- yes, as a sink fed by structured telemetry entries, not as the source of truth
2. How much timing history should be retained in memory?
Current recommendation:
- enough for short-term live troubleshooting and UI summaries
- not an unbounded time-series archive
3. Should operator-facing health and engineering diagnostics use the same snapshot?
Current recommendation:
- share one core telemetry model
- allow separate derived views for concise operator summaries versus deeper engineering detail
4. Where should threshold policy live if the app later formalizes warnings like "render over budget"?
Current recommendation:
- telemetry may evaluate declared thresholds
- subsystem owners still own mitigation behavior
5. Should input signal presence remain part of runtime state or move fully into telemetry?
Current recommendation:
- treat it as operational health state under VideoBackend reporting into telemetry
- avoid keeping it as a core durable runtime-store concern
Success Criteria For This Subsystem
HealthTelemetry can be considered well-defined once the codebase can say, without ambiguity:
- all major subsystems have one place to publish timing, warnings, and counters
- health and timing state no longer share ownership with durable runtime state
- the UI can consume a stable health snapshot without scraping unrelated runtime fields
- direct debug-string warning paths are being retired or wrapped behind structured telemetry
- degraded-but-running conditions are visible as first-class state
Short Version
HealthTelemetry is the subsystem that should answer:
- what is healthy right now
- what is degraded right now
- what recent warnings and errors occurred
- how render, control, and playout timing are behaving
It should:
- collect structured logs
- collect warnings and counters
- collect timing samples and gauges
- build stable health snapshots for publication
It should not:
- own core runtime truth
- decide app behavior
- coordinate recovery actions
- become a replacement for the render or backend policy layers
If this boundary holds, later phases can remove timing and warning state from RuntimeHost and move toward a much more diagnosable live system.