# HealthTelemetry Subsystem Design
This document expands the `HealthTelemetry` subsystem introduced in [PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md).

`HealthTelemetry` is the subsystem that owns operational visibility for the app. Its purpose is to gather health state, warnings, counters, logs, and timing observations from the other subsystems and publish them in a structured way without becoming a second control plane.

Before the Phase 1 runtime split, those responsibilities were fragmented across `RuntimeHost` status setters, ad hoc `OutputDebugStringA` calls, callback-local warnings, and UI-facing runtime-state payloads. The result was that the app could often detect problems, but did not yet have one clear place that answered:

- what is healthy right now
- what is degraded right now
- what has recently gone wrong
- which subsystem is under pressure
- how timing behavior is trending over time

`HealthTelemetry` is the target boundary that should answer those questions.
## Why This Subsystem Exists
The codebase already contains meaningful health and timing signals, but some are still spread through unrelated ownership domains:

- previous `RuntimeHost` status fields stored signal and timing status:
  - `RuntimeHost.h`
  - `RuntimeHost.cpp`
- render and bridge code historically reported timing by writing back into `RuntimeHost`:
  - [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:50)
  - [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:49)
- backend warning paths still log directly:
  - [DeckLinkFrameTransfer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkFrameTransfer.cpp:84)
  - [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:305)
- control ingress failures still log directly:
  - [OscServer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/OscServer.cpp:142)
  - [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:100)

This creates several recurring problems:

- health information shares storage and lock scope with runtime state
- warnings are not consistently classified by subsystem or severity
- timing data is hard to compare across render, control, and backend paths
- UI connection state and operational state are too closely coupled
- logging is mostly text-first instead of structured-first
- recovery behavior is hard to audit because the app does not retain a coherent health snapshot

`HealthTelemetry` exists so timing and health concerns have one subsystem whose only job is observation and reporting, instead of drifting back into runtime storage, callback-local logging, or UI payload assembly.
## Design Goals
`HealthTelemetry` should optimize for:

- one authoritative home for operational visibility
- structured health state per subsystem
- timing and counter recording that does not require a UI to be connected
- low-friction reporting from render, backend, coordinator, and services
- explicit degraded-mode reporting instead of only raw text logs
- support for live operator summaries and deeper engineering diagnostics
- minimal risk of telemetry writes becoming a render or callback bottleneck
## Responsibilities
`HealthTelemetry` owns structured operational visibility.

Primary responsibilities:

- accept timing samples from major subsystems
- accept counter deltas and point-in-time gauges
- accept warning, error, and degraded-state transitions
- collect subsystem-scoped health state
- collect operator-visible summary state
- collect structured log entries
- build stable health snapshots for UI, diagnostics, and later persistence/export if desired
- retain recent history needed for short-term troubleshooting
- classify observations by subsystem, severity, and category

Secondary responsibilities that still fit here:

- smoothing or rolling-window summaries for timing metrics
- mapping raw subsystem observations into operator-facing health summaries
- deduplicating repeated warnings
- tracking warning open/clear lifecycles
- providing bounded in-memory history for recent logs and warning transitions
## Explicit Non-Responsibilities
`HealthTelemetry` should not become a behavior owner.

It does not own:

- layer stack truth
- persistence policy
- render scheduling
- DeckLink scheduling
- OSC buffering or routing
- reload coordination
- shader compilation
- recovery actions themselves

It also should not decide:

- whether render should skip a frame
- whether VideoBackend should increase queue depth
- whether RuntimeCoordinator should reject a mutation
- whether ControlServices should drop or coalesce ingress traffic

Those decisions belong to the subsystem being observed. `HealthTelemetry` may describe that a subsystem is degraded, but it must not quietly become the mechanism that tells the app how to react.
## Ownership Boundaries
`HealthTelemetry` owns the following state categories.
### Structured Log State
Examples:

- subsystem name
- severity
- category
- timestamp
- message
- optional structured fields such as layer id, preset name, queue depth, or shader id

This replaces the idea that `OutputDebugStringA` text is itself the main diagnostic product.
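As a sketch, one such entry could be shaped like this (all field names here are assumptions, not a final contract):

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical severity and entry shapes; fields mirror the list above.
enum class TelemetrySeverity { Debug, Info, Warning, Error };

struct TelemetryLogEntry
{
    std::string subsystem;                                // e.g. "VideoBackend"
    TelemetrySeverity severity = TelemetrySeverity::Info;
    std::string category;                                 // e.g. "playout"
    std::chrono::system_clock::time_point timestamp;
    std::string message;                                  // human-readable text, derived not primary
    std::map<std::string, std::string> fields;            // layer id, preset name, queue depth, ...
};
```

The text `message` stays, but it becomes one field among several instead of the whole diagnostic product.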
### Warning And Error State
Examples:

- active warning set
- warning occurrence counts
- first-seen and last-seen timestamps
- clear timestamps
- subsystem-scoped degraded flags

This is the durable in-memory operational state that should answer "what is currently wrong?" even if no UI was connected when the warning was raised.
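A minimal sketch of how that raise/clear lifecycle and occurrence counting could be tracked (the `WarningTable` type and its methods are illustrative, not existing code):

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical warning record tracking open/clear lifecycle and dedup counts.
struct TelemetryWarningRecord
{
    std::string key;              // stable dedup key, e.g. "decklink.late_frame"
    std::uint64_t occurrences = 0;
    Clock::time_point firstSeen;
    Clock::time_point lastSeen;
    bool active = false;
};

class WarningTable
{
public:
    // Repeated raises with the same key update counts instead of duplicating.
    void Raise(const std::string& key)
    {
        auto& rec = m_records[key];
        const auto now = Clock::now();
        if (rec.occurrences == 0)
            rec.firstSeen = now;
        rec.key = key;
        rec.lastSeen = now;
        rec.active = true;
        ++rec.occurrences;
    }

    // Clearing marks the warning inactive but keeps its history.
    void Clear(const std::string& key)
    {
        auto it = m_records.find(key);
        if (it != m_records.end())
            it->second.active = false;
    }

    const TelemetryWarningRecord* Find(const std::string& key) const
    {
        auto it = m_records.find(key);
        return it == m_records.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, TelemetryWarningRecord> m_records;
};
```

Because cleared records are retained rather than erased, the table can still answer "what went wrong earlier?" after conditions recover.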
### Timing State
Examples:

- render duration
- frame budget
- playout completion interval
- smoothed completion interval
- queue depth
- input upload skip count
- async readback fallback count
- control ingress lag or queue depth
- snapshot publication cost

This state should be organized as time-series-like rolling telemetry, not as a grab bag of unrelated `double` fields mixed into the runtime store.
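A rolling-window series could be as simple as a fixed-capacity ring buffer; this sketch (names assumed) records samples cheaply and derives a summary separately:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity rolling series for one metric,
// e.g. render duration in milliseconds.
class TelemetryTimingSeries
{
public:
    explicit TelemetryTimingSeries(std::size_t capacity)
        : m_samples(capacity, 0.0) {}

    // Cheap write path: one store, one index bump.
    void Record(double valueMs)
    {
        m_samples[m_next] = valueMs;
        m_next = (m_next + 1) % m_samples.size();
        if (m_count < m_samples.size())
            ++m_count;
    }

    // Derived summary over the retained window; 0.0 when empty.
    double Mean() const
    {
        if (m_count == 0)
            return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i < m_count; ++i)
            sum += m_samples[i];
        return sum / static_cast<double>(m_count);
    }

    std::size_t Count() const { return m_count; }

private:
    std::vector<double> m_samples;
    std::size_t m_next = 0;
    std::size_t m_count = 0;
};
```

Old samples fall out of the window automatically, which keeps memory bounded without any eviction policy.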
### Health Snapshot State
Examples:

- current subsystem health summaries
- current operator-facing overall health summary
- most recent warning list
- recent counters and timing summaries
- "degraded but still running" status

This is the material that `ControlServices` or a diagnostics endpoint may later publish.
## State Model
The subsystem should model health and telemetry in a way that supports both machine-friendly and operator-friendly views.

Suggested conceptual model:

- `TelemetryLogEntry`
- `TelemetryWarningRecord`
- `TelemetryCounterState`
- `TelemetryGaugeState`
- `TelemetryTimingSeries`
- `SubsystemHealthState`
- `HealthSnapshot`

Important distinction:

- raw observations are append/update operations
- health snapshots are derived read models

That distinction matters because the system should be able to retain richer recent telemetry internally than what is necessarily sent to the UI on every refresh.
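The append-versus-derive split could be sketched like this, using counters as the example (the `TelemetryStore` type and its methods are hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Raw state mutated by append/update operations.
struct TelemetryCounterState
{
    std::string name;
    std::uint64_t value = 0;
};

// Derived read model: a stable copy that is safe to hand to readers.
struct HealthSnapshot
{
    std::vector<TelemetryCounterState> counters;
};

class TelemetryStore
{
public:
    // Write side: cheap append/update of raw observations.
    void RecordCounterDelta(const std::string& name, std::uint64_t delta)
    {
        for (auto& c : m_counters)
        {
            if (c.name == name) { c.value += delta; return; }
        }
        m_counters.push_back({name, delta});
    }

    // Read side: builds a derived snapshot; internal buffers stay private,
    // so the store may retain richer history than any one snapshot exposes.
    HealthSnapshot BuildHealthSnapshot() const
    {
        HealthSnapshot snap;
        snap.counters = m_counters;
        return snap;
    }

private:
    std::vector<TelemetryCounterState> m_counters;
};
```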
## Subsystem Health Domains
`HealthTelemetry` should track health by subsystem rather than as one flat status blob.

At minimum, Phase 1 should assume domains for:

- `RuntimeStore`
- `RuntimeCoordinator`
- `RuntimeSnapshotProvider`
- `ControlServices`
- `RenderEngine`
- `VideoBackend`

Optional cross-cutting domain:

- `ApplicationShell`

Each domain should be able to express states such as:

- `Healthy`
- `Warning`
- `Degraded`
- `Error`
- `Unavailable`

The exact enum can change, but the design should preserve the idea that each subsystem reports into its own health lane first, and only then is an overall status derived.
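One way to sketch per-domain lanes with a derived overall status (enum names follow the lists above; the worst-lane derivation rule is an assumption, not a decided policy):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Enum order doubles as severity order so lanes can be compared directly.
enum class SubsystemHealth { Healthy, Warning, Degraded, Error, Unavailable };

enum class Domain { RuntimeStore, RuntimeCoordinator, RuntimeSnapshotProvider,
                    ControlServices, RenderEngine, VideoBackend, Count };

class HealthLanes
{
public:
    // Each subsystem reports into its own lane first.
    void Report(Domain d, SubsystemHealth h)
    {
        m_lanes[static_cast<std::size_t>(d)] = h;
    }

    // Overall status is derived, never stored independently: worst lane wins.
    SubsystemHealth Overall() const
    {
        return *std::max_element(m_lanes.begin(), m_lanes.end());
    }

private:
    // Value-initialized, so every lane starts Healthy.
    std::array<SubsystemHealth, static_cast<std::size_t>(Domain::Count)> m_lanes{};
};
```

Deriving the overall status keeps it impossible for "overall" and per-domain state to disagree.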
## Logging Boundaries
Logging belongs here, but logging should be structured-first.

Expected inputs:

- subsystem-scoped debug information
- warning and error messages
- recovery events
- notable state transitions
- significant operator actions that matter for diagnostics

Expected design rules:

- textual messages are still useful, but they should be wrapped in a structured log entry
- repeated transient failures should be rate-limited or deduplicated at the telemetry layer where possible
- log storage should be bounded in memory
- UI publication should read from health/log snapshots, not scrape stdout/debug output

Examples of current direct log paths that should eventually move behind `HealthTelemetry`:

- OSC decode/dispatch failures
- screenshot write failures
- DeckLink fallback warnings
- late/dropped frame warnings
## Metrics And Timing Boundaries
Timing and metrics should also move here, but their ownership line matters.

`HealthTelemetry` should own:

- metric collection interfaces
- rolling summaries
- recent history buffers
- warning thresholds if the app later chooses to define them declaratively
- operator-facing derived summaries

The producing subsystem should still own:

- the meaning of the measurement
- when it is sampled
- whether it triggers local mitigation

Examples:

- `RenderEngine` owns when render duration is sampled
- `VideoBackend` owns when queue depth or playout lateness is sampled
- `ControlServices` owns when ingress backlog is sampled
- `RuntimeSnapshotProvider` owns when snapshot publish/build timing is sampled

`HealthTelemetry` should not invent those timings by inference. It records them when producers report them.
## Proposed Interfaces
These are target-shape interfaces, not final signatures.
### Write/Record Interface
Core write-side operations could look like:

```cpp
enum class TelemetrySeverity;
enum class TelemetrySubsystem;

struct TelemetryLogEntry;
struct TelemetryWarning;
struct TelemetryTimingSample;
struct TelemetryCounterDelta;
struct TelemetryGaugeUpdate;

class IHealthTelemetry
{
public:
    virtual void AppendLogEntry(const TelemetryLogEntry& entry) = 0;
    virtual void RaiseWarning(const TelemetryWarning& warning) = 0;
    virtual void ClearWarning(std::string_view warningKey) = 0;
    virtual void RecordTimingSample(const TelemetryTimingSample& sample) = 0;
    virtual void RecordCounterDelta(const TelemetryCounterDelta& delta) = 0;
    virtual void RecordGauge(const TelemetryGaugeUpdate& gauge) = 0;
    virtual void ReportSubsystemState(TelemetrySubsystem subsystem,
                                      SubsystemHealthState state) = 0;
};
```

The key is that every subsystem should be able to publish observations without also needing to know how UI payloads, rolling summaries, or log retention are implemented.
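As a usage sketch, a producer call site could then look like this; the stand-in types mirror the write-side interface, and the `ReportFrameRenderTime` helper and sample field names are hypothetical:

```cpp
#include <string>
#include <vector>

// Stand-in types so this call-site sketch compiles on its own; the real
// definitions would come from the telemetry interface.
enum class TelemetrySubsystem { RenderEngine, VideoBackend };

struct TelemetryTimingSample
{
    TelemetrySubsystem subsystem;
    std::string metric;   // e.g. "frame_render_ms" (assumed name)
    double value = 0.0;
};

class IHealthTelemetry
{
public:
    virtual ~IHealthTelemetry() = default;
    virtual void RecordTimingSample(const TelemetryTimingSample& sample) = 0;
};

// Trivial in-memory implementation, useful for tests or illustration.
class RecordingTelemetry : public IHealthTelemetry
{
public:
    void RecordTimingSample(const TelemetryTimingSample& sample) override
    {
        samples.push_back(sample);
    }
    std::vector<TelemetryTimingSample> samples;
};

// Hypothetical producer: the render loop measured a duration and reports it.
void ReportFrameRenderTime(IHealthTelemetry& telemetry, double renderMs)
{
    telemetry.RecordTimingSample(
        {TelemetrySubsystem::RenderEngine, "frame_render_ms", renderMs});
}
```

The point is that the producer hands over a structured sample and returns; it never touches UI payloads or store-adjacent status fields.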
### Read Interface
Expected read-side operations:

- `BuildHealthSnapshot()`
- `GetSubsystemHealth(...)`
- `GetRecentLogs(...)`
- `GetActiveWarnings()`
- `GetRecentTimingSummary(...)`

Design notes:

- the read interface should return stable snapshots or read models
- UI/websocket publication should consume those snapshots through `ControlServices`
- read-side access should not require direct knowledge of internal ring buffers or lock layout
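A target-shape sketch of that read side, with a trivial in-memory implementation for illustration (all type and method names are assumptions):

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct TelemetryLogEntry { std::string subsystem; std::string message; };

// Stable read model: a copy, not a view into internal ring buffers.
struct HealthSnapshot
{
    std::vector<std::string> activeWarnings;
    std::vector<TelemetryLogEntry> recentLogs;
};

class IHealthTelemetryReader
{
public:
    virtual ~IHealthTelemetryReader() = default;
    virtual HealthSnapshot BuildHealthSnapshot() const = 0;
    virtual std::vector<TelemetryLogEntry> GetRecentLogs(std::size_t maxCount) const = 0;
    virtual std::vector<std::string> GetActiveWarnings() const = 0;
};

// Trivial illustration: every read returns a detached copy.
class InMemoryTelemetryReader : public IHealthTelemetryReader
{
public:
    std::vector<std::string> warnings;
    std::vector<TelemetryLogEntry> logs;

    HealthSnapshot BuildHealthSnapshot() const override
    {
        return {warnings, logs};
    }

    std::vector<TelemetryLogEntry> GetRecentLogs(std::size_t maxCount) const override
    {
        const std::size_t n = maxCount < logs.size() ? maxCount : logs.size();
        return std::vector<TelemetryLogEntry>(logs.end() - static_cast<std::ptrdiff_t>(n),
                                              logs.end());
    }

    std::vector<std::string> GetActiveWarnings() const override { return warnings; }
};
```

Returning copies keeps callers decoupled from lock layout: a reader can hold its snapshot as long as it likes without blocking writers.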
## Producer Expectations By Subsystem
The parent Phase 1 design already allows multiple subsystems to publish into telemetry. This section makes that concrete.

### From `RuntimeCoordinator`

Expected observations:

- mutation rejected
- reload requested
- preset apply failed
- transient state cleared due to compatibility rules
- policy-driven degraded notices such as repeated invalid external control input

### From `RuntimeSnapshotProvider`

Expected observations:

- snapshot publication duration
- snapshot build failure
- snapshot version churn metrics
- repeated publish retries or stale-snapshot conditions

### From `ControlServices`

Expected observations:

- OSC decode failures
- websocket broadcast failures
- REST/control transport errors
- ingress queue depth
- coalescing/drop counts
- file-watch reload request activity

### From `RenderEngine`

Expected observations:

- frame render duration
- upload duration
- readback duration
- fallback to synchronous readback
- preview present timing
- render-local state resets caused by reload or incompatibility

### From `VideoBackend`

Expected observations:

- current playout queue depth
- system-memory playout frame counts by state: free, ready, and scheduled
- system-memory playout underrun, repeat, and drop counters
- system-memory frame age at schedule and completion time
- input signal state
- late frames
- dropped frames
- backend mode changes
- fallback from 10-bit to 8-bit input
- output-only black-frame mode
## Current Code Mapping
The current codebase already contains several telemetry responsibilities that should migrate here.

### Previous `RuntimeHost` Status Setters

These were the clearest initial migration candidates:

- `SetSignalStatus(...)`
- `TrySetSignalStatus(...)`
- `SetPerformanceStats(...)`
- `TrySetPerformanceStats(...)`
- `SetFramePacingStats(...)`
- `TrySetFramePacingStats(...)`

See:

- `RuntimeHost.h`
- `RuntimeHost.cpp`

In the target architecture, this kind of state should not sit on the same object that owns persistent layer truth.

### Render Timing Production

Current render timing is produced in:

- [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:50)

That timing sample should conceptually become:

- `RenderEngine -> HealthTelemetry::RecordTimingSample(...)`

not the old pattern:

- `RenderEngine -> RuntimeHost::TrySetPerformanceStats(...)`
### Playout And Signal Status Production
Current signal and frame pacing updates are produced in:

- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:49)
- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:61)

These should eventually become structured `VideoBackend` observations instead of bridge-to-host status writes.
### Direct Warning And Log Paths
Current examples:

- late/dropped frame warnings:
  - [DeckLinkFrameTransfer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkFrameTransfer.cpp:84)
- backend fallback warnings:
  - [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:305)
  - [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:320)
- OSC errors:
  - [OscServer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/OscServer.cpp:142)
  - [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:100)

All of these are clear migration candidates for `AppendLogEntry(...)`, `RaiseWarning(...)`, or counter/timing updates.
## Health Snapshot Contract
`HealthTelemetry` should expose one coherent health snapshot that other publication layers can consume.

That snapshot should be able to answer, at minimum:

- what the overall app health is
- whether input signal is present
- whether playout is healthy, degraded, or underrunning
- whether render timing is within budget
- what active warnings exist
- what recent notable events occurred
- what the current subsystem-specific states are

The important boundary is:

- `HealthTelemetry` builds the health snapshot
- `ControlServices` may publish it
- UI consumes it

That avoids rebuilding health summaries ad hoc in UI-facing runtime state serializers.
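A snapshot shape that mirrors those questions could look like this (field names are illustrative only, not a final contract):

```cpp
#include <string>
#include <vector>

enum class OverallHealth { Healthy, Degraded, Error };

// Hypothetical contract: one field per question the snapshot must answer.
struct HealthSnapshot
{
    OverallHealth overall = OverallHealth::Healthy;  // overall app health
    bool inputSignalPresent = false;                 // input signal
    bool playoutHealthy = false;                     // playout state
    bool renderWithinBudget = false;                 // render timing
    std::vector<std::string> activeWarnings;         // active warning keys
    std::vector<std::string> recentEvents;           // recent notable events
    std::vector<std::string> subsystemStates;        // per-subsystem summaries
};
```

`HealthTelemetry` would fill this struct; `ControlServices` would only serialize and transport it.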
## Concurrency Expectations
This subsystem will likely receive updates from multiple threads:

- control ingress threads
- render thread
- backend callback threads
- coordinator/service threads

So the design should assume:

- low-contention write paths
- bounded memory
- no long-held global mutex that callbacks and render both depend on

Phase 1 does not require a lock-free implementation, but it does require the architecture to avoid recreating the old problem where health writes share the same lock as durable state and render-facing concerns.

Practical expectations:

- per-domain aggregation or lightweight internal locking is acceptable
- read snapshots should be cheap and stable
- callback paths should record telemetry cheaply and return
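For example, hot-path counters can be plain relaxed atomics so producers never block (a sketch, not a prescribed implementation):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical hot-path counter: callbacks bump an atomic and return.
// No mutex is shared with durable state or render-facing concerns.
class TelemetryCounter
{
public:
    // Safe to call from render, backend callback, or ingress threads.
    void Add(std::uint64_t delta)
    {
        m_value.fetch_add(delta, std::memory_order_relaxed);
    }

    // The read/snapshot side tolerates slightly stale values.
    std::uint64_t Load() const
    {
        return m_value.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> m_value{0};
};
```

Relaxed ordering is enough here because counters carry no cross-thread synchronization duties; derived summaries are built later, off the hot path.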
## Migration Plan From Current Code
The safest migration path is to peel telemetry responsibilities away from the existing classes incrementally.

### Step 1: Introduce The `HealthTelemetry` Interface

Create a small interface and health model types first.

Initial responsibilities:

- append structured logs
- record timing samples
- record counter deltas
- raise and clear warnings
- build a read-only health snapshot

The first implementation can still be backed by simple in-memory structures.

### Step 2: Keep New Observations Off Runtime Storage

Route new health-style work into `HealthTelemetry` instead of adding more status fields to runtime storage.

This prevents the old status surface from growing during migration.

### Step 3: Replace Legacy Status Setters With Telemetry Producers

Refactor:

- render timing writes
- signal status writes
- playout pacing writes

so they publish structured observations instead of mutating store-adjacent fields.
### Step 4: Replace Direct `OutputDebugStringA` Warning Paths
Wrap common warning/error cases in telemetry producers.

This includes:

- OSC decode/dispatch failures
- DeckLink late/dropped frame notifications
- backend fallback notices
- screenshot write failures

Direct debug output can remain as a sink of telemetry if desired, but not as the primary source of truth.
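A sketch of that arrangement: structured entries are stored first, and the debug-text line is derived output fed to a pluggable sink (on Windows that sink could forward to `OutputDebugStringA`; all names here are assumptions):

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct TelemetryLogEntry
{
    std::string subsystem;
    std::string severity;
    std::string message;
};

// Text formatting is a derived view over the structured entry.
std::string FormatForDebugSink(const TelemetryLogEntry& entry)
{
    std::ostringstream out;
    out << "[" << entry.subsystem << "][" << entry.severity << "] " << entry.message;
    return out.str();
}

class TelemetryLog
{
public:
    explicit TelemetryLog(std::function<void(const std::string&)> sink)
        : m_sink(std::move(sink)) {}

    void Append(const TelemetryLogEntry& entry)
    {
        m_entries.push_back(entry);            // structured storage first
        if (m_sink)
            m_sink(FormatForDebugSink(entry)); // text output is derived
    }

    const std::vector<TelemetryLogEntry>& Entries() const { return m_entries; }

private:
    std::function<void(const std::string&)> m_sink;
    std::vector<TelemetryLogEntry> m_entries;
};
```

Swapping or disabling the sink changes only where text goes; the structured record the app keeps is unaffected.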
### Step 5: Publish Health Snapshot Through UI/Diagnostics Paths
Once the snapshot format exists, let `ControlServices` publish health summaries and recent warnings explicitly rather than depending on the runtime-state payload alone.
## Risks
### 1. Telemetry becomes a hidden behavior controller

If warning states start being used as the indirect way subsystems tell each other what to do, the subsystem boundary will fail.

Guardrail:

- telemetry observes and reports
- it does not coordinate or command

### 2. Logging stays string-only

If the subsystem only centralizes text logging without structure, later diagnostics will still be difficult.

Guardrail:

- severity, subsystem, category, and optional fields should be first-class

### 3. Timing writes become too expensive

If every sample requires heavy locking or snapshot rebuilds, render and callback timing could regress.

Guardrail:

- cheap recording path
- derived summaries built separately from hot-path writes

### 4. Health snapshot duplicates runtime truth

If health snapshots start storing copies of durable runtime state, the subsystem boundary will blur again.

Guardrail:

- health snapshots summarize operational state
- they do not become a second runtime store

### 5. Warning severity semantics drift by subsystem

If each subsystem invents its own meaning for warning/degraded/error, operator visibility becomes noisy and inconsistent.

Guardrail:

- define shared severity and health-state vocabulary early
## Open Questions
### 1. Should debug-output sinks remain enabled by default?

Current recommendation:

- yes, as a sink fed by structured telemetry entries, not as the source of truth

### 2. How much timing history should be retained in memory?

Current recommendation:

- enough for short-term live troubleshooting and UI summaries
- not an unbounded time-series archive

### 3. Should operator-facing health and engineering diagnostics use the same snapshot?

Current recommendation:

- share one core telemetry model
- allow separate derived views for concise operator summaries versus deeper engineering detail

### 4. Where should threshold policy live if the app later formalizes warnings like "render over budget"?

Current recommendation:

- telemetry may evaluate declared thresholds
- subsystem owners still own mitigation behavior

### 5. Should input signal presence remain part of runtime state or move fully into telemetry?

Current recommendation:

- treat it as operational health state under `VideoBackend` reporting into telemetry
- avoid keeping it as a core durable runtime-store concern
## Success Criteria For This Subsystem
`HealthTelemetry` can be considered well-defined once the codebase can say, without ambiguity:

- all major subsystems have one place to publish timing, warnings, and counters
- health and timing state no longer share ownership with durable runtime state
- the UI can consume a stable health snapshot without scraping unrelated runtime fields
- direct debug-string warning paths are being retired or wrapped behind structured telemetry
- degraded-but-running conditions are visible as first-class state
## Short Version
`HealthTelemetry` is the subsystem that should answer:

- what is healthy right now
- what is degraded right now
- what recent warnings and errors occurred
- how render, control, and playout timing are behaving

It should:

- collect structured logs
- collect warnings and counters
- collect timing samples and gauges
- build stable health snapshots for publication

It should not:

- own core runtime truth
- decide app behavior
- coordinate recovery actions
- become a replacement for the render or backend policy layers

If this boundary holds, later phases can keep moving toward a much more diagnosable live system without putting timing and warning state back into runtime storage.