# Architecture Resilience Review

This note summarizes the main architectural improvements that would make the app more resilient during live use, especially around timing isolation, failure isolation, and recoverability.

Phase checklist:

- [x] Define subsystem boundaries and target architecture
- [x] Introduce an internal event model
- [x] Split `RuntimeHost`
- [x] Finish live-state and service-facing coordination
- [x] Make the render thread the sole GL owner
- [ ] Refactor live state layering into an explicit composition model
- [ ] Move persistence onto a background snapshot writer
- [ ] Make DeckLink/backend lifecycle explicit with a state machine
- [ ] Add structured health, telemetry, and operational reporting

Checklist note:

- The checked Phase 1 item means the subsystem vocabulary, dependency direction, state categories, design package, and runtime implementation foothold are in place.
- The checked Phase 2 item means the internal event model substrate is complete enough for later phases: the typed event vocabulary, app-owned dispatcher, coalesced event pump, reload bridge events, production bridges, and pure event tests are in place. Remaining items in [PHASE_2_INTERNAL_EVENT_MODEL_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_2_INTERNAL_EVENT_MODEL_DESIGN.md) are narrow follow-ups, mainly completion/failure observations and later replacement of the runtime-store poll fallback with real file-watch events.
- The checked Phase 3 item means the render-facing state path now has named live-state, composition, frame-state, resolver, and service-bridge boundaries. `OpenGLComposite::renderEffect()` is reduced to runtime work, frame input construction, and frame rendering.
- The checked Phase 4 item means normal runtime GL work is now owned by a dedicated `RenderEngine` render thread.
  Input upload, output render, preview, screenshot capture, render-local resets, and shader application enter through render-thread queue/request paths instead of caller-thread context borrowing. The remaining output timing risk is callback-coupled synchronous output production, which is intentionally tracked for the later DeckLink/backend lifecycle and playout-queue work.
- It does not mean the whole app is fully extracted. Deeper live-state layering, background persistence, backend lifecycle/playout queue policy, and richer telemetry continue through later phases.

## Timing Review

The recent OSC work removed several control-path stalls, but the app still has a few deeper timing characteristics that matter for live resilience:

- output playout is still effectively render-on-demand from the DeckLink completion callback
- output buffering and preroll are now larger, but the buffering model is still static and only loosely related to actual render cost
- GPU readback is partly asynchronous, but the fallback path still returns to synchronous readback on any miss
- preview presentation is best-effort and render-thread queued, but still shares the same render-thread budget as playout
- background service timing is partially event-driven; runtime-store scanning still uses a bounded compatibility poll fallback

Those points are important because they affect not just average performance, but how the app behaves under brief spikes, device jitter, or load bursts.

## Key Findings

### 1. The original runtime host carried too many responsibilities

The original `RuntimeHost` acted as:

- config store
- persistent state store
- live parameter/state authority
- shader package registry owner
- status/telemetry sink
- control mutation entrypoint

That makes it a single contention and failure domain. It is also why OSC and render timing issues repeatedly surfaced around shared state access.
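One way to shrink that contention domain is to let the render path consume immutable snapshots instead of reading the shared authority object directly. A minimal sketch of what that could look like — `RenderSnapshot` and `SnapshotProvider` are hypothetical names for illustration, not the app's actual types:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical render-facing snapshot: immutable once published.
struct RenderSnapshot {
    std::unordered_map<std::string, float> parameters;
    std::uint64_t version = 0;
};

// Writers build a fresh snapshot and publish it atomically; the render
// thread acquires the current pointer once and works from it for the whole
// frame, so no large mutable authority object is shared across threads.
class SnapshotProvider {
public:
    void Publish(RenderSnapshot next) {
        auto ptr = std::make_shared<const RenderSnapshot>(std::move(next));
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::move(ptr);
    }

    // Cheap to call from the render thread: copies one shared_ptr.
    std::shared_ptr<const RenderSnapshot> Acquire() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const RenderSnapshot> current_ =
        std::make_shared<const RenderSnapshot>();
};
```

The key property is that a snapshot acquired by the render thread can never change underneath it, even if a newer snapshot is published mid-frame.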
Relevant code:

- `RuntimeHost.h`

Recommended direction:

- split persisted config/state from live render-facing state
- separate status/telemetry updates from control mutation paths
- make render consume snapshots rather than sharing a large mutable authority object

### 2. OpenGL ownership has moved to the render thread

Phase 4 removed normal runtime dependence on the old shared GL `CRITICAL_SECTION`. `RenderEngine` now owns a dedicated render thread and binds the GL context there for normal input upload, output rendering, preview presentation, screenshot capture, shader application, and render-local reset work.

Relevant code:

- [RenderEngine.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/RenderEngine.cpp:36)
- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:11)
- [OpenGLComposite.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.cpp:168)

This removes cross-thread GL context borrowing as the central correctness model. The remaining timing risk is that output frame production is still synchronous from the DeckLink completion path, so a render/readback spike can still reduce playout headroom.

Recommended direction:

- keep the render thread as the sole GL owner
- replace synchronous output request/response with a bounded producer/consumer playout queue
- keep preview and screenshot subordinate to output deadline pressure

### 3. Control flow is spread across polling and shared-memory patterns

`RuntimeServices` currently mixes:

- file polling
- deferred OSC commit handling
- control service orchestration

OSC ingest, overlay application, and host sync are distributed across several components.
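A typed event pipeline could converge that mixed control flow onto one pattern. A deliberately small sketch using `std::variant` — the event names match the vocabulary used in this note, but the `EventBus` shape itself is illustrative, not the app's dispatcher:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <variant>
#include <vector>

// Illustrative typed events matching this note's vocabulary.
struct OscParameterTargeted { std::string address; float value; };
struct ShaderReloadRequested { std::string packageName; };
struct PersistStateRequested {};

using Event = std::variant<OscParameterTargeted,
                           ShaderReloadRequested,
                           PersistStateRequested>;

// Minimal dispatcher: subsystems subscribe a handler, and producers post
// typed events instead of calling each other (or polling) directly.
class EventBus {
public:
    using Handler = std::function<void(const Event&)>;

    void Subscribe(Handler h) { handlers_.push_back(std::move(h)); }

    void Post(Event e) {
        for (auto& h : handlers_) h(e);
    }

private:
    std::vector<Handler> handlers_;
};
```

A production version would add queuing, coalescing, and thread affinity, but even this shape makes ownership explicit: the subscriber decides what each event means for its subsystem.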
Relevant code:

- [RuntimeServices.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.h:26)
- [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:178)

Recommended direction:

- introduce a small internal event pipeline or message bus
- use typed events for OSC, reloads, persistence requests, and status changes
- make timing ownership explicit per subsystem

Example event types:

- `OscParameterTargeted`
- `RenderOverlaySettled`
- `PersistStateRequested`
- `ShaderReloadRequested`
- `DeckLinkStatusChanged`

### 4. Error handling is still heavily UI-coupled

Failures are often surfaced via `MessageBoxA`, while background services mainly log with `OutputDebugStringA`.

Relevant code:

- [OpenGLComposite.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.cpp:314)
- [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:478)
- [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:205)

This is not ideal for a live system, where modal dialogs and silent debug logging are both poor operational behavior.

Recommended direction:

- introduce structured in-app error reporting
- define severity levels and counters
- prefer degraded runtime states over modal failure handling where possible
- add a rolling log file for operational troubleshooting

### 5. Live OSC overlay and persisted state are still separate concepts without a formal model

The current design works better now, but it still relies on hand-managed reconciliation between:

- persisted/committed parameter state in `RuntimeStore`
- transient OSC overlay state in `RenderEngine`

Relevant code:

- [RenderEngine.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/RenderEngine.h:18)

Recommended direction: formalize three layers of state:

- base persisted state
- operator/UI committed state
- transient live automation overlay

Then render can always resolve:

- `final = base + committed + transient`

That avoids special-case sync behavior becoming scattered across the code.

### 6. DeckLink lifecycle could be modeled more explicitly

`DeckLinkSession` has a number of imperative calls, but startup, preroll, running, degraded, and stopped are not represented as an explicit state machine.

Relevant code:

- [DeckLinkSession.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.h:17)

Recommended direction:

- introduce explicit session states
- define allowed transitions
- centralize recovery behavior
- make shutdown ordering and degraded-mode behavior more predictable

Timing-specific additions:

- separate "device callback received" from "render the next output frame" so output cadence is not driven directly by the completion callback thread
- make playout headroom configurable and adaptive instead of using a fixed compile-time preroll count
- track an explicit backend health state such as `running-steady`, `catching-up`, `late`, and `dropping`

Relevant timing code:

- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:86)
- [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:420)
- [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:487)
- [VideoPlayoutScheduler.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/VideoPlayoutScheduler.cpp:26)

Why this matters:

- the output completion path currently requests a scheduled render through `OpenGLVideoIOBridge::RenderScheduledFrame()`, which asks the render thread to render/read back synchronously and then schedules the next frame in one callback-driven flow
- `VideoPlayoutScheduler::AccountForCompletionResult()` currently reacts to both late and dropped frames by blindly advancing the schedule index by `2`, which is simple but not especially robust
- `kPrerollFrameCount` is now `12`, but `DeckLinkSession::ConfigureOutput()` still creates a fixed pool of `10` mutable output frames; that mismatch suggests the buffering model is not being sized from one coherent source of truth

Recommended direction:

- move playout to a producer/consumer model where a render worker fills output buffers ahead of the DeckLink callback
- define buffer-pool sizing from one policy object, for example: preroll depth, minimum spare buffers, and allowed catch-up depth
- replace fixed "skip two frames" recovery with measured lag accounting based on actual scheduled-versus-completed position
- expose playout latency as a runtime setting or policy, rather than burying it in a constant

### 6a. The current playout timing model is still callback-coupled

The app now has more headroom, but the next output frame is still produced directly in the scheduled-frame completion callback path.
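Decoupling the callback from frame production is the core of the fix this note keeps returning to. A minimal sketch of a bounded playout queue with an explicit underrun policy — all names and the depth policy are illustrative, and the mutex/condition-variable work a real version needs is omitted for brevity:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>

// A completed output frame, ready to hand to the device callback.
struct ReadyFrame { std::uint64_t frameIndex = 0; };

enum class UnderrunPolicy { ReuseNewest, SignalDrop };

// Bounded queue between the render producer and the DeckLink consumer.
// The callback side only dequeues; it never renders.
class PlayoutQueue {
public:
    PlayoutQueue(std::size_t depth, UnderrunPolicy policy)
        : depth_(depth), policy_(policy) {}

    // Producer side (render worker): keep the queue topped up ahead of need.
    // Returns false when the queue is full, so the producer can idle.
    bool PushCompleted(ReadyFrame f) {
        if (frames_.size() >= depth_) return false;
        frames_.push_back(f);
        return true;
    }

    // Consumer side (device callback): dequeue-only. On underrun, either
    // reuse the newest delivered frame or report the miss to the caller.
    std::optional<ReadyFrame> PopForSchedule() {
        if (!frames_.empty()) {
            lastDelivered_ = frames_.front();
            frames_.pop_front();
            return lastDelivered_;
        }
        ++underruns_;
        if (policy_ == UnderrunPolicy::ReuseNewest && lastDelivered_)
            return lastDelivered_;  // stale-frame reuse instead of a visible miss
        return std::nullopt;        // caller decides how to signal the drop
    }

    std::size_t Depth() const { return frames_.size(); }
    std::uint64_t Underruns() const { return underruns_; }

private:
    std::size_t depth_;
    UnderrunPolicy policy_;
    std::deque<ReadyFrame> frames_;
    std::optional<ReadyFrame> lastDelivered_;
    std::uint64_t underruns_ = 0;
};
```

Queue depth here is the latency/resilience knob: deeper queues absorb bigger render spikes at the cost of output latency, which is exactly why the sizing should be a runtime policy rather than a compile-time constant.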
Relevant code:

- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:86)
- [DeckLinkFrameTransfer.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkFrameTransfer.cpp:53)

That means the completion callback is currently responsible for:

- frame pacing accounting
- acquiring the next output buffer
- requesting render-thread output production
- waiting for render/readback completion
- performing output readback
- scheduling the next frame

This works when the app is comfortably within budget, but it makes deadline misses much harder to absorb gracefully.

Recommended direction:

- make the DeckLink callback a lightweight notifier
- have a dedicated playout worker or render worker keep an ahead-of-time queue of ready output frames
- treat callback time as control-plane time, not render time

### 6b. A producer/consumer playout model would be a better long-term fit

The stronger architecture for this app is:

- a render scheduler or dedicated render thread runs at the configured video cadence
- rendering produces completed output frames ahead of need
- those frames are placed into a bounded queue or ring buffer
- the DeckLink side consumes already-prepared frames when callbacks indicate they are needed

That is a better fit than callback-driven rendering because it separates:

- render timing
- GL ownership
- output-device timing
- latency policy

In that model:

- render is the producer
- DeckLink is the timing consumer
- the queue between them becomes the main place to manage latency versus resilience

Why this is preferable:

- brief callback jitter is less likely to become a visible dropped frame
- render spikes can be absorbed by queue headroom instead of immediately missing output deadlines
- latency becomes an explicit policy choice rather than an incidental side effect of callback timing
- queue depth, underruns, stale-frame reuse, and catch-up behavior become measurable and tunable

Recommended direction:

- move toward a bounded producer/consumer playout queue
- make queue depth and target headroom runtime policy, not compile-time constants
- define explicit underrun behavior, for example:
  - reuse newest completed frame
  - reuse last scheduled frame
  - output black or degraded frame
- keep DeckLink callbacks limited to dequeue/schedule/accounting work wherever possible

### 7. Persistence should be more asynchronous and debounced

`SavePersistentState()` is still called directly from many update paths.

Relevant code:

- `RuntimeHost.cpp`

Recent OSC work already reduced this problem for live automation, but the broader architecture would still benefit from:

- a debounced persistence queue
- atomic write-behind snapshots
- clear separation between state mutation and disk flush

This improves both resilience and timing safety.

### 8. Telemetry is useful, but still too coarse

The app already records render timing and playout pacing, which is a good foundation.

Relevant code:

- [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:24)
- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:24)

Recommended direction: add lightweight tracing for:

- input callback latency
- input upload skip count
- render-thread request latency
- render queue depth
- render time
- pass build/compile latency
- readback time
- output scheduling lag
- output queue depth
- preroll depth versus spare-buffer depth
- preview present cost and skipped-preview count
- control queue depth
- runtime state lock contention

That would make future tuning and failure diagnosis much easier.
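Lightweight tracing of this kind does not need a framework to start: a block of relaxed atomic counters that hot paths bump and a status thread samples is usually enough. A sketch, with illustrative field names rather than the app's actual telemetry:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Lock-free counters that hot paths can bump without contending on a
// mutex; a status/reporting thread samples them periodically.
struct PlayoutTelemetry {
    std::atomic<std::uint64_t> inputUploadSkips{0};
    std::atomic<std::uint64_t> syncReadbackFallbacks{0};
    std::atomic<std::uint64_t> previewSkips{0};
    std::atomic<std::uint64_t> lateFrames{0};

    // Running max render time in microseconds, maintained with a CAS loop
    // so concurrent writers never lose a larger sample.
    std::atomic<std::uint64_t> maxRenderMicros{0};

    void RecordRenderMicros(std::uint64_t micros) {
        std::uint64_t seen = maxRenderMicros.load(std::memory_order_relaxed);
        while (micros > seen &&
               !maxRenderMicros.compare_exchange_weak(
                   seen, micros, std::memory_order_relaxed)) {
            // seen is refreshed by compare_exchange_weak on failure.
        }
    }
};
```

The point is cost: relaxed atomic increments are cheap enough to leave enabled on the render and callback paths permanently, which is what makes the counters trustworthy during an actual incident.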
Timing-specific observations from the current code:

- render time is captured as one total number in [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:24), but not split into draw, pack, readback wait, readback copy, or preview present
- frame pacing stats are recorded in [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:17), but there is no explicit visibility into how much queued playout headroom remains
- input uploads are intentionally skipped when the GL bridge is busy in [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:60), but the app does not currently surface how often that is happening

### 8a. Preview and playout are still too close together

The desktop preview is rate-limited, but still presented from inside the render pipeline path.

Relevant code:

- [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:54)
- [OpenGLComposite.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.cpp:235)

This means preview presentation can still consume time on the same path that is trying to meet output deadlines.

Recommended direction:

- treat preview as best-effort and entirely subordinate to playout
- move preview present to a separate presentation schedule fed from the latest completed render
- record preview skips and preview present cost independently from playout timing

### 8b. Readback is improved, but still not fully deadline-safe

The async readback path is a good step, but the miss path still falls back to synchronous `glReadPixels()` and then flushes the async pipeline.
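One way to make the "freshest possible frame" versus "never miss the output deadline" tradeoff explicit is a small pure policy function the pipeline consults when the readback fence is not yet signaled. This is a sketch of the policy split, not the app's current behavior, and the parameter names and threshold logic are illustrative assumptions:

```cpp
#include <cassert>
#include <cstdint>

enum class ReadbackAction {
    UseAsyncResult,   // fence signaled: take the fresh async readback
    ReuseStaleFrame,  // fence late, prior frame available: stay deadline-safe
    SyncFallback      // block on the GPU: last resort or affordable luxury
};

// Pure decision function: given fence state, remaining headroom before the
// output deadline, and whether a previously completed frame can be reused,
// choose the readback path. Being pure makes the policy unit-testable.
ReadbackAction ChooseReadback(bool fenceSignaled,
                              std::int64_t headroomMicros,
                              bool staleFrameAvailable,
                              std::int64_t syncCostEstimateMicros) {
    if (fenceSignaled) return ReadbackAction::UseAsyncResult;
    // A late fence only forces a blocking read when no completed frame can
    // be reused, or when there is comfortably enough headroom to block.
    if (staleFrameAvailable && headroomMicros < syncCostEstimateMicros)
        return ReadbackAction::ReuseStaleFrame;
    return ReadbackAction::SyncFallback;
}
```

The benefit over the current behavior described above is that the deadline-protecting branch (`ReuseStaleFrame`) exists at all; today a late fence lands on the synchronous path precisely when the app is already under pressure.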
Relevant code:

- [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:150)
- [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:228)

That means a single late GPU fence can push the app back onto the most timing-sensitive path exactly when it is already under pressure.

Recommended direction:

- increase readback instrumentation before changing policy again
- consider deeper readback buffering or a true stale-frame reuse policy instead of immediate synchronous fallback
- separate "freshest possible frame" policy from "never miss output deadline" policy and make that tradeoff explicit

### 8c. Background control and file-watch timing are partially event-driven

`ControlServices::PollLoop()` now uses a condition-variable wakeup for queued OSC commit work and a fallback timer for compatibility polling. That removes the old fixed `25 x Sleep(10)` cadence as the default OSC commit timing model, but file-watch/runtime-store refresh work still relies on a compatibility poll path.

Relevant code:

- [ControlServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/ControlServices.cpp:217)

That is acceptable as transitional non-critical background work. The Phase 2 bridge now publishes typed reload/file-change events when changes are detected; a later file-watch implementation can replace scanning as the source.
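The wakeup-or-fallback pattern described above is small enough to show in full. A sketch under illustrative names (`WorkLoop` is not the app's class): queued work is handled as soon as it is signaled, while a bounded `wait_for` keeps a compatibility poll alive for work that is not yet event-driven:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Hybrid loop: signaled work runs immediately; if nothing arrives within
// pollInterval, the caller falls back to its compatibility scan.
class WorkLoop {
public:
    void Enqueue(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

    // Runs one iteration: wait up to pollInterval for queued work, then
    // drain it. Returns true if woken by a signal, false if the fallback
    // timer fired (the caller would run its poll scan in that case).
    bool RunOnce(std::chrono::milliseconds pollInterval) {
        std::unique_lock<std::mutex> lock(mutex_);
        bool signaled = wake_.wait_for(lock, pollInterval,
                                       [this] { return !jobs_.empty(); });
        while (!jobs_.empty()) {
            auto job = std::move(jobs_.front());
            jobs_.pop();
            lock.unlock();  // never run callbacks while holding the lock
            job();
            lock.lock();
        }
        return signaled;
    }

private:
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
};
```

The predicate overload of `wait_for` also absorbs spurious wakeups, which a hand-rolled `Sleep` cadence never had to think about.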
Recommended direction:

- replace runtime-store scanning with true file-watch events when practical
- isolate truly background work from latency-sensitive control reconciliation
- add separate metrics for queue age, not just queue depth

## Phased Roadmap

This roadmap is ordered by architectural dependency rather than by “quick wins.” The goal is to move the app toward clearer ownership boundaries and safer live behavior without doing later work on top of foundations that are likely to change again.

### Phase 1. Define subsystem boundaries and target architecture

Before changing major internals, formalize the target responsibilities for each major part of the app.

Status:

- Design deliverable: complete.
- Runtime implementation foothold: complete.
- Target boundary extraction: not complete across the whole app; remaining work is tracked by later phases, especially the event model, render ownership, live-state layering, backend lifecycle, telemetry, and persistence work.

Target split:

- `RuntimeStore`
  - persisted config
  - persisted layer stack
  - preset persistence
- `RuntimeCoordinator`
  - mutation validation and classification
  - committed-live versus transient policy
  - snapshot and persistence requests
- `RuntimeSnapshotProvider`
  - render-facing immutable or near-immutable snapshots
  - parameter values prepared for the render path
- `ControlServices`
  - OSC ingress
  - web control ingress
  - reload/file-watch requests
  - commit/persist requests
- `RenderEngine`
  - sole owner of live GL rendering
  - sole consumer of render snapshots plus transient overlays
- `VideoBackend`
  - DeckLink input/output lifecycle
  - pacing and scheduling
- `HealthTelemetry`
  - logging
  - counters
  - timing traces
  - degraded-state reporting

Why this phase comes first:

- it prevents later refactors from reintroducing responsibility overlap
- it gives names to the seams the later phases will build around
- it reduces the risk of replacing one monolith with several poorly-defined ones

Suggested deliverables:
- a short architecture diagram
- a responsibility table for each subsystem
- a list of allowed dependencies between subsystems
- a dedicated Phase 1 design note:
  - [PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md)
- a subsystem design bundle index:
  - [docs/subsystems/README.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/subsystems/README.md)

Current implementation note: the repo now has concrete runtime classes, folders, read models, and subsystem tests for the Phase 1 names. These classes are the runtime foothold for later phases; app-wide extraction still continues around eventing, render ownership, backend lifecycle, persistence, and telemetry.

### Phase 2. Introduce an internal event model

Once subsystem boundaries are defined, introduce a typed event pipeline between them. This should happen before large state splits so the app has a stable coordination model.

Dedicated design note:

- [PHASE_2_INTERNAL_EVENT_MODEL_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_2_INTERNAL_EVENT_MODEL_DESIGN.md)

Example event families:

- control events
  - `OscParameterTargeted`
  - `UiParameterCommitted`
  - `TriggerFired`
- runtime events
  - `ShaderReloadRequested`
  - `PackagesRescanned`
  - `PersistStateRequested`
- render events
  - `OverlayApplied`
  - `OverlaySettled`
  - `SnapshotPublished`
- backend events
  - `InputSignalChanged`
  - `OutputLateFrameDetected`
  - `OutputDroppedFrameDetected`
- health events
  - `SubsystemWarningRaised`
  - `SubsystemRecovered`

Why this phase comes second:

- it provides a migration path away from direct cross-calls
- it makes ownership explicit before data structures are split apart
- it lets you move one subsystem at a time without losing coordination

Suggested outcome:

- the app stops relying on “shared object plus mutex plus polling” as the default coordination pattern

### Phase 3. Finish live-state and service-facing coordination

After the event model exists, finish separating live committed state and service-facing coordination from the runtime facades.

Dedicated design note:

- [PHASE_3_LIVE_STATE_SERVICE_COORDINATION_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_3_LIVE_STATE_SERVICE_COORDINATION_DESIGN.md)

Recommended split:

- `RuntimeStore`
  - owns config and saved layer data
  - handles serialization/deserialization
  - does not sit on the live render path
- `RuntimeCoordinator`
  - resolves control actions
  - validates mutations
  - publishes new snapshots
  - bridges events between services and render
- `RuntimeSnapshotProvider`
  - publishes immutable render snapshots
  - avoids large shared mutable structures on the render path

Why this phase comes before render-thread isolation:

- render isolation is easier when the render thread consumes clean snapshots instead of a large mutable host object
- otherwise the GL refactor still drags along too much shared state complexity

Primary design rule:

- render should read snapshots
- persistence should write stored state
- services should request mutations through the coordinator

### Phase 4. Make the render thread the sole GL owner

With state and coordination cleaner, move to a dedicated render-thread model.
Dedicated design note:

- [PHASE_4_RENDER_THREAD_OWNERSHIP_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_4_RENDER_THREAD_OWNERSHIP_DESIGN.md)

Status:

- complete for GL ownership
- remaining playout-headroom work is tracked under Phase 7/backend lifecycle

Target behavior:

- one thread owns the GL context
- input callbacks never perform GL work directly
- output callbacks never perform GL work directly
- preview presentation, texture upload, render passes, readback, and output pack work are all issued by the render thread

Other threads should only:

- enqueue new video frames
- enqueue control updates
- enqueue backend events
- consume produced output buffers

Why this phase comes here:

- it is much safer once state access and control coordination are no longer centered on one shared runtime object
- it avoids coupling the render-thread refactor to storage and service refactors at the same time

Expected benefits:

- less cross-thread GL contention
- easier timing reasoning
- much lower risk of callback-driven stalls
- a clearer foundation for future GPU pipeline work

### Phase 5. Refactor live state layering into an explicit composition model

Once rendering and snapshots are isolated, formalize how final parameter values are derived.
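The composition rule this phase formalizes can be a single pure resolve step. The additive rule below is one plausible interpretation of `final = base + committed + transient` for scalar parameters — the types and names are illustrative, not the app's code:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using ParamMap = std::unordered_map<std::string, float>;

// final = base + committed + transient, resolved per parameter.
// A layer that does not mention a parameter contributes nothing, so
// clearing the transient overlay falls out of the rule naturally
// instead of needing a special-case sync path.
float ResolveParameter(const std::string& name,
                       const ParamMap& base,
                       const ParamMap& committed,
                       const ParamMap& transientOverlay) {
    auto lookup = [&](const ParamMap& layer) {
        auto it = layer.find(name);
        return it != layer.end() ? it->second : 0.0f;
    };
    return lookup(base) + lookup(committed) + lookup(transientOverlay);
}
```

Whether the composition is additive, multiplicative, or last-writer-wins per parameter is a design decision for the phase; the important property is that there is exactly one resolve function, and every consumer goes through it.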
Dedicated design note:

- [PHASE_5_LIVE_STATE_LAYERING_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_5_LIVE_STATE_LAYERING_DESIGN.md)

Recommended layers:

- base persisted state
- operator-committed live state
- transient automation overlay

Render should derive final values from a clear composition rule such as:

- `final = base + committed + transient`

Why this phase follows render isolation:

- once render owns snapshot consumption, it becomes much easier to cleanly evaluate layered state without touching persistence or control services
- it turns the current OSC overlay behavior into a first-class model instead of an implementation detail

Expected benefits:

- fewer one-off sync rules
- clearer behavior for OSC, UI changes, and automation
- easier future expansion to presets, cues, or timed transitions

### Phase 6. Move persistence onto a background snapshot writer

After the state model is explicit, persistence should become a background concern rather than a synchronous side effect of mutations.

Dedicated design note:

- [PHASE_6_BACKGROUND_PERSISTENCE_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_6_BACKGROUND_PERSISTENCE_DESIGN.md)

Target behavior:

- mutations update authoritative in-memory stored state
- persistence requests are queued
- disk writes are debounced and coalesced
- writes are atomic and versioned where practical

Why this phase comes after state splitting:

- otherwise persistence logic will need to be rewritten twice
- it should operate on the new `RuntimeStore` model, not on a mixed-responsibility runtime object

Expected benefits:

- less timing interference
- better corruption resistance
- cleaner restart/recovery semantics

### Phase 7. Make DeckLink/backend lifecycle explicit with a state machine

Once the render and state layers are cleaner, refactor the video backend into an explicit lifecycle model.
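The heart of an explicit lifecycle model is a transition check that makes illegal jumps fail loudly instead of silently. A sketch using the state names suggested in this phase — the allowed-transition table itself is illustrative and would be settled in the design note:

```cpp
#include <cassert>

// Lifecycle states for the video backend, as suggested in this phase.
enum class BackendState {
    Uninitialized, DevicesDiscovered, Configured, Prerolling,
    Running, Degraded, Stopping, Stopped, Failed
};

// Illustrative transition table; the real rules belong in the design note.
bool IsAllowedTransition(BackendState from, BackendState to) {
    // Failure is reachable from anywhere; shutdown from anything not
    // already stopped.
    if (to == BackendState::Failed) return true;
    if (to == BackendState::Stopping) return from != BackendState::Stopped;

    switch (from) {
        case BackendState::Uninitialized:
            return to == BackendState::DevicesDiscovered;
        case BackendState::DevicesDiscovered:
            return to == BackendState::Configured;
        case BackendState::Configured:
            return to == BackendState::Prerolling;
        case BackendState::Prerolling:
            return to == BackendState::Running;
        case BackendState::Running:
            return to == BackendState::Degraded;
        case BackendState::Degraded:
            return to == BackendState::Running;  // centralized recovery path
        case BackendState::Stopping:
            return to == BackendState::Stopped;
        default:
            return false;
    }
}
```

A session object would call this before every state change and treat a rejected transition as a bug, which is what makes shutdown ordering and degraded-mode behavior predictable rather than emergent.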
Dedicated design note:

- [PHASE_7_BACKEND_LIFECYCLE_PLAYOUT_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_7_BACKEND_LIFECYCLE_PLAYOUT_DESIGN.md)

Suggested states:

- uninitialized
- devices-discovered
- configured
- prerolling
- running
- degraded
- stopping
- stopped
- failed

Why this phase belongs here:

- the backend should integrate with the new event model
- degraded/recovery behavior will be easier once rendering and state coordination are already more deterministic

Expected benefits:

- safer startup/shutdown ordering
- clearer recovery behavior
- easier handling of missing input, dropped frames, or reconfiguration
- a clearer place to own playout headroom policy, output queue sizing, and late-frame recovery behavior

### Phase 8. Add structured health, telemetry, and operational reporting

This phase should happen after the main ownership changes so the telemetry can reflect the final architecture instead of a transient one.

Dedicated design note:

- [PHASE_8_HEALTH_TELEMETRY_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_8_HEALTH_TELEMETRY_DESIGN.md)

Recommended coverage:

- render queue depth
- input callback latency
- input upload skip count
- output scheduling lag
- output queue depth and spare-buffer depth
- readback timing
- readback fence wait timing
- synchronous readback fallback count
- preview present timing and skipped-preview count
- snapshot publish frequency
- persistence queue depth
- event queue depth
- backend state transitions
- warning/error counters per subsystem

Also replace modal-only error handling with:

- structured in-app health state
- severity-based logging
- rolling log files
- operator-visible degraded-state messages

Why this phase comes last:

- it should instrument the architecture you intend to keep
- otherwise instrumentation work gets invalidated by the refactor

## Recommended Execution Order

If this is approached as a serious architecture program rather than opportunistic cleanup, the recommended order is:

1. Define subsystem boundaries and target architecture.
2. Introduce the internal event model.
3. Finish runtime live-state/service coordination.
4. Make the render thread the sole GL owner.
5. Formalize live state layering and composition.
6. Move persistence to a background snapshot writer.
7. Refactor DeckLink/backend lifecycle into an explicit state machine.
8. Add structured telemetry, health reporting, and operational diagnostics.

## Why This Order Makes Sense

This order tries to avoid doing foundational work twice.

- The event model comes before major subsystem extraction so coordination patterns stabilize early.
- Runtime state ownership is split before render isolation so the render thread does not inherit a monolithic state model.
- Live state layering is formalized only after render ownership is clearer.
- Persistence is moved later so it can target the final state model rather than the current one.
- Telemetry is intentionally late so it instruments the architecture that survives the refactor.

## Short Version

The app is in a much better place than it was before the OSC timing work. The shared-GL ownership risk has now been addressed by Phase 4; the main remaining live-resilience risk is output playout headroom, because DeckLink callbacks still synchronously request render-thread output production.

The most sensible path forward is:

1. define boundaries
2. establish an event model
3. split state ownership
4. isolate rendering
5. formalize layered live state
6. background persistence
7. explicit backend lifecycle
8. health and telemetry

That sequence gives each later phase a cleaner foundation than the current app has today.