video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md
Aiden 7777cfc194
2026-05-10 20:39:28 +10:00


# Architecture Resilience Review
This note summarizes the main architectural improvements that would make the app more resilient during live use, especially around timing isolation, failure isolation, and recoverability.
## Key Findings
### 1. `RuntimeHost` is carrying too many responsibilities
`RuntimeHost` currently acts as:
- config store
- persistent state store
- live parameter/state authority
- shader package registry owner
- status/telemetry sink
- control mutation entrypoint
That makes it a single contention and failure domain. It is also why OSC and render timing issues repeatedly surfaced around shared state access.
Relevant code:
- [RuntimeHost.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/runtime/RuntimeHost.h:15)
Recommended direction:
- split persisted config/state from live render-facing state
- separate status/telemetry updates from control mutation paths
- make render consume snapshots rather than sharing a large mutable authority object
### 2. OpenGL ownership is still centralized behind one shared lock
Even after recent timing improvements, preview, input upload, and playout rendering still rely on one shared GL context protected by one `CRITICAL_SECTION`.
Relevant code:
- [OpenGLComposite.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.h:93)
- [OpenGLComposite.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.cpp:253)
- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:70)
This is still a central choke point and limits timing isolation.
Recommended direction:
- use one dedicated render thread as the sole GL owner
- have input/output/control threads queue work instead of performing GL work directly
- remove ad hoc GL use from callback threads
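
The "queue work instead of performing GL work directly" direction can be sketched as a small command queue. This is a minimal illustration, not code from the repository: `RenderCommandQueue`, `post`, and `drain` are hypothetical names, and the sketch assumes the render thread calls `drain()` once per frame while it holds the GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the current codebase.
// Producer threads call post(); only the GL-owning render thread calls drain().
class RenderCommandQueue {
public:
    using Command = std::function<void()>;

    // Safe to call from any thread (input callback, output callback, control).
    void post(Command command) {
        assert(command && "commands must be callable");
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(command));
    }

    // Runs every queued command on the calling thread and returns how many ran.
    // Commands execute outside the lock, so producers never wait on GL work.
    std::size_t drain() {
        std::vector<Command> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        for (auto& command : batch) {
            command();
        }
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::vector<Command> pending_;
};
```

With this shape, a callback thread would post a closure that captures the frame to upload, and the lock is held only long enough to swap the pending vector.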
### 3. Control flow is spread across polling and shared-memory patterns
`RuntimeServices` currently mixes:
- file polling
- deferred OSC commit handling
- control service orchestration
OSC ingest, overlay application, and host sync are distributed across several components.
Relevant code:
- [RuntimeServices.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.h:26)
- [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:178)
Recommended direction:
- introduce a small internal event pipeline or message bus
- use typed events for OSC, reloads, persistence requests, and status changes
- make timing ownership explicit per subsystem
Example event types:
- `OscParameterTargeted`
- `RenderOverlaySettled`
- `PersistStateRequested`
- `ShaderReloadRequested`
- `DeckLinkStatusChanged`
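
One way to make these events typed is a `std::variant` over small structs. The sketch below reuses the event names listed above; the payload fields and the routing labels returned by `routeEvent` are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Payload fields are assumed; only the type names come from the list above.
struct OscParameterTargeted { std::string parameterId; float value; };
struct RenderOverlaySettled { std::string parameterId; };
struct PersistStateRequested {};
struct ShaderReloadRequested { std::string packageName; };
struct DeckLinkStatusChanged { bool signalPresent; };

using AppEvent = std::variant<OscParameterTargeted, RenderOverlaySettled,
                              PersistStateRequested, ShaderReloadRequested,
                              DeckLinkStatusChanged>;

// Standard C++17 helper that merges several lambdas into one visitor.
template <class... Ts> struct Overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> Overloaded(Ts...) -> Overloaded<Ts...>;

// Returns a label for the subsystem that would handle the event (hypothetical routing).
inline std::string routeEvent(const AppEvent& event) {
    return std::visit(Overloaded{
        [](const OscParameterTargeted&) { return std::string("render"); },
        [](const RenderOverlaySettled&) { return std::string("control"); },
        [](const PersistStateRequested&) { return std::string("persistence"); },
        [](const ShaderReloadRequested&) { return std::string("runtime"); },
        [](const DeckLinkStatusChanged&) { return std::string("health"); }
    }, event);
}
```

Because `std::visit` requires a handler for every alternative, adding a new event type without deciding who handles it becomes a compile error rather than a silent gap.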
### 4. Error handling is still heavily UI-coupled
Failures are often surfaced via `MessageBoxA`, while background services mainly log with `OutputDebugStringA`.
Relevant code:
- [OpenGLComposite.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.cpp:314)
- [DeckLinkSession.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.cpp:478)
- [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:205)
Neither behavior suits a live system: a modal dialog blocks the thread that raised it until someone dismisses it, and `OutputDebugStringA` output is only visible while a debugger or debug-output viewer is attached.
Recommended direction:
- introduce structured in-app error reporting
- define severity levels and counters
- prefer degraded runtime states over modal failure handling where possible
- add a rolling log file for operational troubleshooting
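
Severity levels, counters, and a non-modal record of recent failures could look roughly like the sketch below. `HealthReporter`, `HealthRecord`, and `Severity` are hypothetical names; the in-memory history stands in for whatever UI panel or rolling log file eventually consumes the records.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

enum class Severity { Info = 0, Warning = 1, Error = 2 };

struct HealthRecord {
    Severity severity;
    std::string subsystem;
    std::string message;
};

// Hypothetical reporter: stores records and per-severity counters
// instead of opening a modal dialog.
class HealthReporter {
public:
    explicit HealthReporter(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void report(Severity severity, std::string subsystem, std::string message) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++counts_[static_cast<std::size_t>(severity)];
        if (recent_.size() == capacity_) {
            recent_.erase(recent_.begin());  // drop the oldest record
        }
        recent_.push_back({severity, std::move(subsystem), std::move(message)});
    }

    std::size_t count(Severity severity) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return counts_[static_cast<std::size_t>(severity)];
    }

    // True once any error has been reported; a UI could show a degraded banner.
    bool degraded() const { return count(Severity::Error) > 0; }

    std::vector<HealthRecord> recent() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return recent_;
    }

private:
    mutable std::mutex mutex_;
    std::size_t capacity_;
    std::array<std::size_t, 3> counts_{};
    std::vector<HealthRecord> recent_;
};
```

A call site that currently shows a dialog would instead call `report(Severity::Error, ...)` and continue in a degraded state where that is safe.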
### 5. Live OSC overlay and persisted state are still separate concepts without a formal model
The current design works better now, but it still relies on hand-managed reconciliation between:
- persisted parameter state in `RuntimeHost`
- transient OSC overlay state in `OpenGLComposite`
Relevant code:
- [OpenGLComposite.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.h:66)
Recommended direction:
Formalize three layers of state:
- base persisted state
- operator/UI committed state
- transient live automation overlay
Then render can always resolve:
- `final = base + committed + transient`
That avoids special-case sync behavior becoming scattered across the code.
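
As a rough sketch of that resolution rule, assuming the `+` means "a later layer overrides an earlier one" rather than numeric addition, and assuming parameters are flat name/float pairs (both are assumptions, as is the name `resolveFinal`):

```cpp
#include <cassert>
#include <map>
#include <string>

using ParameterValues = std::map<std::string, float>;

// Assumption: layers are applied in order, and a later layer replaces any
// value an earlier layer set for the same parameter name.
inline ParameterValues resolveFinal(const ParameterValues& base,
                                    const ParameterValues& committed,
                                    const ParameterValues& transient) {
    ParameterValues result = base;
    for (const auto& [name, value] : committed) {
        result[name] = value;  // operator/UI commits override persisted values
    }
    for (const auto& [name, value] : transient) {
        result[name] = value;  // live automation overrides both lower layers
    }
    return result;
}
```

Clearing the transient layer then restores the committed value automatically, with no separate sync step.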
### 6. DeckLink lifecycle could be modeled more explicitly
`DeckLinkSession` has a number of imperative calls, but startup, preroll, running, degraded, and stopped are not represented as an explicit state machine.
Relevant code:
- [DeckLinkSession.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/videoio/decklink/DeckLinkSession.h:17)
Recommended direction:
- introduce explicit session states
- define allowed transitions
- centralize recovery behavior
- make shutdown ordering and degraded-mode behavior more predictable
### 7. Persistence should be more asynchronous and debounced
`SavePersistentState()` is still called directly from many update paths.
Relevant code:
- [RuntimeHost.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/runtime/RuntimeHost.cpp:1841)
Recent OSC work already reduced this problem for live automation, but the broader architecture would still benefit from:
- a debounced persistence queue
- atomic write-behind snapshots
- clear separation between state mutation and disk flush
This improves both resilience and timing safety.
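
The debouncing part can be sketched independently of the actual disk write. `PersistDebouncer` is a hypothetical helper: mutation paths would call `markDirty()`, and a background writer would poll `shouldFlush()`. Time is passed in explicitly so the logic is easy to test, and the sketch assumes a single owning thread, so it has no locking.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Hypothetical trailing-edge debouncer for persistence requests.
class PersistDebouncer {
public:
    using Clock = std::chrono::steady_clock;

    explicit PersistDebouncer(Clock::duration quietPeriod)
        : quietPeriod_(quietPeriod) {}

    // Records a state change; repeated calls push the flush deadline back.
    void markDirty(Clock::time_point now) { lastDirty_ = now; }

    // Returns true once after the state has been quiet for the whole period,
    // then stays false until markDirty() is called again.
    bool shouldFlush(Clock::time_point now) {
        if (!lastDirty_ || now - *lastDirty_ < quietPeriod_) {
            return false;
        }
        lastDirty_.reset();
        return true;
    }

private:
    Clock::duration quietPeriod_;
    std::optional<Clock::time_point> lastDirty_;
};
```

A pure trailing debounce never flushes under continuous mutation, so a real implementation would probably also want a maximum delay.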
### 8. Telemetry is useful, but still too coarse
The app already records render timing and playout pacing, which is a good foundation.
Relevant code:
- [OpenGLRenderPipeline.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLRenderPipeline.cpp:24)
- [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:24)
Recommended direction:
Add lightweight tracing for:
- input callback latency
- GL lock wait time
- render time
- readback time
- output scheduling lag
- control queue depth
- `RuntimeHost` lock contention
That would make future tuning and failure diagnosis much easier.
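
A lightweight way to collect such timings is an RAII scope timer feeding a small accumulator. `TimingStat` and `ScopedTrace` are hypothetical names, and the sketch assumes each stat is written by one thread only (for example, one stat per traced path).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical accumulator for one traced metric, e.g. GL lock wait time.
struct TimingStat {
    std::uint64_t samples = 0;
    std::chrono::nanoseconds total{0};
    std::chrono::nanoseconds worst{0};

    void record(std::chrono::nanoseconds elapsed) {
        ++samples;
        total += elapsed;
        if (elapsed > worst) {
            worst = elapsed;
        }
    }

    std::chrono::nanoseconds mean() const {
        if (samples == 0) {
            return std::chrono::nanoseconds{0};
        }
        return total / static_cast<std::chrono::nanoseconds::rep>(samples);
    }
};

// Measures from construction to destruction and records into a TimingStat.
class ScopedTrace {
public:
    explicit ScopedTrace(TimingStat& stat)
        : stat_(stat), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTrace() {
        stat_.record(std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start_));
    }

    ScopedTrace(const ScopedTrace&) = delete;
    ScopedTrace& operator=(const ScopedTrace&) = delete;

private:
    TimingStat& stat_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping only the lock acquisition in a `ScopedTrace` would measure lock wait separately from the work done while holding the lock.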
## Phased Roadmap
This roadmap is ordered by architectural dependency rather than by “quick wins.” The goal is to move the app toward clearer ownership boundaries and safer live behavior without doing later work on top of foundations that are likely to change again.
### Phase 1. Define subsystem boundaries and target architecture
Before changing major internals, formalize the target responsibilities for each major part of the app.
Target split:
- `RuntimeStore`
  - persisted config
  - persisted layer stack
  - preset persistence
- `RuntimeSnapshot`
  - render-facing immutable or near-immutable snapshots
  - parameter values prepared for the render path
- `ControlServices`
  - OSC ingress
  - web control ingress
  - reload/file-watch requests
  - commit/persist requests
- `RenderEngine`
  - sole owner of live GL rendering
  - sole consumer of render snapshots plus transient overlays
- `VideoBackend`
  - DeckLink input/output lifecycle
  - pacing and scheduling
- `Health/Telemetry`
  - logging
  - counters
  - timing traces
  - degraded-state reporting
Why this phase comes first:
- it prevents later refactors from reintroducing responsibility overlap
- it gives names to the seams the later phases will build around
- it reduces the risk of replacing one monolith with several poorly-defined ones
Suggested deliverables:
- a short architecture diagram
- a responsibility table for each subsystem
- a list of allowed dependencies between subsystems
### Phase 2. Introduce an internal event model
Once subsystem boundaries are defined, introduce a typed event pipeline between them. This should happen before large state splits so the app has a stable coordination model.
Example event families:
- control events
  - `OscParameterTargeted`
  - `UiParameterCommitted`
  - `TriggerFired`
- runtime events
  - `ShaderReloadRequested`
  - `PackagesRescanned`
  - `PersistStateRequested`
- render events
  - `OverlayApplied`
  - `OverlaySettled`
  - `SnapshotPublished`
- backend events
  - `InputSignalChanged`
  - `OutputLateFrameDetected`
  - `OutputDroppedFrameDetected`
- health events
  - `SubsystemWarningRaised`
  - `SubsystemRecovered`
Why this phase comes second:
- it provides a migration path away from direct cross-calls
- it makes ownership explicit before data structures are split apart
- it lets you move one subsystem at a time without losing coordination
Suggested outcome:
- the app stops relying on “shared object plus mutex plus polling” as the default coordination pattern
### Phase 3. Split `RuntimeHost` into persistent state, render snapshot state, and service-facing coordination
After the event model exists, break apart `RuntimeHost`.
Recommended split:
- `RuntimeStore`
  - owns config and saved layer data
  - handles serialization/deserialization
  - does not sit on the live render path
- `RuntimeCoordinator`
  - resolves control actions
  - validates mutations
  - publishes new snapshots
  - bridges events between services and render
- `RuntimeSnapshotProvider`
  - publishes immutable render snapshots
  - avoids large shared mutable structures on the render path
Why this phase comes before render-thread isolation:
- render isolation is easier when the render thread consumes clean snapshots instead of a large mutable host object
- otherwise the GL refactor still drags along too much shared state complexity
Primary design rules:
- render should read snapshots
- persistence should write stored state
- services should request mutations through the coordinator
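
The "render should read snapshots" rule can be sketched with a shared pointer to an immutable value. `RenderSnapshot`, `SnapshotProvider`, and their fields are placeholders rather than the project's real types; the point is only that the render side holds a pointer to data nobody mutates after publication.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical render-facing snapshot; the fields are placeholders.
struct RenderSnapshot {
    unsigned long long version = 0;
    std::map<std::string, float> parameters;
};

class SnapshotProvider {
public:
    using Ptr = std::shared_ptr<const RenderSnapshot>;

    SnapshotProvider() : current_(std::make_shared<RenderSnapshot>()) {}

    // Coordinator side: build the next snapshot off to the side, then swap it in.
    void publish(std::map<std::string, float> parameters) {
        auto next = std::make_shared<RenderSnapshot>();
        next->parameters = std::move(parameters);
        std::lock_guard<std::mutex> lock(mutex_);
        next->version = current_->version + 1;
        current_ = std::move(next);
    }

    // Render side: copy the pointer under a short lock; the snapshot it
    // refers to is never modified after publication.
    Ptr acquire() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    Ptr current_;
};
```

A frame that acquired a snapshot before a publish keeps a consistent view until it finishes, and the lock is only held for a pointer copy or swap.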
### Phase 4. Make the render thread the sole GL owner
With state and coordination cleaner, move to a dedicated render-thread model.
Target behavior:
- one thread owns the GL context
- input callbacks never perform GL work directly
- output callbacks never perform GL work directly
- preview presentation, texture upload, render passes, readback, and output pack work are all issued by the render thread
Other threads should only:
- enqueue new video frames
- enqueue control updates
- enqueue backend events
- consume produced output buffers
Why this phase comes here:
- it is much safer once state access and control coordination are no longer centered on `RuntimeHost`
- it avoids coupling the render-thread refactor to storage and service refactors at the same time
Expected benefits:
- less cross-thread GL contention
- easier timing reasoning
- much lower risk of callback-driven stalls
- a clearer foundation for future GPU pipeline work
### Phase 5. Refactor live state layering into an explicit composition model
Once rendering and snapshots are isolated, formalize how final parameter values are derived.
Recommended layers:
- base persisted state
- operator-committed live state
- transient automation overlay
Render should derive final values from a clear composition rule such as:
- `final = base + committed + transient`
Why this phase follows render isolation:
- once render owns snapshot consumption, it becomes much easier to cleanly evaluate layered state without touching persistence or control services
- it turns the current OSC overlay behavior into a first-class model instead of an implementation detail
Expected benefits:
- fewer one-off sync rules
- clearer behavior for OSC, UI changes, and automation
- easier future expansion to presets, cues, or timed transitions
### Phase 6. Move persistence onto a background snapshot writer
After the state model is explicit, persistence should become a background concern rather than a synchronous side effect of mutations.
Target behavior:
- mutations update authoritative in-memory stored state
- persistence requests are queued
- disk writes are debounced and coalesced
- writes are atomic and versioned where practical
Why this phase comes after state splitting:
- otherwise persistence logic will need to be rewritten twice
- it should operate on the new `RuntimeStore` model, not on the current mixed-responsibility object
Expected benefits:
- less timing interference
- better corruption resistance
- cleaner restart/recovery semantics
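
For the "atomic where practical" part, a common approach is to write a temporary file and rename it over the target. The helper below is a sketch with a hypothetical name (`writeFileViaRename`). It does not force data to disk before the rename, and whether the rename is truly atomic depends on the platform and filesystem, so it should be read as an approximation of the idea.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Hypothetical helper: writes a sibling ".tmp" file, then renames it over the target.
// Returns false if any step fails; the target is only touched by the final rename.
inline bool writeFileViaRename(const std::filesystem::path& target,
                               const std::string& contents) {
    std::filesystem::path temp = target;
    temp += ".tmp";
    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        if (!out) {
            return false;
        }
        out << contents;
        out.flush();
        if (!out) {
            return false;
        }
    }  // stream is closed here, before the rename
    std::error_code ec;
    std::filesystem::rename(temp, target, ec);
    if (ec) {
        std::error_code ignored;
        std::filesystem::remove(temp, ignored);  // best-effort cleanup
        return false;
    }
    return true;
}
```

A background writer would call this with a serialized snapshot, so a crash mid-write leaves at worst a stale `.tmp` file next to an intact previous state file.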
### Phase 7. Make DeckLink/backend lifecycle explicit with a state machine
Once the render and state layers are cleaner, refactor the video backend into an explicit lifecycle model.
Suggested states:
- uninitialized
- devices-discovered
- configured
- prerolling
- running
- degraded
- stopping
- stopped
- failed
Why this phase belongs here:
- the backend should integrate with the new event model
- degraded/recovery behavior will be easier once rendering and state coordination are already more deterministic
Expected benefits:
- safer startup/shutdown ordering
- clearer recovery behavior
- easier handling of missing input, dropped frames, or reconfiguration
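
The suggested states map naturally onto an enum plus a transition check. The state names below follow the list above, but the allowed edges are an assumption (the review does not specify them), and `SessionStateMachine` is a hypothetical class rather than an existing part of `DeckLinkSession`.

```cpp
#include <cassert>

enum class SessionState {
    Uninitialized, DevicesDiscovered, Configured, Prerolling,
    Running, Degraded, Stopping, Stopped, Failed
};

// Assumed transition table: the edges are illustrative, not taken from the codebase.
inline bool canTransition(SessionState from, SessionState to) {
    using S = SessionState;
    if (to == S::Failed) {
        return from != S::Failed;  // any live state may fail
    }
    switch (from) {
        case S::Uninitialized:     return to == S::DevicesDiscovered;
        case S::DevicesDiscovered: return to == S::Configured;
        case S::Configured:        return to == S::Prerolling || to == S::Stopped;
        case S::Prerolling:        return to == S::Running || to == S::Stopping;
        case S::Running:           return to == S::Degraded || to == S::Stopping;
        case S::Degraded:          return to == S::Running || to == S::Stopping;
        case S::Stopping:          return to == S::Stopped;
        case S::Stopped:           return to == S::Configured;     // reconfigure
        case S::Failed:            return to == S::Uninitialized;  // full restart
    }
    return false;
}

class SessionStateMachine {
public:
    SessionState state() const { return state_; }

    // Applies the transition if allowed; otherwise leaves the state unchanged.
    bool transitionTo(SessionState next) {
        if (!canTransition(state_, next)) {
            return false;
        }
        state_ = next;
        return true;
    }

private:
    SessionState state_ = SessionState::Uninitialized;
};
```

Routing every lifecycle call through `transitionTo` gives one place to log transitions, emit backend events, and reject out-of-order calls such as starting playout before preroll.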
### Phase 8. Add structured health, telemetry, and operational reporting
This phase should happen after the main ownership changes so the telemetry can reflect the final architecture instead of a transient one.
Recommended coverage:
- render queue depth
- GL lock wait time, if any shared lock remains
- input callback latency
- output scheduling lag
- readback timing
- snapshot publish frequency
- persistence queue depth
- event queue depth
- backend state transitions
- warning/error counters per subsystem
Also replace modal-only error handling with:
- structured in-app health state
- severity-based logging
- rolling log files
- operator-visible degraded-state messages
Why this phase comes last:
- it should instrument the architecture you intend to keep
- otherwise instrumentation work gets invalidated by the refactor
## Recommended Execution Order
If this is approached as a serious architecture program rather than opportunistic cleanup, the recommended order is:
1. Define subsystem boundaries and target architecture.
2. Introduce the internal event model.
3. Split `RuntimeHost`.
4. Make the render thread the sole GL owner.
5. Formalize live state layering and composition.
6. Move persistence to a background snapshot writer.
7. Refactor DeckLink/backend lifecycle into an explicit state machine.
8. Add structured telemetry, health reporting, and operational diagnostics.
## Why This Order Makes Sense
This order tries to avoid doing foundational work twice.
- The event model comes before major subsystem extraction so coordination patterns stabilize early.
- `RuntimeHost` is split before render isolation so the render thread does not inherit the current monolithic state model.
- Live state layering is formalized only after render ownership is clearer.
- Persistence is moved later so it can target the final state model rather than the current one.
- Telemetry is intentionally late so it instruments the architecture that survives the refactor.
## Short Version
The app is in a much better place than it was before the OSC timing work, but the main remaining architectural risk is still shared ownership. Too many responsibilities converge on `RuntimeHost` and the shared GL path. The most sensible path forward is:
1. define boundaries
2. establish an event model
3. split state ownership
4. isolate rendering
5. formalize layered live state
6. move persistence to the background
7. make the backend lifecycle explicit
8. add health and telemetry
That sequence gives each later phase a cleaner foundation than the current app has today.