diff --git a/docs/ARCHITECTURE_RESILIENCE_REVIEW.md b/docs/ARCHITECTURE_RESILIENCE_REVIEW.md index 8322e8a..d7f7c00 100644 --- a/docs/ARCHITECTURE_RESILIENCE_REVIEW.md +++ b/docs/ARCHITECTURE_RESILIENCE_REVIEW.md @@ -393,6 +393,8 @@ Suggested deliverables: - a short architecture diagram - a responsibility table for each subsystem - a list of allowed dependencies between subsystems +- a dedicated Phase 1 design note: + - [PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md) ### Phase 2. Introduce an internal event model diff --git a/docs/PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md b/docs/PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md new file mode 100644 index 0000000..a58284f --- /dev/null +++ b/docs/PHASE_1_SUBSYSTEM_BOUNDARIES_DESIGN.md @@ -0,0 +1,609 @@ +# Phase 1 Design: Subsystem Boundaries and Target Architecture + +This document expands Phase 1 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete target design. Its purpose is to define the long-term subsystem split before later phases introduce a full event model, split `RuntimeHost`, and move rendering onto a sole-owner render thread. + +The main goal of Phase 1 is not to immediately rewrite the app. It is to establish clear ownership boundaries so later refactors all move toward the same architecture instead of solving local problems in conflicting ways. + +## Why Phase 1 Exists + +Today the app works, but too many responsibilities still converge in a few places: + +- `RuntimeHost` owns persistence, live layer state, shader package access, status reporting, and mutation entrypoints. +- `OpenGLComposite` coordinates runtime setup, render state retrieval, shader rebuild handling, transient OSC overlay behavior, and video backend integration. +- DeckLink callback-driven playout still reaches directly into render-facing work. +- Background services rely on polling and shared mutable state more than explicit subsystem contracts. + +Those are exactly the kinds of overlaps that make timing issues, state regressions, and recovery edge cases harder to solve cleanly. + +Phase 1 creates a map for where each responsibility should eventually live. + +## Design Goals + +The target architecture should optimize for: + +- live timing isolation +- explicit state ownership +- predictable recovery behavior +- clear boundaries between persistent state and transient live state +- easier testing of non-GL and non-hardware logic +- fewer cross-thread shared mutable objects +- a playout model that can evolve toward producer/consumer scheduling + +## Non-Goals + +Phase 1 does not itself require: + +- replacing every direct call with events immediately +- moving all rendering to a new thread yet +- redesigning the shader contract again +- changing DeckLink behavior in place +- removing all existing classes before replacements exist + +This phase is the target design and the dependency rules. Later phases perform the actual extraction. + +## Current Pressure Points + +The following current code paths are the strongest evidence for the split proposed here: + +- `RuntimeHost` is both store and live authority: + - [RuntimeHost.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/runtime/RuntimeHost.h:15) + - [RuntimeHost.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/runtime/RuntimeHost.cpp:726) +- `OpenGLComposite` is both app orchestrator and render/runtime coordinator: + - [OpenGLComposite.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.cpp:106) + - [OpenGLComposite.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/OpenGLComposite.cpp:283) +- `RuntimeServices` mixes service orchestration with polling and deferred state work: + - [RuntimeServices.h](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.h:46) + - [RuntimeServices.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/control/RuntimeServices.cpp:194) +- Playout is still callback-coupled to render-facing work: + - [OpenGLVideoIOBridge.cpp](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/apps/LoopThroughWithOpenGLCompositing/gl/pipeline/OpenGLVideoIOBridge.cpp:68) + +## Target Subsystems + +The long-term architecture should converge on six primary subsystems: + +1. `RuntimeStore` +2. `RuntimeCoordinator` +3. `RuntimeSnapshotProvider` +4. `ControlServices` +5. `RenderEngine` +6. `VideoBackend` +7. `HealthTelemetry` + +The split below is intentionally sharper than the current code. The point is to make ownership obvious. + +## Subsystem Responsibilities + +### `RuntimeStore` + +`RuntimeStore` owns persisted and operator-authored state. + +It is the source of truth for: + +- runtime config loaded from disk +- persisted layer stack structure +- persisted parameter values +- stack preset serialization/deserialization +- shader/package metadata that must survive across renders + +It should not be responsible for: + +- render-thread timing +- GL resource lifetime +- live transient overlays +- hardware callback coordination +- UI/websocket broadcasting policy + +Design rules: + +- disk I/O belongs here or in its dedicated writer helper +- values here are authoritative for saved state +- writes may be debounced later, but the data model itself belongs here + +### `RuntimeCoordinator` + +`RuntimeCoordinator` is the mutation and policy layer. + +It is responsible for: + +- receiving valid mutation requests from controls, services, or automation +- validating requested changes against shader definitions and config rules +- resolving how persisted state, committed live state, and transient overlays should interact +- requesting snapshot publication when state changes affect render +- requesting persistence when stored state changes + +It should not be responsible for: + +- direct disk serialization details +- direct GL work +- hardware device lifecycle +- polling loops + +Design rules: + +- all non-render mutations should eventually flow through this layer +- this layer decides whether a change is persisted, transient, or both +- this layer owns state policy, not device policy + +### `RuntimeSnapshotProvider` + +`RuntimeSnapshotProvider` publishes render-facing snapshots. + +It is responsible for: + +- building immutable or near-immutable render snapshots +- translating runtime state into render-ready structures +- publishing versioned snapshots +- serving the render side without large mutable shared locks + +It should not be responsible for: + +- deciding whether a mutation is allowed +- directly applying UI/OSC requests +- persistence +- shader compilation orchestration + +Design rules: + +- render consumes snapshots, not live mutable store objects +- snapshots should be cheap to read and explicit about version changes +- dynamic frame-only values may still be attached later, but the snapshot shape should stay stable + +### `ControlServices` + +`ControlServices` is the ingress boundary for non-render control sources. + +It is responsible for: + +- OSC receive and route resolution +- REST/websocket/control UI ingress +- file-watch or reload request ingress +- translating external inputs into typed internal actions/events +- low-cost buffering/coalescing where appropriate + +It should not be responsible for: + +- persistence decisions +- render snapshot building +- hardware playout policy +- direct long-lived state ownership beyond ingress-specific queues + +Design rules: + +- external inputs enter here and are normalized before they touch core state +- service-specific timing concerns stay here unless they affect whole-app policy +- no service should directly mutate render-facing state structures + +### `RenderEngine` + +`RenderEngine` is the owner of live rendering behavior. + +It is responsible for: + +- sole ownership of GL work in the target architecture +- shader program lifecycle once compilation outputs are available +- texture upload scheduling +- render-pass execution +- temporal history and shader feedback resources +- transient render-only overlays +- preview production as a subordinate output +- output-frame production for the video backend + +It should not be responsible for: + +- persistence +- user-facing control normalization +- hardware discovery/configuration +- high-level runtime mutation policy + +Design rules: + +- render consumes snapshots plus render-local transient state +- render-local state is allowed if it stays render-local +- preview must be treated as best-effort relative to playout + +### `VideoBackend` + +`VideoBackend` owns input/output device lifecycle and playout policy. + +It is responsible for: + +- input device configuration and callbacks +- output device configuration and callbacks +- frame scheduling policy +- buffer-pool ownership +- playout headroom policy +- input signal status +- backend state transitions and recovery logic + +It should not be responsible for: + +- composing frames +- owning GL contexts long-term +- validating shader parameter changes +- persistence + +Design rules: + +- this subsystem is the consumer of rendered output frames, not the owner of frame composition policy +- it should evolve toward producer/consumer playout rather than callback-driven rendering +- backend state should be explicit and reportable + +### `HealthTelemetry` + +`HealthTelemetry` owns structured operational visibility. + +It is responsible for: + +- logging +- warning/error counters +- timing traces +- subsystem health state +- degraded-mode reporting +- operator-visible health summaries + +It should not be responsible for: + +- deciding core app behavior +- owning render or backend state +- persistence policy + +Design rules: + +- all major subsystems publish health information here +- health visibility should outlive UI connection state +- modal dialogs should not be the main operational surface + +## Target Dependency Rules + +The architecture should follow these rules as closely as possible. + +Allowed dependency directions: + +- `ControlServices -> RuntimeCoordinator` +- `RuntimeCoordinator -> RuntimeStore` +- `RuntimeCoordinator -> RuntimeSnapshotProvider` +- `RuntimeCoordinator -> HealthTelemetry` +- `RuntimeSnapshotProvider -> RuntimeStore` +- `RenderEngine -> RuntimeSnapshotProvider` +- `RenderEngine -> HealthTelemetry` +- `VideoBackend -> RenderEngine` +- `VideoBackend -> HealthTelemetry` + +Conditionally allowed during migration: + +- `ControlServices -> HealthTelemetry` +- `ControlServices -> RuntimeStore` only through temporary compatibility shims + +Not allowed in the target design: + +- `RenderEngine -> RuntimeStore` +- `RenderEngine -> ControlServices` +- `VideoBackend -> RuntimeStore` +- `ControlServices -> RenderEngine` for direct mutation +- `RuntimeStore -> RenderEngine` +- `HealthTelemetry -> any subsystem` for control flow + +The key principle is: + +- store owns durable data +- coordinator owns mutation policy +- snapshot provider owns render-facing state publication +- render owns live GPU execution +- backend owns device timing +- telemetry observes all of them + +## State Ownership Model + +The app has several different kinds of state, and Phase 1 should name them explicitly. + +### Persisted State + +Owned by `RuntimeStore`. + +Examples: + +- layer stack structure +- selected shader ids +- saved parameter values +- runtime host config +- stack presets + +### Committed Live State + +Owned logically by `RuntimeCoordinator`, stored in the store or a live-state companion depending on future implementation. + +Examples: + +- current operator-selected parameter values +- current bypass state +- current selected shader for each layer + +This is state that should normally survive until explicitly changed and can be persisted if policy says so. + +### Transient Live Overlay State + +Owned by the subsystem that consumes it, not by the persisted store. + +Examples: + +- active OSC overlay targets while automation is flowing +- shader feedback buffers +- temporal history textures +- queued input frames +- in-flight preview state +- playout queue state + +This is where many current issues come from. The design rule is: + +- transient state may influence output +- transient state should not masquerade as persisted truth + +### Health and Timing State + +Owned by `HealthTelemetry`. + +Examples: + +- frame pacing stats +- render timing +- late/dropped frame counters +- queue depths +- warning states + +## Target Runtime Flow + +This section describes the intended long-term flow once later phases are in place. + +### Control Mutation Flow + +1. OSC/UI/file-watch input enters `ControlServices`. +2. `ControlServices` normalizes it into an internal action or event. +3. `RuntimeCoordinator` validates and classifies the action. +4. If the action changes durable state, `RuntimeStore` is updated. +5. If the action changes render-facing state, `RuntimeSnapshotProvider` publishes a new snapshot. +6. If the action requires persistence, a persistence request is queued. +7. Health/timing observations are emitted separately. + +### Render Flow + +1. `RenderEngine` consumes the latest published snapshot. +2. `RenderEngine` combines that snapshot with render-local transient state. +3. `RenderEngine` performs uploads, pass execution, feedback/history maintenance, and output production. +4. `RenderEngine` produces: + - preview-ready output + - video-backend-ready output frames + - render timing and warning signals + +### Video Output Flow + +Target long-term flow: + +1. `RenderEngine` produces completed output frames ahead of demand. +2. `VideoBackend` consumes those frames from a bounded queue or ring buffer. +3. Device callbacks only drive dequeue/schedule/accounting behavior. +4. `HealthTelemetry` records queue depth, lateness, underruns, and recovery events. + +### Reload / Shader Rebuild Flow + +1. file-watch or manual reload enters through `ControlServices` +2. `RuntimeCoordinator` classifies the reload request +3. `RuntimeStore` and shader/package metadata are refreshed if needed +4. `RuntimeSnapshotProvider` republishes affected snapshot state +5. `RenderEngine` rebuilds render-local resources from the new snapshot/build outputs + +The important boundary here is that reload is not "a render concern that also touches persistence." It is a coordinated runtime concern with a render-local execution phase. + +## Suggested Public Interfaces + +These are not final class signatures, but they show the shape the architecture should move toward. + +### `RuntimeStore` + +Core responsibilities: + +- `LoadConfig()` +- `LoadPersistentState()` +- `SavePersistentStateSnapshot(...)` +- `GetStoredLayerStack()` +- `SetStoredLayerStack(...)` +- `GetStackPresetNames()` +- `SaveStackPreset(...)` +- `LoadStackPreset(...)` + +### `RuntimeCoordinator` + +Core responsibilities: + +- `ApplyControlMutation(...)` +- `ApplyAutomationTarget(...)` +- `ResetLayer(...)` +- `RequestReload(...)` +- `CommitOverlayState(...)` +- `PublishSnapshotIfNeeded()` +- `RequestPersistenceIfNeeded()` + +### `RuntimeSnapshotProvider` + +Core responsibilities: + +- `BuildSnapshot(...)` +- `GetLatestSnapshot()` +- `GetSnapshotVersion()` +- `PublishSnapshot(...)` + +### `ControlServices` + +Core responsibilities: + +- `StartOscIngress(...)` +- `StartWebControlIngress(...)` +- `StartFileWatchIngress(...)` +- `EnqueueControlAction(...)` +- `DrainServiceEvents(...)` + +### `RenderEngine` + +Core responsibilities: + +- `StartRenderLoop(...)` +- `ConsumeSnapshot(...)` +- `EnqueueInputFrame(...)` +- `ProduceOutputFrame(...)` +- `ResetRenderLocalState(...)` +- `HandleRebuildOutputs(...)` + +### `VideoBackend` + +Core responsibilities: + +- `ConfigureInput(...)` +- `ConfigureOutput(...)` +- `StartPlayout(...)` +- `StopPlayout(...)` +- `ConsumeRenderedFrame(...)` +- `ReportBackendState(...)` + +### `HealthTelemetry` + +Core responsibilities: + +- `RecordTimingSample(...)` +- `RecordCounterDelta(...)` +- `RaiseWarning(...)` +- `ClearWarning(...)` +- `AppendLogEntry(...)` +- `BuildHealthSnapshot()` + +## Mapping From Current Code to Target Subsystems + +This is not a one-to-one rename plan. It is a responsibility migration map. + +### Current `RuntimeHost` + +Should eventually split across: + +- `RuntimeStore` +- `RuntimeCoordinator` +- `RuntimeSnapshotProvider` +- parts of `HealthTelemetry` + +Likely examples: + +- config loading/saving -> `RuntimeStore` +- layer stack mutation validation -> `RuntimeCoordinator` +- render state building/versioning -> `RuntimeSnapshotProvider` +- timing/status setters -> `HealthTelemetry` + +### Current `RuntimeServices` + +Should eventually become mostly: + +- `ControlServices` +- a small service-hosting shell + +Likely examples: + +- OSC ingress/coalescing -> `ControlServices` +- file-watch ingress -> `ControlServices` +- deferred service coordination now done by polling -> split between `ControlServices` and event-driven coordinator calls + +### Current `OpenGLComposite` + +Should eventually split across: + +- application bootstrap shell +- `RenderEngine` +- orchestration glue that wires subsystems together + +Likely examples: + +- render-pass facing code -> `RenderEngine` +- app/service/backend bootstrap -> composition root +- runtime mutation API surface -> coordinator-facing adapter, not render owner + +### Current `OpenGLVideoIOBridge` and `DeckLinkSession` + +Should eventually align more clearly under: + +- `VideoBackend` +- `RenderEngine` + +Likely examples: + +- device callback and scheduling policy -> `VideoBackend` +- GL upload/readback/render work -> `RenderEngine` + +## Architectural Guardrails + +As later phases begin, these rules should be treated as guardrails. + +### 1. No new cross-cutting state should be added to `RuntimeHost` + +If a new feature needs durable state, place it conceptually under `RuntimeStore`. +If it needs render-local transient state, place it conceptually under `RenderEngine`. +If it needs timing/status counters, place it conceptually under `HealthTelemetry`. + +### 2. Render-local state should stay render-local + +Do not push shader feedback, temporal history, preview caches, or playout queues back into the store just to make them easy to reach from other systems. + +### 3. Device callbacks should not become a dumping ground for app work + +Callback threads should converge toward signaling and queue management, not core rendering, persistence, or control mutation. + +### 4. Persistence should not be used as a control synchronization mechanism + +Saving state is not how subsystems discover changes. Published snapshots and explicit events should handle that. + +### 5. Health reporting should observe, not coordinate + +Telemetry systems may record warnings and degraded states, but they should not become the hidden control plane for the app. + +## Migration Strategy + +Phase 1 is a design phase, but it should support incremental migration. + +Recommended order after this document: + +1. Introduce names and interfaces before moving logic. +2. Create compatibility adapters around `RuntimeHost` rather than forcing a flag day. +3. Move read-only render snapshot publication out before moving all mutation logic. +4. Move service ingress boundaries out before removing the old polling shell. +5. Isolate timing/health setters from the core store as early as practical. + +This keeps progress measurable while reducing rewrite risk. + +## Suggested Deliverables for Completing Phase 1 + +Phase 1 can reasonably be considered complete once the project has: + +- this subsystem-boundary design document +- agreed subsystem names and responsibilities +- agreed allowed dependency directions +- explicit state categories: persisted, committed live, transient overlay, health/timing +- a current-to-target responsibility map for `RuntimeHost`, `RuntimeServices`, `OpenGLComposite`, and backend/render bridge code +- a decision that later phases will build against this target rather than inventing new boundaries ad hoc + +## Open Questions For Later Phases + +These do not block Phase 1, but they should remain visible. + +- Should shader package registry ownership live entirely in `RuntimeStore`, or should compile-ready derived registry data move into the snapshot provider? +- Should committed live state be stored directly in `RuntimeStore`, or split into store plus live-session state owned by the coordinator? +- How much of shader build orchestration belongs to `RenderEngine` versus a separate build service? +- At what phase should preview become fully decoupled from playout cadence? +- Should persistence become its own `PersistenceWriter` subsystem in Phase 6, or remain an implementation detail under `RuntimeStore`? + +## Short Version + +Phase 1 should establish one simple rule for the rest of the refactor: + +- durable state lives in the store +- mutation policy lives in the coordinator +- render-facing state is published as snapshots +- external control sources enter through services +- GL work belongs to render +- hardware pacing belongs to the backend +- health visibility belongs to telemetry + +If later phases keep to that rule, the architecture will become materially more resilient without needing another round of foundational boundary changes.