Phase 2 Design: Internal Event Model
This document expands Phase 2 of ARCHITECTURE_RESILIENCE_REVIEW.md into a concrete design target.
Phase 1 established the subsystem vocabulary and moved the runtime path behind clearer collaborators. Phase 2 should now give those subsystems a safer way to coordinate than direct cross-calls, shared mutable result queues, and coarse polling loops.
Status
- Phase 2 design package: proposed.
- Phase 2 implementation: not started.
The current repo already has useful footholds:
- `ControlServices` owns OSC/web/file-watch ingress and queues service-side work.
- `RuntimeCoordinator` owns mutation validation, classification, and coordinator result policy.
- `RuntimeUpdateController` applies coordinator outcomes and bridges toward render, shader builds, broadcasts, and backend state.
- `RuntimeSnapshotProvider` publishes render-facing snapshots.
- `HealthTelemetry` owns status/timing snapshots.
Those are good boundaries. The Phase 2 job is to stop using "poll, drain, then interpret side effects" as the main coordination style between them.
Why Phase 2 Exists
The resilience review calls out three timing and ownership problems that an event model can directly improve:
- background service timing still relies on coarse sleeps and polling
- control, reload, persistence, and render-update work still travel through mixed shared state and result queues
- later render/backend refactors need a stable coordination model before they move more work across threads
The goal is not to make the app fully asynchronous in one pass. It is to introduce typed internal events so each subsystem can publish what happened without knowing who will react or how many downstream effects are needed.
Goals
Phase 2 should establish:
- a small typed event vocabulary for control, runtime, render, backend, persistence, and health coordination
- one app-owned event pump or dispatcher that can route events deterministically
- bounded queues with clear ownership and no unbounded background growth
- wakeup-driven service coordination where practical, replacing coarse polling as the default shape
- explicit event-to-command boundaries so events do not become hidden global mutation APIs
- tests for event ordering, coalescing, rejection, and dispatch side effects
Non-Goals
Phase 2 should not require:
- a dedicated render thread yet
- a full actor system
- lock-free queues everywhere
- background persistence implementation
- a complete DeckLink state machine
- final live-state layering
- replacing every direct call in one change
Those are later phases. Phase 2 provides the coordination substrate they can build on.
Current Coordination Shape
The current runtime is much cleaner than before Phase 1, but coordination is still mostly pull-based:
- `ControlServices::PollLoop(...)` drains pending OSC commits, polls runtime file changes, queues `RuntimeCoordinatorResult` objects, then sleeps.
- `RuntimeUpdateController::ProcessRuntimeWork()` consumes queued coordinator results, applies them, and then checks whether a prepared shader build is ready.
- `RuntimeCoordinatorResult` carries many downstream effects: shader build request, compile status update, transient OSC clear, runtime-state broadcast, committed-state mode, render reset scope.
- shader-build readiness is polled from the app update path.
- runtime-state broadcasts are requested by direct calls rather than by an event publication contract.
This works, but it keeps timing behavior implicit. Phase 2 should make those transitions visible as typed events.
Event Model Principles
Events say what happened
Events should describe facts:
- `OscValueReceived`
- `RuntimeMutationAccepted`
- `RuntimeMutationRejected`
- `ShaderReloadRequested`
- `ShaderBuildPrepared`
- `ShaderBuildFailed`
- `RenderSnapshotPublished`
- `RuntimeStateBroadcastRequested`
They should not be vague commands like "do everything needed now."
Commands request intent
Some work is still naturally command-shaped:
- "apply this parameter mutation"
- "request shader reload"
- "save this stack preset"
- "start backend output"
Commands enter an owner subsystem. Events leave a subsystem after the owner has accepted, rejected, or completed work.
One owner mutates each state category
Events must not become a way to bypass Phase 1 ownership:
- `RuntimeCoordinator` remains the owner of mutation policy.
- `RuntimeStore` remains the owner of durable state.
- `RuntimeSnapshotProvider` remains the owner of render snapshot publication.
- `RenderEngine` remains the owner of render-local transient state.
- `VideoBackend` remains the owner of device lifecycle and pacing.
- `HealthTelemetry` observes and reports, but does not coordinate behavior.
Event handlers should be small
Handlers should translate events into owner calls or follow-up events. They should not accumulate hidden long-lived state unless that state belongs to the handler's subsystem.
Queues must be bounded or coalesced
High-rate control traffic can arrive faster than the app should process every individual sample. Phase 2 should preserve the useful current behavior of coalescing OSC updates by route, but make the coalescing policy explicit.
Event Families
Control Events
Produced by ControlServices.
Examples:
- `OscValueReceived`
- `OscValueCoalesced`
- `OscCommitRequested`
- `HttpControlMutationRequested`
- `WebSocketClientConnected`
- `RuntimeStateBroadcastRequested`
- `FileChangeDetected`
- `ManualReloadRequested`
Primary consumers:
- `RuntimeCoordinator`
- `HealthTelemetry`
- later, a persistence writer or diagnostics publisher
Runtime Events
Produced by RuntimeCoordinator, RuntimeStore, and snapshot publication code.
Examples:
- `RuntimeMutationAccepted`
- `RuntimeMutationRejected`
- `RuntimeStateChanged`
- `RuntimePersistenceRequested`
- `RuntimeReloadRequested`
- `ShaderPackagesChanged`
- `RenderSnapshotPublishRequested`
- `RuntimeStatePresentationChanged`
Primary consumers:
- `RuntimeSnapshotProvider`
- `RenderEngine`
- `ControlServices`
- `HealthTelemetry`
- later, `PersistenceWriter`
Shader Build Events
Produced by shader build orchestration and render-side build application.
Examples:
- `ShaderBuildRequested`
- `ShaderBuildPrepared`
- `ShaderBuildApplied`
- `ShaderBuildFailed`
- `CompileStatusChanged`
Primary consumers:
- `RenderEngine`
- `RuntimeCoordinator`
- `ControlServices`
- `HealthTelemetry`
Render Events
Produced by RenderEngine and RuntimeSnapshotProvider.
Examples:
- `RenderSnapshotPublished`
- `RenderResetRequested`
- `RenderResetApplied`
- `OscOverlayApplied`
- `OscOverlaySettled`
- `FrameRendered`
- `PreviewFrameAvailable`
Primary consumers:
- `RenderEngine`
- `ControlServices`
- `VideoBackend`
- `HealthTelemetry`
Backend Events
Produced by VideoBackend and backend adapters.
Examples:
- `InputSignalChanged`
- `InputFrameArrived`
- `OutputFrameScheduled`
- `OutputFrameCompleted`
- `OutputLateFrameDetected`
- `OutputDroppedFrameDetected`
- `BackendStateChanged`
Primary consumers:
- `RenderEngine`
- `HealthTelemetry`
- later, backend lifecycle state machine handlers
Health Events
Produced by all major subsystems.
Examples:
- `SubsystemWarningRaised`
- `SubsystemWarningCleared`
- `SubsystemRecovered`
- `TimingSampleRecorded`
- `QueueDepthChanged`
Primary consumer:
`HealthTelemetry`
Health events should be observational. They should not be required for core behavior to proceed.
Event Envelope
A practical initial event envelope can stay simple:
```cpp
enum class RuntimeEventType
{
    OscCommitRequested,
    RuntimeMutationAccepted,
    RuntimeMutationRejected,
    RuntimeReloadRequested,
    ShaderBuildRequested,
    ShaderBuildPrepared,
    ShaderBuildFailed,
    RenderSnapshotPublishRequested,
    RenderSnapshotPublished,
    RuntimeStateBroadcastRequested,
    BackendStateChanged,
    SubsystemWarningRaised
};

struct RuntimeEvent
{
    RuntimeEventType type;
    uint64_t sequence = 0;
    std::chrono::steady_clock::time_point createdAt;
    std::string source;
    std::variant<
        OscCommitRequestedEvent,
        RuntimeMutationEvent,
        ShaderBuildEvent,
        RenderSnapshotEvent,
        BackendEvent,
        HealthEvent> payload;
};
```
The exact C++ names can change. The key design requirements are:
- event type is explicit
- event order is observable
- source subsystem is recorded
- payload is typed, not a bag of optional strings
- timestamps exist for queue-age telemetry
- failures are events too, not just debug strings
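To illustrate the "typed, not a bag of optional strings" requirement, a handler can dispatch on the concrete payload type with `std::visit`. This is a sketch only: the two payload structs below are minimal stand-ins for the real ones, not their actual definitions.

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical minimal payloads standing in for the real event structs.
struct ShaderBuildEvent { std::string packageName; bool succeeded; };
struct HealthEvent { std::string message; };

using Payload = std::variant<ShaderBuildEvent, HealthEvent>;

// Handlers branch on the concrete payload type instead of probing
// nullable fields; the compiler enforces exhaustiveness.
std::string DescribePayload(const Payload& payload)
{
    return std::visit(
        [](const auto& event) -> std::string
        {
            using T = std::decay_t<decltype(event)>;
            if constexpr (std::is_same_v<T, ShaderBuildEvent>)
                return "shader build: " + event.packageName;
            else
                return "health: " + event.message;
        },
        payload);
}
```

Adding a new payload type to the variant then forces every visitor to handle it, which keeps the envelope honest as the vocabulary grows.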
Event Bus Shape
Phase 2 does not need a large framework. A small app-owned dispatcher is enough.
Suggested components:
- `RuntimeEventBus`
  - owns queues
  - assigns sequence numbers
  - exposes `Publish(...)`
  - exposes `Drain(...)` or `DispatchPending(...)`
- `RuntimeEventHandler`
  - narrow handler interface or function callback
  - registered by subsystem/composition root
- `RuntimeEventQueue`
  - bounded FIFO for ordinary events
  - coalesced map for latest-value events such as high-rate OSC
- `RuntimeEventMetrics`
  - queue depth
  - oldest event age
  - dropped/coalesced counts
Initial implementation can be single-process and mostly single-dispatch-thread. The important part is that event publication and event handling become explicit.
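As a rough shape for that single-dispatch-thread starting point, a bounded FIFO bus could look like the sketch below. All names (`MiniEventBus`, `kMaxQueueDepth`, the trimmed `Event` struct) are placeholders, not the final API.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: a minimal single-threaded event pump.
enum class EventType { ShaderBuildRequested, RuntimeMutationRejected };

struct Event
{
    EventType type;
    uint64_t sequence = 0;
    std::string source;
};

class MiniEventBus
{
public:
    using Handler = std::function<void(const Event&)>;

    void Subscribe(Handler handler) { handlers_.push_back(std::move(handler)); }

    // Publish assigns the sequence number; events queue until DispatchPending.
    void Publish(Event event)
    {
        event.sequence = ++nextSequence_;
        if (queue_.size() >= kMaxQueueDepth)
        {
            ++droppedCount_;  // bounded queue: drop and count, never grow unbounded
            return;
        }
        queue_.push_back(std::move(event));
    }

    // Drain in FIFO order on the owning thread; returns dispatch count.
    std::size_t DispatchPending()
    {
        std::size_t dispatched = 0;
        while (!queue_.empty())
        {
            Event event = std::move(queue_.front());
            queue_.pop_front();
            for (const auto& handler : handlers_)
                handler(event);
            ++dispatched;
        }
        return dispatched;
    }

    uint64_t DroppedCount() const { return droppedCount_; }

private:
    static constexpr std::size_t kMaxQueueDepth = 1024;
    std::deque<Event> queue_;
    std::vector<Handler> handlers_;
    uint64_t nextSequence_ = 0;
    uint64_t droppedCount_ = 0;
};
```

Because publication and dispatch are separate calls, the app loop (or a future service thread) decides when handlers run, which keeps render hot paths off the bus.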
Queue Policy
Not every event deserves the same queue semantics.
FIFO Events
Use FIFO for events where every item matters:
- mutation accepted/rejected
- shader build completed/failed
- backend state changed
- warning raised/cleared
Coalesced Events
Use coalescing for high-rate latest-value flows:
- OSC parameter target updates by route
- runtime-state broadcast requests
- file-change reload requests during a burst
- queue-depth telemetry
Coalesced events should record how many updates were collapsed so telemetry can show pressure.
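One way to make that policy explicit is a latest-value slot per route that counts how many samples it absorbed. This is a sketch under assumed types (`float` values, string routes); the real queue would hold typed payloads.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: latest-value coalescing keyed by route.
class CoalescedValueQueue
{
public:
    void Push(const std::string& route, float value)
    {
        auto [it, inserted] = slots_.try_emplace(route, Slot{value, 0});
        if (!inserted)
        {
            it->second.value = value;  // newer value replaces the older one
            ++it->second.collapsed;    // remember how many samples were folded in
        }
    }

    // Pop returns the latest value and how many updates it absorbed,
    // so telemetry can report coalescing pressure.
    std::optional<std::pair<float, uint32_t>> Pop(const std::string& route)
    {
        auto it = slots_.find(route);
        if (it == slots_.end())
            return std::nullopt;
        auto result = std::make_pair(it->second.value, it->second.collapsed);
        slots_.erase(it);
        return result;
    }

private:
    struct Slot { float value; uint32_t collapsed; };
    std::unordered_map<std::string, Slot> slots_;
};
```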
Synchronous Boundaries
Some calls may remain synchronous during Phase 2:
- UI/API mutation calls that need an immediate success/error response
- startup configuration failures
- shutdown ordering
- tests
The rule is that synchronous calls should still publish events for accepted/rejected/completed work, so the rest of the app does not need to infer side effects from the call path.
Target Flow Examples
OSC Parameter Update
- `OscServer` decodes a packet.
- `ControlServices` publishes or coalesces `OscValueReceived`.
- The dispatcher routes the event to the render-overlay path or coordinator policy, depending on whether the value is transient or committing.
- `RuntimeCoordinator` publishes `RuntimeMutationAccepted` or `RuntimeMutationRejected` for committed changes.
- Accepted committed changes publish `RenderSnapshotPublishRequested` and `RuntimePersistenceRequested` as needed.
- `ControlServices` receives `RuntimeStateBroadcastRequested` or a presentation-changed event and broadcasts at its own cadence.
File Reload
- File-watch or manual reload produces `FileChangeDetected` or `ManualReloadRequested`.
- `ControlServices` coalesces reload bursts into one `RuntimeReloadRequested`.
- `RuntimeCoordinator` classifies the reload.
- Package/store refresh produces `ShaderPackagesChanged` if package metadata changed.
- Coordinator publishes `ShaderBuildRequested`.
- Shader build completion publishes `ShaderBuildPrepared` or `ShaderBuildFailed`.
- Render applies the ready build and publishes `ShaderBuildApplied`.
Runtime State Broadcast
- A mutation or reload publishes `RuntimeStatePresentationChanged`.
- `ControlServices` coalesces this into a broadcast request.
- The broadcast path asks `RuntimeStatePresenter` for the current presentation read model.
- `HealthTelemetry` records broadcast count, failures, and queue age.
Backend Signal Change
- Backend adapter detects input signal change.
- `VideoBackend` publishes `InputSignalChanged`.
- `HealthTelemetry` records the new signal status.
- Later phases may let the backend lifecycle state machine react to the same event.
Migration Plan
Step 1. Add Event Types And A Minimal Dispatcher
Introduce:
- `RuntimeEvent`
- `RuntimeEventType`
- typed payload structs for the smallest useful event family
- `RuntimeEventBus` or equivalent dispatcher
Start with events that do not change behavior:
- `RuntimeStateBroadcastRequested`
- `ShaderBuildRequested`
- `RuntimeMutationRejected`
- simple health/log observations
Step 2. Convert RuntimeUpdateController Into An Event Handler
RuntimeUpdateController is already close to an event effect applier. Phase 2 should narrow it into a handler for:
- coordinator outcome events
- shader build readiness events
- snapshot publication requests
- broadcast requests
The class should stop being the place that polls every source of work.
Step 3. Replace ControlServices::PollLoop Sleep With Wakeups
Keep coalescing, but replace the fixed `25 × Sleep(10)` cadence with:
- a condition variable or waitable event
- wakeups when OSC commit work arrives
- wakeups when file/reload work arrives
- a fallback timer only for compatibility polling that cannot yet be evented
This is the most direct Phase 2 timing win.
Step 4. Route Shader Build Lifecycle Through Events
Turn the current request/apply/failure/success path into explicit events:
- `ShaderBuildRequested`
- `ShaderBuildPrepared`
- `ShaderBuildFailed`
- `ShaderBuildApplied`
- `CompileStatusChanged`
This should preserve the current off-frame-path compile behavior while making readiness visible.
Step 5. Route Runtime Broadcasts Through Events
Replace direct "broadcast now" decisions with:
- `RuntimeStatePresentationChanged`
- `RuntimeStateBroadcastRequested`
- `RuntimeStateBroadcastCompleted`
- `RuntimeStateBroadcastFailed`
This keeps UI delivery in ControlServices while keeping presentation ownership in the runtime presentation layer.
Step 6. Add Event Metrics
Before using the event system for hotter paths, add metrics:
- event queue depth
- oldest event age
- event dispatch duration
- coalesced event count
- dropped event count
- handler failure count
These should feed HealthTelemetry.
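Depth and oldest-event age fall directly out of a timestamped FIFO. A sketch under assumed names (`QueuedEvent`, `SampleMetrics` are placeholders for whatever the real metrics layer calls them):

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical sketch: metrics derived from a timestamped event queue.
struct QueuedEvent
{
    std::chrono::steady_clock::time_point createdAt;
};

struct EventQueueMetrics
{
    std::size_t depth = 0;
    std::chrono::milliseconds oldestAge{0};
};

inline EventQueueMetrics SampleMetrics(const std::deque<QueuedEvent>& queue,
                                       std::chrono::steady_clock::time_point now)
{
    EventQueueMetrics metrics;
    metrics.depth = queue.size();
    if (!queue.empty())
    {
        // FIFO: the front entry is the oldest event still waiting.
        metrics.oldestAge = std::chrono::duration_cast<std::chrono::milliseconds>(
            now - queue.front().createdAt);
    }
    return metrics;
}
```

Sampling on every dispatch pass keeps the cost trivial while giving HealthTelemetry a direct read on queue pressure.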
Dependency Rules
Allowed:
- producers publish events to the bus
- the composition root registers handlers
- handlers call owner subsystem APIs
- `HealthTelemetry` observes event metrics and failures
Avoid:
- subsystems subscribing directly to each other in constructors
- event handlers mutating state outside their owner subsystem
- using one global event payload with many nullable fields
- making render hot paths block on the event bus
- requiring health/telemetry event delivery for core behavior
The dispatcher is coordination infrastructure, not a new domain owner.
Testing Strategy
Phase 2 should add tests that do not require GL, DeckLink, or network sockets.
Recommended tests:
- FIFO events dispatch in sequence order
- coalesced events keep the latest payload and count collapsed updates
- rejected mutations publish rejection events without downstream snapshot/build events
- accepted parameter mutations publish the expected follow-up event set
- file reload bursts collapse into one reload request
- handler failures are reported as health/log events
- queue depth and oldest-event-age metrics update predictably
The existing runtime subsystem tests are a good home for the first pure event model tests, or a new `RuntimeEventTests.cpp` target can be added if the event layer grows enough.
Phase 2 Exit Criteria
Phase 2 can be considered complete once the project can say:
- there is a typed internal event envelope and dispatcher
- `ControlServices` emits typed events for OSC commits, broadcast requests, and reload/file-change work
- `RuntimeCoordinator` publishes explicit accepted/rejected/follow-up events instead of callers interpreting broad result objects everywhere
- `RuntimeUpdateController` handles events rather than polling all runtime work sources directly
- shader build request/readiness/failure/application is represented as events
- runtime-state broadcasts are event-driven and coalesced
- event queues expose depth, age, coalescing, and failure metrics
- coarse sleep polling is no longer the default coordination model for service work
Open Questions For Implementation
- Should the first dispatcher be single-threaded and pumped by the app loop, or should `ControlServices` own a dedicated service event thread?
- Should high-rate OSC transient overlay events go through the same bus, or should only commit/settle events enter the bus initially?
- Should event payloads use `std::variant`, type-erased handlers, or separate strongly typed queues per family?
- How much of `RuntimeCoordinatorResult` should survive as an internal helper versus being replaced by explicit events?
- Should persistence requests be represented in Phase 2 even though the background writer lands later?
- Should backend callback events be introduced now as observation-only events, or wait until the backend state-machine phase?
Short Version
Phase 2 should give the app a typed nervous system.
- external inputs become typed events
- owner subsystems still make decisions
- decisions publish explicit outcomes
- follow-up work is routed by handlers, not inferred from scattered call paths
- high-rate work is bounded or coalesced
- timing and queue pressure become observable
If this boundary holds, later render-thread, persistence, backend, and telemetry work can move independently without returning to shared-object polling as the default coordination model.