Phase 7 Design: Backend Lifecycle And Playout

This document expands Phase 7 of ARCHITECTURE_RESILIENCE_REVIEW.md into a concrete design target.

Phase 4 made the render thread the sole owner of normal runtime GL work, but output timing is still callback-coupled: DeckLink completion callbacks synchronously request render-thread output production before scheduling the next hardware frame. Phase 7 should make backend lifecycle, buffer policy, playout headroom, and recovery explicit.

Status

  • Phase 7 design package: proposed.
  • Phase 7 implementation: not started.
  • Current alignment: VideoBackend, VideoIODevice, DeckLinkSession, and VideoPlayoutScheduler exist. Phase 4 removed callback-thread GL ownership, but the DeckLink completion path still waits for render-thread output production.

Current backend footholds:

  • VideoBackend wraps device discovery/configuration, start/stop, input callback handling, output completion handling, and telemetry publication.
  • DeckLinkSession owns DeckLink device handles, frame pool creation, preroll, keyer configuration, and scheduled playback.
  • VideoPlayoutScheduler owns basic schedule time generation and simple late/drop skip-ahead behavior.
  • OpenGLVideoIOBridge is the current adapter between VideoBackend and RenderEngine.
  • HealthTelemetry receives a subset of signal, render, and pacing stats.

Why Phase 7 Exists

The current output path works only while render/readback stays comfortably inside budget. A late render makes the completion callback late, which reduces device-side headroom, which makes the next callback even more fragile: the failure mode compounds.

The resilience review calls this the main remaining live-resilience risk after Phase 4:

  • output playout is still effectively render-on-demand from the DeckLink completion callback
  • buffer pool size and preroll depth are not sourced from one policy
  • late/dropped recovery is a fixed skip rule
  • backend lifecycle is imperative rather than represented as explicit states

Phase 7 should separate hardware timing from render production.

Goals

Phase 7 should establish:

  • explicit backend lifecycle states and allowed transitions
  • one playout policy for frame pool size, preroll, headroom, and underrun behavior
  • a bounded producer/consumer output queue between render and DeckLink scheduling
  • lightweight DeckLink callbacks that dequeue/schedule/account rather than render
  • measured recovery from late/dropped frames
  • structured backend health reporting
  • tests for scheduler, queue, lifecycle, and underrun policy without DeckLink hardware

Non-Goals

Phase 7 should not require:

  • a new renderer
  • changing shader/state composition
  • replacing DeckLink support with multiple backends
  • full telemetry UI redesign
  • removing every synchronous API immediately
  • perfect adaptive latency policy in the first pass

Target Timing Model

The target model is producer/consumer playout:

RenderEngine/render scheduler produces completed output frames
  -> bounded ready-frame queue
  -> VideoBackend consumes ready frames
  -> DeckLink callback schedules already-prepared frames

The callback should not wait for rendering. It should:

  • record completion result
  • recycle/release completed buffers
  • dequeue a ready frame or apply underrun policy
  • schedule the next frame
  • publish backend timing/health observations
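The callback responsibilities above can be sketched as a dequeue-or-fallback flow. This is a minimal illustration, not the project's actual API: `ReadyQueue`, `CallbackStats`, and the reuse-last-frame fallback are all assumptions, and the real path would recycle buffers and call the DeckLink schedule API where the comments indicate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

enum class CompletionResult { Completed, Late, Dropped, Flushed };

struct ReadyFrame { uint64_t frame_index; };

// Stand-in for the bounded ready-frame queue between render and backend.
class ReadyQueue {
public:
    void push(ReadyFrame f) { frames_.push_back(f); }
    std::optional<ReadyFrame> pop() {
        if (frames_.empty()) return std::nullopt;
        ReadyFrame f = frames_.front();
        frames_.pop_front();
        return f;
    }
    std::size_t depth() const { return frames_.size(); }
private:
    std::deque<ReadyFrame> frames_;
};

struct CallbackStats { uint64_t scheduled = 0; uint64_t underruns = 0; };

// The whole callback body: record, recycle (elided), dequeue or apply
// underrun policy, schedule. It never waits for rendering.
void onFrameCompleted(CompletionResult result, ReadyQueue& queue,
                      std::optional<ReadyFrame>& lastScheduled,
                      CallbackStats& stats) {
    (void)result;  // record completion result / late-drop accounting here
    std::optional<ReadyFrame> next = queue.pop();
    if (!next) {
        ++stats.underruns;
        next = lastScheduled;  // underrun policy: reuse newest scheduled frame
    }
    if (next) {
        lastScheduled = next;
        ++stats.scheduled;  // stand-in for scheduling on the device
    }
}
```

The point of the shape is that every branch is O(1) and lock-scope-free apart from the queue access, so the callback stays bounded even when render falls behind.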

Target Lifecycle Model

Suggested backend states:

  1. Uninitialized
  2. Discovering
  3. Discovered
  4. Configuring
  5. Configured
  6. Prerolling
  7. Running
  8. Degraded
  9. Stopping
  10. Stopped
  11. Failed

Suggested transition rules:

  • Uninitialized -> Discovering
  • Discovering -> Discovered | Failed
  • Discovered -> Configuring | Stopped
  • Configuring -> Configured | Failed
  • Configured -> Prerolling | Stopped
  • Prerolling -> Running | Failed | Stopping
  • Running -> Degraded | Stopping | Failed
  • Degraded -> Running | Stopping | Failed
  • Stopping -> Stopped

The exact enum can change, but the lifecycle should become observable and testable.
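One way to make the suggested transitions observable and testable is a pure predicate over the enum. This is a sketch assuming the state names listed above; it is not the project's actual `VideoBackendStateMachine`, and the terminal-state handling is an illustrative choice.

```cpp
#include <cassert>
#include <initializer_list>

enum class BackendState {
    Uninitialized, Discovering, Discovered, Configuring, Configured,
    Prerolling, Running, Degraded, Stopping, Stopped, Failed
};

// Returns true when `to` is a legal successor of `from`, mirroring the
// transition table in the design text.
bool isAllowedTransition(BackendState from, BackendState to) {
    auto any = [to](std::initializer_list<BackendState> allowed) {
        for (BackendState s : allowed) if (s == to) return true;
        return false;
    };
    switch (from) {
        case BackendState::Uninitialized: return any({BackendState::Discovering});
        case BackendState::Discovering:   return any({BackendState::Discovered, BackendState::Failed});
        case BackendState::Discovered:    return any({BackendState::Configuring, BackendState::Stopped});
        case BackendState::Configuring:   return any({BackendState::Configured, BackendState::Failed});
        case BackendState::Configured:    return any({BackendState::Prerolling, BackendState::Stopped});
        case BackendState::Prerolling:    return any({BackendState::Running, BackendState::Failed, BackendState::Stopping});
        case BackendState::Running:       return any({BackendState::Degraded, BackendState::Stopping, BackendState::Failed});
        case BackendState::Degraded:      return any({BackendState::Running, BackendState::Stopping, BackendState::Failed});
        case BackendState::Stopping:      return any({BackendState::Stopped});
        default:                          return false;  // Stopped/Failed treated as terminal
    }
}
```

Because the predicate is pure, invalid-transition tests need no DeckLink hardware or threads.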

Proposed Collaborators

VideoBackendStateMachine

Pure or mostly pure lifecycle transition helper.

Responsibilities:

  • validate state transitions
  • produce transition observations
  • track failure reasons
  • keep start/stop/recovery behavior auditable

Non-responsibilities:

  • DeckLink API calls
  • rendering
  • persistence

PlayoutPolicy

Policy object for queue and timing behavior.

Expected fields:

  • target preroll frames
  • maximum ready frames
  • minimum spare device buffers
  • underrun behavior
  • maximum catch-up frames
  • adaptive headroom enabled/disabled
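The fields above could live in one value object, with derived quantities (such as device pool size) computed from the policy rather than declared separately. Field names, defaults, and the derivation rule here are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>

enum class UnderrunBehavior {
    ReuseNewestCompleted, ReuseLastScheduled, ScheduleBlack, SkipAhead
};

struct PlayoutPolicy {
    std::size_t targetPrerollFrames = 3;
    std::size_t maxReadyFrames = 4;
    std::size_t minSpareDeviceBuffers = 2;
    UnderrunBehavior underrun = UnderrunBehavior::ReuseNewestCompleted;
    std::size_t maxCatchUpFrames = 2;
    bool adaptiveHeadroom = false;

    // The pool must cover prerolled frames in flight, the ready queue at
    // its maximum depth, and spare buffers for render/readback.
    std::size_t devicePoolSize() const {
        return targetPrerollFrames + maxReadyFrames + minSpareDeviceBuffers;
    }
};
```

Deriving the pool size this way is what Step 2 of the migration plan means by "frame pool size derives from policy": changing one knob cannot silently desynchronize the others.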

RenderOutputQueue

Bounded queue or ring for completed output frames.

Responsibilities:

  • accept completed render outputs
  • expose ready frames for scheduling
  • track depth, drops, stale reuse, and underruns
  • keep ownership/lifetime clear between render and backend
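A minimal bounded queue with the depth/drop/underrun accounting listed above might look like the following. Drop-newest on overflow is one possible policy among several; the class shape is a sketch, not the planned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

struct OutputFrame { uint64_t index; };

class RenderOutputQueue {
public:
    explicit RenderOutputQueue(std::size_t maxDepth) : maxDepth_(maxDepth) {}

    // Returns false (and counts a drop) when the queue is already full.
    bool push(OutputFrame frame) {
        if (frames_.size() >= maxDepth_) { ++drops_; return false; }
        frames_.push_back(frame);
        return true;
    }

    // Returns the oldest ready frame, or nullopt (and counts an underrun).
    std::optional<OutputFrame> pop() {
        if (frames_.empty()) { ++underruns_; return std::nullopt; }
        OutputFrame f = frames_.front();
        frames_.pop_front();
        return f;
    }

    std::size_t depth() const { return frames_.size(); }
    uint64_t drops() const { return drops_; }
    uint64_t underruns() const { return underruns_; }

private:
    std::size_t maxDepth_;
    std::deque<OutputFrame> frames_;
    uint64_t drops_ = 0;
    uint64_t underruns_ = 0;
};
```

In the real queue the elements would carry device-buffer ownership, so push/pop also define the hand-off point between render and backend.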

OutputFramePool

Backend-owned device buffer pool.

Responsibilities:

  • own DeckLink mutable frames
  • expose available buffers for render/readback or scheduling
  • recycle completed frames
  • report spare-buffer depth

PlayoutController

Coordinates policy, ready frames, device schedule times, and completion accounting.

Responsibilities:

  • preroll frames
  • schedule next frame
  • handle late/drop/completed/flushed results
  • apply underrun policy
  • publish timing state

Output Queue Policy

The initial output queue should be small and bounded.

Candidate defaults:

  • target ready frames: 2-3
  • max ready frames: 3-5
  • underrun: reuse last completed frame if available, otherwise black
  • late/drop: increase degraded counters and optionally increase headroom within limits

The exact numbers should be measured, but the policy should live in one place instead of being split across constants.

Underrun Policy

When no fresh rendered frame is available, options are:

  1. reuse newest completed frame
  2. reuse last scheduled frame
  3. schedule black/degraded frame
  4. skip/catch up schedule time

Phase 7 should pick one default and make it visible in telemetry. Reusing the newest completed frame is often the best first policy for live visual continuity, but key/fill behavior may require careful testing.
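One shape for the underrun decision: given the policy choice and whatever frames are still at hand, pick a frame to reschedule, or none (meaning schedule black / skip ahead). Names are illustrative; option 1 with a last-scheduled fallback reflects the suggested default, not a decided design.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class UnderrunPolicy {
    ReuseNewestCompleted, ReuseLastScheduled, ScheduleBlack, SkipAhead
};

struct Frame { uint64_t index; };

std::optional<Frame> resolveUnderrun(UnderrunPolicy policy,
                                     std::optional<Frame> newestCompleted,
                                     std::optional<Frame> lastScheduled) {
    switch (policy) {
        case UnderrunPolicy::ReuseNewestCompleted:
            // Fall back to last scheduled, then to black, if nothing
            // has completed yet (e.g. during early preroll).
            return newestCompleted ? newestCompleted : lastScheduled;
        case UnderrunPolicy::ReuseLastScheduled:
            return lastScheduled;
        default:
            return std::nullopt;  // caller schedules black or skips ahead
    }
}
```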

Migration Plan

Step 1. Name Lifecycle States

Introduce a backend state enum and transition reporting while leaving scheduling behavior essentially unchanged.

Initial target:

  • state changes are explicit
  • invalid transitions are detectable
  • tests cover allowed transitions

Step 2. Create Playout Policy Object

Unify fixed constants and scheduler assumptions.

Initial target:

  • frame pool size derives from policy
  • preroll count derives from policy
  • late/drop recovery reads policy

Step 3. Add Ready Output Queue

Introduce a bounded queue for completed output frames.

Initial target:

  • pure queue tests
  • explicit depth/underrun metrics
  • no DeckLink dependency in queue tests

Step 4. Move Callback Toward Dequeue/Schedule

Stop producing frames directly in the completion callback path.

Transitional target:

  • callback wakes/schedules a backend worker
  • worker consumes ready frames

Final target:

  • callback only records, recycles, dequeues, schedules

Step 5. Make Render Produce Ahead

Teach render/output code to keep the ready queue filled to target headroom.

Initial target:

  • render thread produces on demand until queue has target depth
  • callback does not synchronously wait for fresh render
  • stale/black fallback is explicit on underrun
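The produce-ahead loop is small in shape: render only while the ready queue is below target depth. The sketch below uses a plain `std::deque<int>` of frame ids as a stand-in for the real queue and production path; every name is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Fill the ready queue up to targetDepth, producing one frame per
// iteration. Returns how many frames were produced this pass.
std::size_t fillToTarget(std::deque<int>& readyQueue, std::size_t targetDepth,
                         int& nextFrameId) {
    std::size_t produced = 0;
    while (readyQueue.size() < targetDepth) {
        readyQueue.push_back(nextFrameId++);  // renderNextFrame() in the real path
        ++produced;
    }
    return produced;
}
```

Whether this loop runs on backend demand, a timer, or queue-fill pressure is exactly the open question listed at the end of this document; the loop shape is the same either way.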

Step 6. Replace Fixed Late/Drop Recovery

Replace fixed +2 schedule-index recovery with measured lag/headroom accounting.

Initial target:

  • track scheduled index, completed index, queue depth, late streak, drop streak
  • recovery decisions use measured lag
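A measured replacement for the fixed +2 skip could compute how far the schedule index lags the completion clock and advance by that lag, clamped by policy. The index bookkeeping below is an illustrative sketch, not the current scheduler's fields.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct PlayoutLagState {
    uint64_t nextScheduleIndex = 0;   // next frame index we intend to schedule
    uint64_t lastCompletedIndex = 0;  // newest hardware-completed frame index
};

// Returns the schedule index to use after a late/dropped completion:
// advance by measured lag, never by more than maxCatchUpFrames.
uint64_t catchUpIndex(const PlayoutLagState& s, uint64_t maxCatchUpFrames) {
    if (s.lastCompletedIndex < s.nextScheduleIndex)
        return s.nextScheduleIndex;  // still ahead of hardware: no skip
    uint64_t lag = s.lastCompletedIndex + 1 - s.nextScheduleIndex;
    uint64_t skip = std::min(lag, maxCatchUpFrames);
    return s.nextScheduleIndex + skip;
}
```

The clamp keeps a single bad frame from triggering a large visible jump, while a sustained stall still converges toward the hardware clock over a few callbacks.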

Step 7. Route Backend Health Structurally

Publish backend lifecycle, queue depth, underrun, late/drop, and degraded-state observations through HealthTelemetry.

Testing Strategy

Recommended tests:

  • allowed lifecycle transitions pass
  • invalid lifecycle transitions fail
  • playout policy derives frame pool/preroll sizes consistently
  • output queue preserves ordering
  • bounded output queue rejects/drops according to policy
  • underrun reuses last frame or black according to policy
  • late/drop accounting updates degraded state
  • scheduler catch-up uses measured lag, not fixed skip
  • stop drains/recycles device-frame ownership in pure fakes

Useful homes:

  • VideoPlayoutSchedulerTests for scheduler evolution
  • VideoIODeviceFakeTests for fake backend lifecycle
  • a new VideoBackendStateMachineTests
  • a new RenderOutputQueueTests

Risks

Latency Risk

More headroom means more latency. Phase 7 should make latency a visible policy choice.

Buffer Lifetime Risk

Render and backend will share ownership boundaries around output buffers. Frame ownership must be explicit to avoid reuse while hardware still owns a frame.

Underrun Policy Risk

Reusing stale frames can be visually acceptable, but wrong key/fill behavior may be worse than black. Test with real output.

Callback Thread Risk

Even after decoupling render, callback work must stay small and bounded.

Scope Risk

Backend lifecycle and playout queue are related, but either can grow large. Implement in small, testable slices.

Phase 7 Exit Criteria

Phase 7 can be considered complete once the project can say:

  • backend lifecycle states and transitions are explicit
  • playout policy owns preroll, pool size, headroom, and underrun behavior
  • output callbacks no longer synchronously wait for render production
  • render produces completed output frames into a bounded queue
  • underrun behavior is explicit and observable
  • late/drop recovery is measured rather than fixed skip-only
  • backend health reports lifecycle, queue, underrun, late, and dropped state
  • queue/lifecycle/scheduler behavior has non-DeckLink tests

Open Questions

  • What should the default ready-frame depth be at 30fps and 60fps?
  • Should underrun reuse last completed, last scheduled, or black?
  • Should output queue depth be user-configurable?
  • Should render cadence be driven by backend demand, a timer, or queue-fill pressure?
  • How should external keying influence stale-frame/black fallback?
  • Should input and output lifecycle states be separate endpoints under one backend shell?

Short Version

Phase 7 should stop making DeckLink callbacks wait for render.

Render produces ahead into a bounded queue. The backend consumes ready frames according to explicit lifecycle and playout policy. Queue depth, underruns, late frames, dropped frames, and degraded states become measured and visible.