video-shader-toys/docs/PHASE_7_BACKEND_LIFECYCLE_PLAYOUT_DESIGN.md
Aiden d332dceb5b
2026-05-11 19:25:29 +10:00


# Phase 7 Design: Backend Lifecycle And Playout
This document expands Phase 7 of [ARCHITECTURE_RESILIENCE_REVIEW.md](./ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.
Phase 4 made the render thread the sole owner of normal runtime GL work, but output timing is still callback-coupled: DeckLink completion callbacks synchronously request render-thread output production before scheduling the next hardware frame. Phase 7 should make backend lifecycle, buffer policy, playout headroom, and recovery explicit.
Phase 5 clarified that live parameter layering stops at final render-state composition. Phase 7 should keep backend lifecycle, output queue ownership, buffer reuse, temporal/feedback resources, and stale-frame/underrun policy outside the persisted/committed/transient parameter model.
## Status
- Phase 7 design package: proposed.
- Phase 7 implementation: not started.
- Current alignment: `VideoBackend`, `VideoIODevice`, `DeckLinkSession`, and `VideoPlayoutScheduler` exist. Phase 4 removed callback-thread GL ownership, but the DeckLink completion path still waits for render-thread output production.
Current backend footholds:
- `VideoBackend` wraps device discovery/configuration, start/stop, input callback handling, output completion handling, and telemetry publication.
- `DeckLinkSession` owns DeckLink device handles, frame pool creation, preroll, keyer configuration, and scheduled playback.
- `VideoPlayoutScheduler` owns basic schedule time generation and simple late/drop skip-ahead behavior.
- `OpenGLVideoIOBridge` is the current adapter between `VideoBackend` and `RenderEngine`.
- `HealthTelemetry` receives a subset of signal, render, and pacing stats.
## Why Phase 7 Exists
The current output path works only while render/readback stays comfortably inside budget. A late render can make the callback late, which reduces device-side headroom, which makes the next callback more fragile.
The resilience review calls this the main remaining live-resilience risk after Phase 4:
- output playout is still effectively render-on-demand from the DeckLink completion callback
- buffer pool size and preroll depth are not sourced from one policy
- late/dropped recovery is a fixed skip rule
- backend lifecycle is imperative rather than represented as explicit states
Phase 7 should separate hardware timing from render production.
## Goals
Phase 7 should establish:
- explicit backend lifecycle states and allowed transitions
- one playout policy for frame pool size, preroll, headroom, and underrun behavior
- a bounded producer/consumer output queue between render and DeckLink scheduling
- lightweight DeckLink callbacks that dequeue/schedule/account rather than render
- measured recovery from late/dropped frames
- structured backend health reporting
- tests for scheduler, queue, lifecycle, and underrun policy without DeckLink hardware
## Non-Goals
Phase 7 should not require:
- a new renderer
- changing shader/state composition
- changing committed-live or transient automation layering
- replacing DeckLink support with multiple backends
- full telemetry UI redesign
- removing every synchronous API immediately
- perfect adaptive latency policy in the first pass
## Target Timing Model
The target model is producer/consumer playout:
```text
RenderEngine/render scheduler produces completed output frames
-> bounded ready-frame queue
-> VideoBackend consumes ready frames
-> DeckLink callback schedules already-prepared frames
```
The callback should not wait for rendering. It should:
- record completion result
- recycle/release completed buffers
- dequeue a ready frame or apply underrun policy
- schedule the next frame
- publish backend timing/health observations
The queue contains rendered output-frame ownership and scheduling metadata, not live parameter state. Parameter composition should already be resolved before an output frame enters this playout boundary.
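The intended callback shape can be sketched as a small, non-blocking function. This is illustrative only; `Backend`, `onFrameCompleted`, and the buffer-id representation are hypothetical stand-ins, not the project's actual API.

```cpp
#include <deque>
#include <optional>
#include <vector>

// Hypothetical sketch of a lightweight completion callback: it accounts,
// recycles, dequeues or falls back, and schedules -- it never renders.
struct Backend {
    std::deque<int> readyFrames;        // frames the render side completed
    std::vector<int> recycledBuffers;   // buffers returned to the pool
    std::optional<int> lastScheduled;   // kept for stale-frame reuse
    int underruns = 0;
    long long nextScheduleTime = 0;
    long long frameDuration = 1;

    // Returns the buffer to schedule next (-1 stands in for black).
    int onFrameCompleted(int completedBuffer) {
        recycledBuffers.push_back(completedBuffer);  // recycle/release
        int next;
        if (!readyFrames.empty()) {                  // dequeue a ready frame
            next = readyFrames.front();
            readyFrames.pop_front();
        } else {                                     // apply underrun policy
            ++underruns;
            next = lastScheduled.value_or(-1);       // stale reuse, else black
        }
        lastScheduled = next;
        nextScheduleTime += frameDuration;           // schedule the next frame
        return next;
    }
};
```

The key property is that every branch is bounded work: no GL calls, no waiting on the render thread.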
## Target Lifecycle Model
Suggested backend states:
1. `Uninitialized`
2. `Discovering`
3. `Discovered`
4. `Configuring`
5. `Configured`
6. `Prerolling`
7. `Running`
8. `Degraded`
9. `Stopping`
10. `Stopped`
11. `Failed`
Suggested transition rules:
- `Uninitialized -> Discovering`
- `Discovering -> Discovered | Failed`
- `Discovered -> Configuring | Stopped`
- `Configuring -> Configured | Failed`
- `Configured -> Prerolling | Stopped`
- `Prerolling -> Running | Failed | Stopping`
- `Running -> Degraded | Stopping | Failed`
- `Degraded -> Running | Stopping | Failed`
- `Stopping -> Stopped`
The exact enum can change, but the lifecycle should become observable and testable.
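The transition table above can be encoded as a pure function, which is what makes it testable without hardware. A minimal sketch, assuming the suggested enum (names may change when Phase 7 lands):

```cpp
// Hypothetical lifecycle enum matching the suggested states above.
enum class BackendState {
    Uninitialized, Discovering, Discovered, Configuring, Configured,
    Prerolling, Running, Degraded, Stopping, Stopped, Failed
};

// Pure transition check: no DeckLink calls, so it is trivially unit-testable.
bool transitionAllowed(BackendState from, BackendState to) {
    using S = BackendState;
    switch (from) {
        case S::Uninitialized: return to == S::Discovering;
        case S::Discovering:   return to == S::Discovered || to == S::Failed;
        case S::Discovered:    return to == S::Configuring || to == S::Stopped;
        case S::Configuring:   return to == S::Configured || to == S::Failed;
        case S::Configured:    return to == S::Prerolling || to == S::Stopped;
        case S::Prerolling:    return to == S::Running || to == S::Failed
                                   || to == S::Stopping;
        case S::Running:       return to == S::Degraded || to == S::Stopping
                                   || to == S::Failed;
        case S::Degraded:      return to == S::Running || to == S::Stopping
                                   || to == S::Failed;
        case S::Stopping:      return to == S::Stopped;
        default:               return false;  // Stopped/Failed are terminal here
    }
}
```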
## Proposed Collaborators
### `VideoBackendStateMachine`
Pure or mostly pure lifecycle transition helper.
Responsibilities:
- validate state transitions
- produce transition observations
- track failure reasons
- keep start/stop/recovery behavior auditable
Non-responsibilities:
- DeckLink API calls
- rendering
- persistence
### `PlayoutPolicy`
Policy object for queue and timing behavior.
Expected fields:
- target preroll frames
- maximum ready frames
- minimum spare device buffers
- underrun behavior
- maximum catch-up frames
- adaptive headroom enabled/disabled
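One possible shape for the policy object, with illustrative field names and placeholder defaults (the real numbers should come from measurement, per the Output Queue Policy section):

```cpp
// Hypothetical PlayoutPolicy sketch; defaults are placeholders, not tuned values.
enum class UnderrunBehavior { ReuseLastCompleted, ReuseLastScheduled, Black };

struct PlayoutPolicy {
    int targetPrerollFrames = 3;
    int maxReadyFrames = 4;
    int minSpareDeviceBuffers = 1;
    UnderrunBehavior underrun = UnderrunBehavior::ReuseLastCompleted;
    int maxCatchUpFrames = 2;
    bool adaptiveHeadroom = false;

    // One derivation point for the device frame pool size: enough buffers to
    // cover preroll, queued ready frames, and the spare margin.
    int framePoolSize() const {
        return targetPrerollFrames + maxReadyFrames + minSpareDeviceBuffers;
    }
};
```

Deriving the pool size here is the point of the object: pool depth, preroll, and headroom stop being independent constants that can drift apart.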
### `RenderOutputQueue`
Bounded queue or ring for completed output frames.
Responsibilities:
- accept completed render outputs
- expose ready frames for scheduling
- track depth, drops, stale reuse, and underruns
- keep ownership/lifetime clear between render and backend
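A bounded, non-blocking queue with drop/underrun accounting could look like the following. `FrameHandle` is a stand-in for whatever ownership token the render/backend boundary settles on; the class name mirrors the proposal but is otherwise a sketch.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

using FrameHandle = int;  // placeholder for the real ownership token

class RenderOutputQueue {
public:
    explicit RenderOutputQueue(std::size_t maxDepth) : maxDepth_(maxDepth) {}

    // Producer (render thread): returns false and counts a drop when full,
    // so the render side never blocks the playout side.
    bool tryPush(FrameHandle f) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.size() >= maxDepth_) { ++drops_; return false; }
        q_.push_back(f);
        return true;
    }

    // Consumer (backend worker/callback): never blocks; an empty result is
    // an underrun the caller's policy must handle.
    std::optional<FrameHandle> tryPop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) { ++underruns_; return std::nullopt; }
        FrameHandle f = q_.front();
        q_.pop_front();
        return f;
    }

    std::size_t depth() const { std::lock_guard<std::mutex> l(m_); return q_.size(); }
    int drops() const { std::lock_guard<std::mutex> l(m_); return drops_; }
    int underruns() const { std::lock_guard<std::mutex> l(m_); return underruns_; }

private:
    mutable std::mutex m_;
    std::deque<FrameHandle> q_;
    std::size_t maxDepth_;
    int drops_ = 0;
    int underruns_ = 0;
};
```

Because both sides are non-blocking, the queue's counters double as the telemetry signal: depth, drops, and underruns fall out of normal operation.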
### `OutputFramePool`
Backend-owned device buffer pool.
Responsibilities:
- own DeckLink mutable frames
- expose available buffers for render/readback or scheduling
- recycle completed frames
- report spare-buffer depth
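A minimal pool sketch, using integer indices in place of real `IDeckLinkMutableVideoFrame` handles; the class name mirrors the proposal but everything else here is illustrative:

```cpp
#include <optional>
#include <vector>

// Hypothetical pool sketch; real buffers would wrap DeckLink mutable frames.
class OutputFramePool {
public:
    explicit OutputFramePool(int size) : inUse_(size, false) {}

    // Hand out an available buffer for render/readback or scheduling.
    std::optional<int> acquire() {
        for (int i = 0; i < static_cast<int>(inUse_.size()); ++i)
            if (!inUse_[i]) { inUse_[i] = true; return i; }
        return std::nullopt;  // pool exhausted: caller must back off
    }

    // Completion accounting returns the buffer to the pool.
    void recycle(int i) { inUse_[i] = false; }

    // Spare-buffer depth, reported to telemetry as a headroom signal.
    int spareDepth() const {
        int spare = 0;
        for (bool used : inUse_) if (!used) ++spare;
        return spare;
    }

private:
    std::vector<bool> inUse_;
};
```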
### `PlayoutController`
Coordinates policy, ready frames, device schedule times, and completion accounting.
Responsibilities:
- preroll frames
- schedule next frame
- handle late/drop/completed/flushed results
- apply underrun policy
- publish timing state
## Output Queue Policy
The initial output queue should be small and bounded.
Candidate defaults:
- target ready frames: 2-3
- max ready frames: 3-5
- underrun: reuse last completed frame if available, otherwise black
- late/drop: increase degraded counters and optionally increase headroom within limits
The exact numbers should be measured, but the policy should live in one place instead of being split across constants.
## Underrun Policy
When no fresh rendered frame is available, options are:
1. reuse newest completed frame
2. reuse last scheduled frame
3. schedule black/degraded frame
4. skip/catch up schedule time
Phase 7 should pick one default and make it visible in telemetry. Reusing the newest completed frame is often the best first policy for live visual continuity, but key/fill behavior may require careful testing.
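Making the choice a named, pure function keeps it both visible and testable. A sketch, with hypothetical names (`FrameRef`, `underrunFallback`); an empty result means "schedule black":

```cpp
#include <optional>

enum class UnderrunBehavior { ReuseLastCompleted, ReuseLastScheduled, Black };

struct FrameRef { int id; };

// Pure underrun decision: which frame to schedule when nothing fresh is ready.
std::optional<FrameRef> underrunFallback(UnderrunBehavior policy,
                                         std::optional<FrameRef> lastCompleted,
                                         std::optional<FrameRef> lastScheduled) {
    switch (policy) {
        case UnderrunBehavior::ReuseLastCompleted: return lastCompleted;
        case UnderrunBehavior::ReuseLastScheduled: return lastScheduled;
        case UnderrunBehavior::Black:              return std::nullopt;
    }
    return std::nullopt;
}
```

Note that `ReuseLastCompleted` can still yield black early in a session, before any frame has completed, which is why the fallback chain matters.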
## Migration Plan
### Step 1. Name Lifecycle States
Introduce backend state enum and transition reporting without changing scheduling behavior much.
Initial target:
- state changes are explicit
- invalid transitions are detectable
- tests cover allowed transitions
### Step 2. Create Playout Policy Object
Unify fixed constants and scheduler assumptions.
Initial target:
- frame pool size derives from policy
- preroll count derives from policy
- late/drop recovery reads policy
### Step 3. Add Ready Output Queue
Introduce a bounded queue for completed output frames.
Initial target:
- pure queue tests
- explicit depth/underrun metrics
- no DeckLink dependency in queue tests
### Step 4. Move Callback Toward Dequeue/Schedule
Stop producing frames directly in the completion callback path.
Transitional target:
- callback wakes/schedules a backend worker
- worker consumes ready frames
Final target:
- callback only records, recycles, dequeues, schedules
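The transitional wake/consume split can be sketched with a condition variable. All names here are hypothetical; the incrementing counter stands in for the real dequeue-and-schedule work the worker would do.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Transitional-step sketch: the DeckLink callback only notifies; a backend
// worker thread does the actual ready-frame consumption off-callback.
class PlayoutWorker {
public:
    void notifyCompletion() {  // called from the completion callback
        { std::lock_guard<std::mutex> l(m_); ++pending_; }
        cv_.notify_one();
    }
    void stop() {
        { std::lock_guard<std::mutex> l(m_); stopping_ = true; }
        cv_.notify_one();
    }
    // Worker loop: wait for a completion tick, then consume/schedule.
    void run() {
        std::unique_lock<std::mutex> l(m_);
        while (true) {
            cv_.wait(l, [&] { return pending_ > 0 || stopping_; });
            if (pending_ == 0 && stopping_) return;  // drain before exiting
            --pending_;
            ++framesScheduled_;  // stand-in for dequeue + schedule work
        }
    }
    int framesScheduled() const {
        std::lock_guard<std::mutex> l(m_);
        return framesScheduled_;
    }

private:
    mutable std::mutex m_;
    std::condition_variable cv_;
    int pending_ = 0;
    bool stopping_ = false;
    int framesScheduled_ = 0;
};
```

The predicate-based `wait` handles spurious wakeups, and `stop()` drains pending completions before the worker exits, which previews the drain-on-stop behavior the tests should cover.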
### Step 5. Make Render Produce Ahead
Teach render/output code to keep the ready queue filled to target headroom.
Initial target:
- render thread produces on demand until queue has target depth
- callback does not synchronously wait for fresh render
- stale/black fallback is explicit on underrun
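The "produce until target depth" rule reduces to a small pure function. A sketch, with hypothetical names, bounded by the queue's maximum depth:

```cpp
// Queue-fill pacing sketch: how many frames the render side should produce
// this cycle to reach the policy's target headroom without exceeding bounds.
int framesToProduce(int currentDepth, int targetDepth, int maxDepth) {
    if (currentDepth >= targetDepth) return 0;  // enough headroom already
    int want = targetDepth - currentDepth;
    int room = maxDepth - currentDepth;         // never exceed the bound
    return want < room ? want : room;
}
```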
### Step 6. Replace Fixed Late/Drop Recovery
Replace fixed `+2` schedule-index recovery with measured lag/headroom accounting.
Initial target:
- track scheduled index, completed index, queue depth, late streak, drop streak
- recovery decisions use measured lag
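Measured catch-up can be expressed as: skip only as far as the observed lag requires, capped by policy. A sketch in frame-index units, with hypothetical names:

```cpp
#include <algorithm>

// Measured recovery sketch: derive the next schedule index from observed
// lag instead of a fixed +2 skip. All names are illustrative.
long long nextScheduleIndex(long long plannedIndex,
                            long long hardwareIndexNow,  // device clock, frames
                            int maxCatchUpFrames) {
    long long lag = hardwareIndexNow - plannedIndex;     // how far behind
    if (lag <= 0) return plannedIndex;                   // on time: no skip
    long long skip = std::min<long long>(lag, maxCatchUpFrames);
    return plannedIndex + skip;                          // bounded catch-up
}
```

An on-time stream never skips, a one-frame slip recovers by exactly one frame, and a large stall is capped so recovery stays smooth rather than jumping arbitrarily far ahead.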
### Step 7. Route Backend Health Structurally
Publish backend lifecycle, queue depth, underrun, late/drop, and degraded-state observations through `HealthTelemetry`.
## Testing Strategy
Recommended tests:
- allowed lifecycle transitions pass
- invalid lifecycle transitions fail
- playout policy derives frame pool/preroll sizes consistently
- output queue preserves ordering
- bounded output queue rejects/drops according to policy
- underrun reuses last frame or black according to policy
- late/drop accounting updates degraded state
- scheduler catch-up uses measured lag, not fixed skip
- stop drains/recycles device-frame ownership in pure fakes
Useful homes:
- `VideoPlayoutSchedulerTests` for scheduler evolution
- `VideoIODeviceFakeTests` for fake backend lifecycle
- a new `VideoBackendStateMachineTests`
- a new `RenderOutputQueueTests`
## Risks
### Latency Risk
More headroom means more latency. Phase 7 should make latency a visible policy choice.
### Buffer Lifetime Risk
Render and backend will share ownership boundaries around output buffers. Frame ownership must be explicit to avoid reuse while hardware still owns a frame.
### Underrun Policy Risk
Reusing stale frames can be visually acceptable, but wrong key/fill behavior may be worse than black. Test with real output.
### Callback Thread Risk
Even after decoupling render, callback work must stay small and bounded.
### Scope Risk
Backend lifecycle and playout queue are related, but either can grow large. Implement in small, testable slices.
## Phase 7 Exit Criteria
Phase 7 can be considered complete once the project can say:
- [ ] backend lifecycle states and transitions are explicit
- [ ] playout policy owns preroll, pool size, headroom, and underrun behavior
- [ ] output callbacks no longer synchronously wait for render production
- [ ] render produces completed output frames into a bounded queue
- [ ] underrun behavior is explicit and observable
- [ ] late/drop recovery is measured rather than fixed skip-only
- [ ] backend health reports lifecycle, queue, underrun, late, and dropped state
- [ ] queue/lifecycle/scheduler behavior has non-DeckLink tests
## Open Questions
- What should the default ready-frame depth be at 30fps and 60fps?
- Should underrun reuse last completed, last scheduled, or black?
- Should output queue depth be user-configurable?
- Should render cadence be driven by backend demand, a timer, or queue-fill pressure?
- How should external keying influence stale-frame/black fallback?
- Should input and output lifecycle states be separate endpoints under one backend shell?
## Short Version
Phase 7 should stop making DeckLink callbacks wait for render.
Render produces ahead into a bounded queue. The backend consumes ready frames according to explicit lifecycle and playout policy. Queue depth, underruns, late frames, dropped frames, and degraded states become measured and visible.