DeckLink / OpenGL Lessons Learned

This document summarizes the practical lessons from the Phase 3-7.7 refactor work, especially the DeckLink playout timing experiments.

It is intentionally broader than the phase design docs. The goal is to preserve what we now know about the system so future architecture choices start from evidence instead of rediscovering the same constraints.

High-Level Lesson

The application is not just a renderer with a video output attached.

It is a real-time playout system with several independent clocks:

  • the selected output cadence, for example 59.94 fps
  • the GPU render/readback timeline
  • the DeckLink scheduled playback clock
  • the Windows thread scheduler
  • the input capture callback cadence
  • the preview/window message loop
  • the runtime/control update cadence

Stable playback depends on assigning one owner to each timing domain and keeping those domains loosely coupled.

What Worked

Named State Contracts Helped

RenderFrameInput and RenderFrameState made the render path easier to reason about.

Before that, frame rendering depended on scattered choices about snapshots, cache state, layer state, input source state, and runtime service state. Naming the frame contract made it possible to move logic out of RenderEngine and toward explicit frame construction.

Lesson:

  • keep frame inputs explicit
  • keep render-frame state immutable for the duration of a frame
  • avoid making the renderer ask global systems which state it should use mid-frame
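
As a sketch, an explicit frame contract can be as simple as two structs. The fields below are illustrative, not the actual RenderFrameInput/RenderFrameState members:

```cpp
#include <chrono>
#include <cstdint>

struct RenderFrameInput {
    std::chrono::steady_clock::time_point frameDeadline; // when this frame must be ready
    uint64_t frameIndex;          // monotonic output frame number
    bool     inputTextureFresh;   // snapshot of input-source state at frame start
};

struct RenderFrameState {
    const RenderFrameInput input; // captured once, never mutated mid-frame
    uint32_t targetFbo;           // render target chosen at frame construction
};

// The renderer receives a fully built RenderFrameState and never asks
// global systems for state after this point.
void renderFrame(const RenderFrameState& frame);
```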

Render-Thread Ownership Helped

Moving GL work behind a render-thread boundary reduced the risk of wrong-thread GL access and made ownership clearer.

The current render thread is still shared by output render, input upload, preview, screenshot, resize, and reset work, so it is not yet a pure output cadence thread. But the ownership direction is right.

Lesson:

  • GL context ownership should be explicit
  • public methods should enqueue or request work
  • render-thread methods should own GL bodies
  • synchronous calls should be reserved for places that genuinely need a result
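
A sketch of the enqueue-or-request shape, with hypothetical names. The public API enqueues closures; only the private run loop, which owns the GL context, executes them:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <type_traits>

class RenderThread {
public:
    RenderThread() : thread_([this] { run(); }) {}
    ~RenderThread() {
        { std::lock_guard<std::mutex> lock(mutex_); stop_ = true; }
        cv_.notify_one();
        thread_.join();
    }

    // Public API: enqueue GL work; callers never touch the context directly.
    void post(std::function<void()> glWork) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push_back(std::move(glWork)); }
        cv_.notify_one();
    }

    // Synchronous variant, reserved for callers that genuinely need a result.
    template <typename F>
    auto call(F glWork) {
        using R = std::invoke_result_t<F>;
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(glWork));
        auto result = task->get_future();
        post([task] { (*task)(); });
        return result.get();
    }

private:
    void run() {
        // makeContextCurrent();  // the GL context lives on this thread only
        for (;;) {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty()) return;
                work = std::move(queue_.front());
                queue_.pop_front();
            }
            work();  // GL bodies run here, and only here
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool stop_ = false;
    std::thread thread_;
};
```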

Background Persistence Was Worth It

Moving persistence away from hot render/control paths reduced incidental latency risk and made state writes easier to reason about.

Lesson:

  • runtime/control persistence should not sit on output render timing
  • shutdown flushing is fine, steady-state blocking is not
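
A sketch of the hot-path/background split, with hypothetical names. Steady-state saves are a locked string swap; only the background thread and the shutdown flush touch the filesystem:

```cpp
#include <atomic>
#include <chrono>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

class BackgroundPersistence {
public:
    explicit BackgroundPersistence(std::string path)
        : path_(std::move(path)), worker_([this] { run(); }) {}

    ~BackgroundPersistence() {
        running_.store(false);
        worker_.join();
        writeLatest();   // shutdown flush: blocking here is fine
    }

    // Hot path: store the serialized snapshot, never touch the filesystem.
    void save(std::string serializedState) {
        std::lock_guard<std::mutex> lock(mutex_);
        latest_ = std::move(serializedState);
        dirty_ = true;
    }

private:
    void run() {
        while (running_.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(250));
            writeLatest();
        }
    }
    void writeLatest() {
        std::string snapshot;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!dirty_) return;
            snapshot = latest_;
            dirty_ = false;
        }
        std::ofstream(path_, std::ios::trunc) << snapshot;  // off the hot path
    }

    std::string path_;
    std::mutex mutex_;
    std::string latest_;
    bool dirty_ = false;
    std::atomic<bool> running_{true};
    std::thread worker_;
};
```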

Lifecycle State Was Worth It

The backend lifecycle model gave us better failure and shutdown vocabulary.

This became important once startup stopped being a single Start() call and became:

  • prepare output schedule
  • start render cadence
  • warm up real frames
  • start input streams
  • start scheduled playback

Lesson:

  • playout startup needs phases
  • degradation should be explicit
  • shutdown order should be deliberate and testable
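
A sketch of those phases as an explicit state, with hypothetical names; the real lifecycle model carries more states for degradation and shutdown:

```cpp
// Playout startup is a sequence of phases, not one Start() call.
enum class PlayoutPhase {
    Idle,
    SchedulePrepared,          // DeckLink output schedule configured
    RenderCadenceRunning,      // time-driven render loop ticking
    WarmupRendering,           // real frames rendered at normal cadence
    InputStreamsRunning,
    ScheduledPlaybackRunning,
    Degraded,                  // explicit, visible failure state
    ShuttingDown,
};
```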

What Did Not Work

Completion-Driven Rendering Was Too Fragile

Rendering on or near DeckLink completion can hit the target frame rate on average, but it leaves no headroom.

When the callback asks for a frame just-in-time, any small delay in render, readback, scheduling, or Windows wake timing becomes visible as a buffer dip or stutter.

Lesson:

  • DeckLink completion should release scheduled resources and wake scheduling
  • it should not render
  • it should not decide visual fallback policy in steady state
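
A sketch of a completion callback that follows this rule, against the DeckLink SDK's IDeckLinkVideoOutputCallback interface. FrameStore and PlayoutScheduler are app-side assumptions, and the IUnknown plumbing (and Windows STDMETHODCALLTYPE macros) is omitted:

```cpp
#include "DeckLinkAPI.h"  // DeckLink SDK COM interfaces

// App-side types assumed by this sketch (declarations only).
struct FrameStore { void releaseScheduledSlot(IDeckLinkVideoFrame* frame); };
struct PlayoutScheduler {
    void recordCompletion(BMDOutputFrameCompletionResult result);
    void wake();
};

class CompletionCallback : public IDeckLinkVideoOutputCallback {
public:
    CompletionCallback(FrameStore& store, PlayoutScheduler& scheduler)
        : store_(store), scheduler_(scheduler) {}

    HRESULT ScheduledFrameCompleted(IDeckLinkVideoFrame* frame,
                                    BMDOutputFrameCompletionResult result) override {
        store_.releaseScheduledSlot(frame);   // release: slot may now be recycled
        scheduler_.recordCompletion(result);  // report: late/dropped/flushed counts
        scheduler_.wake();                    // wake: the scheduler decides what to do
        return S_OK;                          // no rendering, no fallback policy here
    }

    HRESULT ScheduledPlaybackHasStopped() override { return S_OK; }

    // IUnknown (QueryInterface/AddRef/Release) omitted for brevity.

private:
    FrameStore& store_;
    PlayoutScheduler& scheduler_;
};
```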

Black Fallback Hid The Real Timing Problem

Scheduling a black frame whenever the app had no ready frame made the pipeline appear to keep moving while producing visible black flicker.

It also made diagnosis harder, because DeckLink kept reporting scheduled frames even while the app was visibly failing.

Lesson:

  • black is a startup/error/degraded-state policy, not normal steady-state recovery
  • steady-state underruns should be measured as timing failures

Synthetic Schedule Lead Was Misleading

The synthetic scheduled/completed index could report a large buffer while DeckLink still showed low actual device buffer depth.

Real DeckLink GetBufferedVideoFrameCount() telemetry was necessary to separate:

  • app-owned scheduled slots
  • synthetic schedule lead
  • actual hardware/device buffer depth

Lesson:

  • measure actual device buffer depth
  • keep synthetic counters only as diagnostics
  • do not infer device health from internal stream indexes alone
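
A sketch of sampling all three numbers side by side. deckLinkOutput is an IDeckLinkOutput*, GetBufferedVideoFrameCount() is the real SDK call, and the telemetry fields are illustrative:

```cpp
#include "DeckLinkAPI.h"
#include <cstdint>

struct PlayoutDepthTelemetry {              // illustrative fields
    uint32_t actualDeviceBufferedFrames = 0;
    uint64_t syntheticScheduleLead = 0;
};

void samplePlayoutDepth(IDeckLinkOutput* deckLinkOutput,
                        uint64_t scheduledIndex, uint64_t completedIndex,
                        PlayoutDepthTelemetry& out) {
    uint32_t deviceBuffered = 0;            // ground truth from the device
    if (deckLinkOutput->GetBufferedVideoFrameCount(&deviceBuffered) == S_OK)
        out.actualDeviceBufferedFrames = deviceBuffered;
    // Keep the synthetic counter, but only as a diagnostic alongside it.
    out.syntheticScheduleLead = scheduledIndex - completedIndex;
}
```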

More Buffer Is Not Automatically Smoother

Increasing DeckLink scheduled frames sometimes made the reported device buffer look healthier while visible motion still stuttered.

The problem was not only "how many frames are scheduled"; it was also whether the scheduled frames represented a stable render cadence.

Lesson:

  • buffer depth absorbs jitter, but it cannot fix bad cadence ownership
  • a full buffer of poorly timed or repeated frames can still look wrong

Speed-Up Catch-Up Was The Wrong Instinct

Letting the producer sprint to refill the buffer created new timing artifacts.

The render side should behave like a stable game/render loop: render at the selected cadence, record lateness, and only skip ticks when render/GPU work itself overruns.

Lesson:

  • the render thread should not render faster because DeckLink is empty
  • buffer drain is a failure signal, not a sprint signal
  • warmup should fill buffers before playback starts
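
A sketch of a time-driven tick that records lateness instead of sprinting, assuming app-side running(), renderOneOutputFrame(), and recordLateness() hooks:

```cpp
#include <chrono>
#include <thread>

bool running();                                       // app-side stubs
void renderOneOutputFrame();
void recordLateness(std::chrono::steady_clock::duration late);

void renderCadenceLoop() {
    using clock = std::chrono::steady_clock;
    const auto period =                               // 59.94 fps = 60000/1001
        std::chrono::nanoseconds(1'000'000'000ull * 1001 / 60000);
    auto next = clock::now();
    while (running()) {
        renderOneOutputFrame();                       // exactly one frame per tick
        next += period;
        const auto now = clock::now();
        if (now > next) {                             // render/GPU work overran
            recordLateness(now - next);               // telemetry, not a sprint
            while (next < now) next += period;        // skip ticks, never catch up
        }
        std::this_thread::sleep_until(next);          // a drained DeckLink buffer
    }                                                 // never shortens this sleep
}
```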

GPU Readback Lessons

The Original Readback Path Caused The Major Collapse

Early Phase 7.5 telemetry showed glReadPixels(..., nullptr) into the PBO costing roughly 8-14 ms on representative samples. That was enough to collapse ready depth and cause long freezes.

Direct synchronous readback was worse on the sampled machine.

Cached-output mode, while visually invalid for live output, immediately recovered timing. That proved ongoing GPU-to-CPU transfer was the major cost in that version of the path.

Lesson:

  • isolate readback cost from render cost
  • use intentionally invalid cached-output experiments when diagnosing throughput
  • do not assume async PBO is actually cheap on every format/driver path
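
A sketch of instrumenting the two readback stages separately so queue cost cannot hide inside render cost. The GL calls are standard, GLEW is assumed for loading, and the metric names mirror the telemetry discussed above:

```cpp
#include <GL/glew.h>
#include <chrono>
#include <cstdio>
#include <cstring>

static double msSince(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

// Stage 1: queue the transfer into the PBO and time only the queue call.
GLsync queueReadback(GLuint pbo, int width, int height) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    const auto t0 = std::chrono::steady_clock::now();
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
    std::printf("asyncQueueReadPixelsMs=%.2f\n", msSince(t0)); // 8-14 ms = not truly async
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return fence;
}

// Stage 2, ideally a frame later: time fence wait, map, and copy separately.
void mapReadback(GLuint pbo, GLsync fence, size_t byteCount, void* slot) {
    const auto t0 = std::chrono::steady_clock::now();
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 5'000'000); // 5 ms budget
    glDeleteSync(fence);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    if (void* src = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, byteCount, GL_MAP_READ_BIT)) {
        std::memcpy(slot, src, byteCount);   // into the system-memory slot
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    std::printf("readbackFenceMapCopyMs=%.2f\n", msSince(t0));
}
```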

BGRA8 Packing Changed The Problem

Changing the output path so readback matched the DeckLink BGRA8 format made asyncQueueReadPixelsMs drop dramatically in sampled runs.

Long pauses disappeared and the remaining issue became short stutters/cadence gaps.

Lesson:

  • output/readback format matters
  • avoid format conversions on the readback path when possible
  • BGRA8 is a good current format target for experiments
  • v210/YUV packing can be deferred until cadence is stable
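
A sketch of the format-matched queue step. GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV describes the same packed BGRA8 layout that DeckLink's bmdFormat8BitBGRA expects, which is commonly the driver's conversion-free path; treat the exact fast path as driver-dependent:

```cpp
#include <GL/glew.h>

// Queue a readback whose layout matches the DeckLink frame exactly:
// an RGBA8 FBO read back as packed BGRA, scheduled as bmdFormat8BitBGRA.
void queueBgra8Readback(GLuint pbo, int width, int height) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glPixelStorei(GL_PACK_ALIGNMENT, 4);               // rowBytes = width * 4
    glReadPixels(0, 0, width, height,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, // matches BGRA8 in memory
                 nullptr);                             // offset 0 into the bound PBO
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```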

SDK Fast-Transfer Paths Are Not Universal

The SDK OpenGL fast-transfer path depends on hardware/extension support that was not present on the RTX 4060 Ti test machine:

  • NVIDIA DVP path was gated around Quadro-style support
  • GL_AMD_pinned_memory was not exposed

Lesson:

  • SDK fast-transfer samples are useful references but not a universal fix
  • unsupported fast-transfer code should not be central to the architecture
  • the default path must work with ordinary consumer GPUs
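
A sketch of gating on actual support rather than assuming it, using core-profile extension enumeration:

```cpp
#include <GL/glew.h>
#include <cstring>

bool hasGlExtension(const char* name) {
    GLint count = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &count);
    for (GLint i = 0; i < count; ++i) {
        const char* ext =
            reinterpret_cast<const char*>(glGetStringi(GL_EXTENSIONS, i));
        if (ext && std::strcmp(ext, name) == 0) return true;
    }
    return false;
}

// Fast-transfer is opt-in; the default path must work without it.
const bool usePinnedMemoryPath = hasGlExtension("GL_AMD_pinned_memory");
```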

System-Memory Slots Are The Right Ownership Model

Using CreateVideoFrameWithBuffer() lets DeckLink schedule frames backed by our system-memory slots.

That is the right ownership model for this app:

  • render/readback writes into a slot
  • DeckLink schedules a frame that references that slot
  • the slot is protected until DeckLink completion

Lesson:

  • system-memory slots are the contract between render and playout
  • scheduled slots must not be recycled early
  • completed-but-unscheduled slots can be latest-N cache entries
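
A sketch of that ownership model as explicit slot states; the fields are illustrative:

```cpp
#include <cstdint>
#include <vector>

enum class SlotState {
    Free,       // available for the renderer
    Rendering,  // render/readback is writing into the slot
    Completed,  // readback done; latest-N cache entry, recyclable
    Scheduled,  // referenced by a DeckLink frame; must not be recycled
};

struct FrameSlot {
    std::vector<uint8_t> pixels;       // system-memory BGRA8 bytes
    SlotState state = SlotState::Free;
    uint64_t  frameIndex = 0;          // which output frame the bytes hold
};
```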

Startup Needs Real Preroll

Starting scheduled playback before real rendered frames exist creates avoidable startup fragility.

The better startup shape is:

  • prepare the DeckLink schedule
  • start render cadence
  • render warmup frames at normal cadence
  • schedule those frames as preroll
  • start DeckLink scheduled playback

Lesson:

  • do not use black preroll as the normal startup path
  • do not render faster during warmup
  • if warmup cannot fill in a bounded time, fail/degrade visibly
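
A sketch of that startup shape against the real DeckLink scheduling calls. renderWarmupFrame() and enterDegradedState() are app-side assumptions, and frameDuration/timeScale come from the selected display mode:

```cpp
#include "DeckLinkAPI.h"

IDeckLinkVideoFrame* renderWarmupFrame();  // app-side: one real frame, normal cadence
void enterDegradedState();                 // app-side: explicit, visible failure

bool startPlayout(IDeckLinkOutput* output,
                  BMDTimeValue frameDuration, BMDTimeScale timeScale) {
    const int kPrerollFrames = 4;          // illustrative preroll depth
    for (int i = 0; i < kPrerollFrames; ++i) {
        IDeckLinkVideoFrame* frame = renderWarmupFrame();
        if (frame == nullptr) {            // warmup could not fill in bounded time
            enterDegradedState();
            return false;
        }
        output->ScheduleVideoFrame(frame, i * frameDuration, frameDuration, timeScale);
    }
    // Start the hardware clock only after real frames are queued.
    return output->StartScheduledPlayback(0, timeScale, 1.0) == S_OK;
}
```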

Buffering Lessons

There Are Two Different Buffers

The app has at least two important frame stores:

  • system-memory completed/latest-N frames
  • DeckLink scheduled/device buffer

They have different ownership rules.

Completed-but-unscheduled frames are disposable if a newer frame is available and cadence needs the slot.

Scheduled frames are not disposable because DeckLink may still read them.

Lesson:

  • latest-N completed frames are a cache
  • scheduled frames are owned by DeckLink until completion
  • keep metrics for both

Consume-Before-Render Is The Wrong Model For Completed Frames

If the render cadence waits for completed frames to be consumed, DeckLink timing can indirectly slow the renderer.

That couples the clocks again.

Lesson:

  • render cadence should keep rendering at selected cadence
  • if completed cache is full, recycle/drop the oldest unscheduled completed frame
  • only scheduled/in-flight saturation should prevent rendering to a safe slot
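
A sketch of slot acquisition under that rule, building on the slot-state sketch above; it assumes the caller holds the frame-store lock:

```cpp
#include <vector>

FrameSlot* acquireRenderSlot(std::vector<FrameSlot>& slots) {
    for (auto& s : slots)
        if (s.state == SlotState::Free) return &s;    // prefer a free slot
    FrameSlot* oldest = nullptr;                       // else recycle the oldest
    for (auto& s : slots)                              // completed-but-unscheduled
        if (s.state == SlotState::Completed &&
            (oldest == nullptr || s.frameIndex < oldest->frameIndex))
            oldest = &s;
    // nullptr only when every slot is Rendering or Scheduled; that saturation
    // is the one legitimate reason to skip a render tick.
    return oldest;
}
```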

Render Thread Lessons

The Current Render Thread Is Still Shared

The GL render thread currently handles:

  • output rendering
  • input upload
  • preview present
  • screenshot capture
  • render reset commands
  • shader/resource operations

Output render can therefore be delayed by queued or inline work.

Lesson:

  • "one GL thread" is not the same as "one output cadence thread"
  • output render should become the highest-priority GL operation
  • non-output GL work needs budgets, coalescing, or deferral

Input Upload Is A Suspect Timing Coupling

Output render currently processes input upload work immediately before rendering the output frame.

That keeps input fresh but can steal time from the exact frame we are trying to render on cadence.

Lesson:

  • measure input upload count and time immediately before output render
  • test policies such as one_before_output or skip_before_output
  • prefer latest-input semantics over draining every pending upload

CPU Input Conversion Can Be Worse Than Input Copy

When DeckLink input only exposed UYVY8 on the test machine, an initial CPU UYVY-to-BGRA conversion in the input callback cost roughly one full frame interval on sampled runs and reduced input cadence dramatically.

Moving the input edge to raw UYVY8 capture changed the ownership:

  • DeckLink callback copies raw supported input bytes into InputFrameMailbox
  • the mailbox keeps latest-frame semantics and uses a contiguous copy when row strides match
  • the render thread uploads/decodes UYVY8 into the shader-visible gVideoInput texture
  • runtime shaders continue to see decoded input, not packed capture bytes

Lesson:

  • keep input callbacks as capture/copy edges
  • keep GL decode/upload in the render-owned path
  • measure input copy, upload, and decode separately
  • do not hide expensive format conversion inside the DeckLink callback
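
A sketch of the latest-frame mailbox semantics. InputFrameMailbox is the app's name; this minimal version assumes matching row strides, so one contiguous copy:

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <vector>

class InputFrameMailbox {
public:
    // Capture edge (DeckLink input callback thread): copy raw UYVY8 bytes.
    void publish(const uint8_t* data, size_t byteCount) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.assign(data, data + byteCount);  // overwrite: latest wins
        fresh_ = true;
    }

    // Render thread: take the newest frame, if any; older frames are gone.
    std::optional<std::vector<uint8_t>> takeLatest() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!fresh_) return std::nullopt;
        fresh_ = false;
        return std::move(pending_);  // caller decodes/uploads to gVideoInput
    }

private:
    std::mutex mutex_;
    std::vector<uint8_t> pending_;
    bool fresh_ = false;
};
```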

Preview And Screenshot Must Stay Secondary

Preview is useful, but DeckLink output is the real-time path.

Screenshot and preview share GL resources and can block or queue work on the same render thread.

Lesson:

  • preview should be skipped when output is under pressure
  • screenshot capture should be treated as disruptive unless proven otherwise
  • forced preview/screenshot should be visible in telemetry

Telemetry Lessons

The useful telemetry has been the kind that separates domains:

  • output render queue wait
  • render/draw time
  • readback queue time
  • readback fence/map/copy time
  • app ready/completed queue depth
  • system-memory free/rendering/completed/scheduled counts
  • actual DeckLink buffered-frame count
  • DeckLink schedule-call time/failures
  • late/drop completion counts

Lesson:

  • averages are not enough
  • timing spikes matter more than steady low values
  • count ownership states, not just queue depth
  • keep experiment logs short and evidence-based

Current Architectural Direction

The current direction is still sound:

Render cadence loop
  • renders at the selected output cadence
  • writes latest-N completed system-memory frames
  • never sprints to refill DeckLink

Frame store
  • owns free / rendering / completed / scheduled slots
  • recycles unscheduled completed frames when needed
  • protects scheduled frames until completion

DeckLink playout scheduler
  • consumes completed frames
  • tops up the actual device buffer
  • never renders

Completion callback
  • releases scheduled slots
  • records the completion result
  • wakes the scheduler

Rewrite Lesson

A full rewrite is not obviously the right next move.

The current repo now contains:

  • working runtime/control architecture
  • useful phase docs
  • non-GL tests around key state machines
  • real telemetry
  • a clearer understanding of DeckLink and OpenGL timing

The better next step is likely a contained "V2 spine" inside the current app:

  • harden the render cadence loop
  • harden the frame store
  • separate DeckLink scheduling
  • demote preview/screenshot/input upload below output cadence
  • delete old compatibility branches as they become unnecessary

A full rewrite becomes attractive only if the current GL ownership model cannot be made deterministic without excessive surgery, or if the project switches rendering API.

Practical Rules Going Forward

  • One timing authority per domain.
  • Render cadence is time-driven, not completion-driven.
  • DeckLink scheduling is device-buffer-driven, not render-driven.
  • Completion callbacks release and report; they do not render.
  • System-memory completed frames are latest-N cache entries.
  • Scheduled frames are protected until DeckLink completion.
  • Startup uses real rendered warmup/preroll.
  • Black fallback is degraded/error behavior, not steady-state behavior.
  • Output render has priority over preview, screenshot, and bulk input upload.
  • Measure before adding recovery branches.