DeckLink / OpenGL Lessons Learned

This document summarizes the practical lessons from the Phase 3-7.7 refactor work, especially the DeckLink playout timing experiments.

It is intentionally broader than the phase design docs. The goal is to preserve what we now know about the system so future architecture choices start from evidence instead of rediscovering the same constraints.

High-Level Lesson

The application is not just a renderer with a video output attached.

It is a real-time playout system with several independent clocks:

  • the selected output cadence, for example 59.94 fps
  • the GPU render/readback timeline
  • the DeckLink scheduled playback clock
  • the Windows thread scheduler
  • the input capture callback cadence
  • the preview/window message loop
  • the runtime/control update cadence

Stable playback depends on assigning one owner to each timing domain and keeping those domains loosely coupled.

What Worked

Named State Contracts Helped

RenderFrameInput and RenderFrameState made the render path easier to reason about.

Before that, frame rendering depended on scattered choices about snapshots, cache state, layer state, input source state, and runtime service state. Naming the frame contract made it possible to move logic out of RenderEngine and toward explicit frame construction.

Lesson:

  • keep frame inputs explicit
  • keep render-frame state immutable for the duration of a frame
  • avoid making the renderer ask global systems which state it should use mid-frame
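
As a sketch, an explicit frame contract can be as simple as two structs. The fields below are illustrative, not the actual RenderFrameInput/RenderFrameState members:

```cpp
#include <chrono>
#include <cstdint>

struct RenderFrameInput {
    std::chrono::steady_clock::time_point frameDeadline; // when this frame must be ready
    uint64_t frameIndex;          // monotonic output frame number
    bool     inputTextureFresh;   // snapshot of input-source state at frame start
};

struct RenderFrameState {
    const RenderFrameInput input; // captured once, never mutated mid-frame
    uint32_t targetFbo;           // render target chosen at frame construction
};

// The renderer receives a fully built RenderFrameState and never asks
// global systems for state after this point.
void renderFrame(const RenderFrameState& frame);
```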

Render-Thread Ownership Helped

Moving GL work behind a render-thread boundary reduced the risk of wrong-thread GL access and made ownership clearer.

The current render thread is still shared by output render, input upload, preview, screenshot, resize, and reset work, so it is not yet a pure output cadence thread. But the ownership direction is right.

Lesson:

  • GL context ownership should be explicit
  • public methods should enqueue or request work
  • render-thread methods should own GL bodies
  • synchronous calls should be reserved for places that genuinely need a result
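
A sketch of the enqueue-or-request shape, with hypothetical names. The public API enqueues closures; only the private run loop, which owns the GL context, executes them:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <type_traits>

class RenderThread {
public:
    RenderThread() : thread_([this] { run(); }) {}
    ~RenderThread() {
        { std::lock_guard<std::mutex> lock(mutex_); stop_ = true; }
        cv_.notify_one();
        thread_.join();
    }

    // Public API: enqueue GL work; callers never touch the context directly.
    void post(std::function<void()> glWork) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push_back(std::move(glWork)); }
        cv_.notify_one();
    }

    // Synchronous variant, reserved for callers that genuinely need a result.
    template <typename F>
    auto call(F glWork) {
        using R = std::invoke_result_t<F>;
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(glWork));
        auto result = task->get_future();
        post([task] { (*task)(); });
        return result.get();
    }

private:
    void run() {
        // makeContextCurrent();  // the GL context lives on this thread only
        for (;;) {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty()) return;
                work = std::move(queue_.front());
                queue_.pop_front();
            }
            work();  // GL bodies run here, and only here
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool stop_ = false;
    std::thread thread_;
};
```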

Background Persistence Was Worth It

Moving persistence away from hot render/control paths reduced incidental latency risk and made state writes easier to reason about.

Lesson:

  • runtime/control persistence should not sit on output render timing
  • shutdown flushing is fine, steady-state blocking is not
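
A sketch of the hot-path/background split, with hypothetical names. Steady-state saves are a locked string swap; only the background thread and the shutdown flush touch the filesystem:

```cpp
#include <atomic>
#include <chrono>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

class BackgroundPersistence {
public:
    explicit BackgroundPersistence(std::string path)
        : path_(std::move(path)), worker_([this] { run(); }) {}

    ~BackgroundPersistence() {
        running_.store(false);
        worker_.join();
        writeLatest();   // shutdown flush: blocking here is fine
    }

    // Hot path: store the serialized snapshot, never touch the filesystem.
    void save(std::string serializedState) {
        std::lock_guard<std::mutex> lock(mutex_);
        latest_ = std::move(serializedState);
        dirty_ = true;
    }

private:
    void run() {
        while (running_.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(250));
            writeLatest();
        }
    }
    void writeLatest() {
        std::string snapshot;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!dirty_) return;
            snapshot = latest_;
            dirty_ = false;
        }
        std::ofstream(path_, std::ios::trunc) << snapshot;  // off the hot path
    }

    std::string path_;
    std::mutex mutex_;
    std::string latest_;
    bool dirty_ = false;
    std::atomic<bool> running_{true};
    std::thread worker_;
};
```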

Lifecycle State Was Worth It

The backend lifecycle model gave us better failure and shutdown vocabulary.

This became important once startup stopped being a single Start() call and became:

  • prepare output schedule
  • start render cadence
  • warm up real frames
  • start input streams
  • start scheduled playback

Lesson:

  • playout startup needs phases
  • degradation should be explicit
  • shutdown order should be deliberate and testable
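
A sketch of those phases as an explicit state, with hypothetical names; the real lifecycle model carries more states for degradation and shutdown:

```cpp
// Playout startup is a sequence of phases, not one Start() call.
enum class PlayoutPhase {
    Idle,
    SchedulePrepared,          // DeckLink output schedule configured
    RenderCadenceRunning,      // time-driven render loop ticking
    WarmupRendering,           // real frames rendered at normal cadence
    InputStreamsRunning,
    ScheduledPlaybackRunning,
    Degraded,                  // explicit, visible failure state
    ShuttingDown,
};
```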

What Did Not Work

Completion-Driven Rendering Was Too Fragile

Rendering on or near DeckLink completion can hit the target frame rate on average, but it leaves no headroom.

When the callback asks for a frame just-in-time, any small delay in render, readback, scheduling, or Windows wake timing becomes visible as a buffer dip or stutter.

Lesson:

  • DeckLink completion should release scheduled resources and wake scheduling
  • it should not render
  • it should not decide visual fallback policy in steady state
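
A sketch of a completion callback that follows this rule, against the DeckLink SDK's IDeckLinkVideoOutputCallback interface. FrameStore and PlayoutScheduler are app-side assumptions, and the IUnknown plumbing (and Windows STDMETHODCALLTYPE macros) is omitted:

```cpp
#include "DeckLinkAPI.h"  // DeckLink SDK COM interfaces

// App-side types assumed by this sketch (declarations only).
struct FrameStore { void releaseScheduledSlot(IDeckLinkVideoFrame* frame); };
struct PlayoutScheduler {
    void recordCompletion(BMDOutputFrameCompletionResult result);
    void wake();
};

class CompletionCallback : public IDeckLinkVideoOutputCallback {
public:
    CompletionCallback(FrameStore& store, PlayoutScheduler& scheduler)
        : store_(store), scheduler_(scheduler) {}

    HRESULT ScheduledFrameCompleted(IDeckLinkVideoFrame* frame,
                                    BMDOutputFrameCompletionResult result) override {
        store_.releaseScheduledSlot(frame);   // release: slot may now be recycled
        scheduler_.recordCompletion(result);  // report: late/dropped/flushed counts
        scheduler_.wake();                    // wake: the scheduler decides what to do
        return S_OK;                          // no rendering, no fallback policy here
    }

    HRESULT ScheduledPlaybackHasStopped() override { return S_OK; }

    // IUnknown (QueryInterface/AddRef/Release) omitted for brevity.

private:
    FrameStore& store_;
    PlayoutScheduler& scheduler_;
};
```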

Black Fallback Hid The Real Timing Problem

Scheduling a black frame whenever the app had no ready frame made the pipeline appear to keep moving while producing visible black flicker.

It also made diagnosis harder, because DeckLink kept reporting scheduled frames even while the app was visibly failing.

Lesson:

  • black is a startup/error/degraded-state policy, not normal steady-state recovery
  • steady-state underruns should be measured as timing failures

Synthetic Schedule Lead Was Misleading

The synthetic scheduled/completed index could report a large buffer while DeckLink still showed low actual device buffer depth.

Real DeckLink GetBufferedVideoFrameCount() telemetry was necessary to separate:

  • app-owned scheduled slots
  • synthetic schedule lead
  • actual hardware/device buffer depth

Lesson:

  • measure actual device buffer depth
  • keep synthetic counters only as diagnostics
  • do not infer device health from internal stream indexes alone
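
A sketch of sampling all three numbers side by side. deckLinkOutput is an IDeckLinkOutput*, GetBufferedVideoFrameCount() is the real SDK call, and the telemetry fields are illustrative:

```cpp
#include "DeckLinkAPI.h"
#include <cstdint>

struct PlayoutDepthTelemetry {              // illustrative fields
    uint32_t actualDeviceBufferedFrames = 0;
    uint64_t syntheticScheduleLead = 0;
};

void samplePlayoutDepth(IDeckLinkOutput* deckLinkOutput,
                        uint64_t scheduledIndex, uint64_t completedIndex,
                        PlayoutDepthTelemetry& out) {
    uint32_t deviceBuffered = 0;            // ground truth from the device
    if (deckLinkOutput->GetBufferedVideoFrameCount(&deviceBuffered) == S_OK)
        out.actualDeviceBufferedFrames = deviceBuffered;
    // Keep the synthetic counter, but only as a diagnostic alongside it.
    out.syntheticScheduleLead = scheduledIndex - completedIndex;
}
```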

More Buffer Is Not Automatically Smoother

Increasing DeckLink scheduled frames sometimes made the reported device buffer look healthier while visible motion still stuttered.

The problem was not only "how many frames are scheduled"; it was also whether the scheduled frames represented a stable render cadence.

Lesson:

  • buffer depth absorbs jitter, but it cannot fix bad cadence ownership
  • a full buffer of poorly timed or repeated frames can still look wrong

Speed-Up Catch-Up Was The Wrong Instinct

Letting the producer sprint to refill the buffer created new timing artifacts.

The render side should behave like a stable game/render loop: render at the selected cadence, record lateness, and only skip ticks when render/GPU work itself overruns.

Lesson:

  • the render thread should not render faster because DeckLink is empty
  • buffer drain is a failure signal, not a sprint signal
  • warmup should fill buffers before playback starts
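
A sketch of a time-driven tick that records lateness instead of sprinting, assuming app-side running(), renderOneOutputFrame(), and recordLateness() hooks:

```cpp
#include <chrono>
#include <thread>

bool running();                                       // app-side stubs
void renderOneOutputFrame();
void recordLateness(std::chrono::steady_clock::duration late);

void renderCadenceLoop() {
    using clock = std::chrono::steady_clock;
    const auto period =                               // 59.94 fps = 60000/1001
        std::chrono::nanoseconds(1'000'000'000ull * 1001 / 60000);
    auto next = clock::now();
    while (running()) {
        renderOneOutputFrame();                       // exactly one frame per tick
        next += period;
        const auto now = clock::now();
        if (now > next) {                             // render/GPU work overran
            recordLateness(now - next);               // telemetry, not a sprint
            while (next < now) next += period;        // skip ticks, never catch up
        }
        std::this_thread::sleep_until(next);          // a drained DeckLink buffer
    }                                                 // never shortens this sleep
}
```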

GPU Readback Lessons

The Original Readback Path Caused The Major Collapse

Early Phase 7.5 telemetry showed glReadPixels(..., nullptr) into the PBO costing roughly 8-14 ms on representative samples. That was enough to collapse ready depth and cause long freezes.

Direct synchronous readback was worse on the sampled machine.

Cached-output mode, while visually invalid for live output, immediately recovered timing. That proved ongoing GPU-to-CPU transfer was the major cost in that version of the path.

Lesson:

  • isolate readback cost from render cost
  • use intentionally invalid cached-output experiments when diagnosing throughput
  • do not assume async PBO is actually cheap on every format/driver path
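
A sketch of instrumenting the two readback stages separately so queue cost cannot hide inside render cost. The GL calls are standard, GLEW is assumed for loading, and the metric names mirror the telemetry discussed above:

```cpp
#include <GL/glew.h>
#include <chrono>
#include <cstdio>
#include <cstring>

static double msSince(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

// Stage 1: queue the transfer into the PBO and time only the queue call.
GLsync queueReadback(GLuint pbo, int width, int height) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    const auto t0 = std::chrono::steady_clock::now();
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
    std::printf("asyncQueueReadPixelsMs=%.2f\n", msSince(t0)); // 8-14 ms = not truly async
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return fence;
}

// Stage 2, ideally a frame later: time fence wait, map, and copy separately.
void mapReadback(GLuint pbo, GLsync fence, size_t byteCount, void* slot) {
    const auto t0 = std::chrono::steady_clock::now();
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 5'000'000); // 5 ms budget
    glDeleteSync(fence);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    if (void* src = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, byteCount, GL_MAP_READ_BIT)) {
        std::memcpy(slot, src, byteCount);   // into the system-memory slot
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    std::printf("readbackFenceMapCopyMs=%.2f\n", msSince(t0));
}
```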

BGRA8 Packing Changed The Problem

Changing the output path so readback matched the DeckLink BGRA8 format made asyncQueueReadPixelsMs drop dramatically in sampled runs.

Long pauses disappeared and the remaining issue became short stutters/cadence gaps.

Lesson:

  • output/readback format matters
  • avoid format conversions on the readback path when possible
  • BGRA8 is a good current format target for experiments
  • v210/YUV packing can be deferred until cadence is stable
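
A sketch of the format-matched queue step. GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV describes the same packed BGRA8 layout that DeckLink's bmdFormat8BitBGRA expects, which is commonly the driver's conversion-free path; treat the exact fast path as driver-dependent:

```cpp
#include <GL/glew.h>

// Queue a readback whose layout matches the DeckLink frame exactly:
// an RGBA8 FBO read back as packed BGRA, scheduled as bmdFormat8BitBGRA.
void queueBgra8Readback(GLuint pbo, int width, int height) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glPixelStorei(GL_PACK_ALIGNMENT, 4);               // rowBytes = width * 4
    glReadPixels(0, 0, width, height,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, // matches BGRA8 in memory
                 nullptr);                             // offset 0 into the bound PBO
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```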

SDK Fast-Transfer Paths Are Not Universal

The SDK OpenGL fast-transfer path depends on hardware/extension support that was not present on the RTX 4060 Ti test machine:

  • NVIDIA DVP path was gated around Quadro-style support
  • GL_AMD_pinned_memory was not exposed

Lesson:

  • SDK fast-transfer samples are useful references but not a universal fix
  • unsupported fast-transfer code should not be central to the architecture
  • the default path must work with ordinary consumer GPUs
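
A sketch of gating on actual support rather than assuming it, using core-profile extension enumeration:

```cpp
#include <GL/glew.h>
#include <cstring>

bool hasGlExtension(const char* name) {
    GLint count = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &count);
    for (GLint i = 0; i < count; ++i) {
        const char* ext =
            reinterpret_cast<const char*>(glGetStringi(GL_EXTENSIONS, i));
        if (ext && std::strcmp(ext, name) == 0) return true;
    }
    return false;
}

// Fast-transfer is opt-in; the default path must work without it.
const bool usePinnedMemoryPath = hasGlExtension("GL_AMD_pinned_memory");
```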

System-Memory Slots Are The Right Ownership Model

Using CreateVideoFrameWithBuffer() lets DeckLink schedule frames backed by our system-memory slots.

That is the right ownership model for this app:

  • render/readback writes into a slot
  • DeckLink schedules a frame that references that slot
  • the slot is protected until DeckLink completion

Lesson:

  • system-memory slots are the contract between render and playout
  • scheduled slots must not be recycled early
  • completed-but-unscheduled slots can be latest-N cache entries
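
A sketch of that ownership model as explicit slot states; the fields are illustrative:

```cpp
#include <cstdint>
#include <vector>

enum class SlotState {
    Free,       // available for the renderer
    Rendering,  // render/readback is writing into the slot
    Completed,  // readback done; latest-N cache entry, recyclable
    Scheduled,  // referenced by a DeckLink frame; must not be recycled
};

struct FrameSlot {
    std::vector<uint8_t> pixels;       // system-memory BGRA8 bytes
    SlotState state = SlotState::Free;
    uint64_t  frameIndex = 0;          // which output frame the bytes hold
};
```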

Startup Needs Real Preroll

Starting scheduled playback before real rendered frames exist creates avoidable startup fragility.

The better startup shape is:

  • prepare the DeckLink schedule
  • start render cadence
  • render warmup frames at normal cadence
  • schedule those frames as preroll
  • start DeckLink scheduled playback

Lesson:

  • do not use black preroll as the normal startup path
  • do not render faster during warmup
  • if warmup cannot fill in a bounded time, fail/degrade visibly
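
A sketch of that startup shape against the real DeckLink scheduling calls. renderWarmupFrame() and enterDegradedState() are app-side assumptions, and frameDuration/timeScale come from the selected display mode:

```cpp
#include "DeckLinkAPI.h"

IDeckLinkVideoFrame* renderWarmupFrame();  // app-side: one real frame, normal cadence
void enterDegradedState();                 // app-side: explicit, visible failure

bool startPlayout(IDeckLinkOutput* output,
                  BMDTimeValue frameDuration, BMDTimeScale timeScale) {
    const int kPrerollFrames = 4;          // illustrative preroll depth
    for (int i = 0; i < kPrerollFrames; ++i) {
        IDeckLinkVideoFrame* frame = renderWarmupFrame();
        if (frame == nullptr) {            // warmup could not fill in bounded time
            enterDegradedState();
            return false;
        }
        output->ScheduleVideoFrame(frame, i * frameDuration, frameDuration, timeScale);
    }
    // Start the hardware clock only after real frames are queued.
    return output->StartScheduledPlayback(0, timeScale, 1.0) == S_OK;
}
```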

Buffering Lessons

There Are Two Different Buffers

The app has at least two important frame stores:

  • system-memory completed/latest-N frames
  • DeckLink scheduled/device buffer

They have different ownership rules.

Completed-but-unscheduled frames are disposable if a newer frame is available and cadence needs the slot.

Scheduled frames are not disposable because DeckLink may still read them.

Lesson:

  • latest-N completed frames are a cache
  • scheduled frames are owned by DeckLink until completion
  • keep metrics for both

Consume-Before-Render Is The Wrong Model For Completed Frames

If the render cadence waits for completed frames to be consumed, DeckLink timing can indirectly slow the renderer.

That couples the clocks again.

Lesson:

  • render cadence should keep rendering at selected cadence
  • if completed cache is full, recycle/drop the oldest unscheduled completed frame
  • only scheduled/in-flight saturation should prevent rendering to a safe slot
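
A sketch of slot acquisition under that rule, building on the slot-state sketch above; it assumes the caller holds the frame-store lock:

```cpp
#include <vector>

FrameSlot* acquireRenderSlot(std::vector<FrameSlot>& slots) {
    for (auto& s : slots)
        if (s.state == SlotState::Free) return &s;    // prefer a free slot
    FrameSlot* oldest = nullptr;                       // else recycle the oldest
    for (auto& s : slots)                              // completed-but-unscheduled
        if (s.state == SlotState::Completed &&
            (oldest == nullptr || s.frameIndex < oldest->frameIndex))
            oldest = &s;
    // nullptr only when every slot is Rendering or Scheduled; that saturation
    // is the one legitimate reason to skip a render tick.
    return oldest;
}
```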

Render Thread Lessons

The Current Render Thread Is Still Shared

The GL render thread currently handles:

  • output rendering
  • input upload
  • preview present
  • screenshot capture
  • render reset commands
  • shader/resource operations

Output render can therefore be delayed by queued or inline work.

Lesson:

  • "one GL thread" is not the same as "one output cadence thread"
  • output render should become the highest-priority GL operation
  • non-output GL work needs budgets, coalescing, or deferral

Input Upload Is A Suspect Timing Coupling

Output render currently processes input upload work immediately before rendering the output frame.

That keeps input fresh but can steal time from the exact frame we are trying to render on cadence.

Lesson:

  • measure input upload count and time immediately before output render
  • test policies such as one_before_output or skip_before_output
  • prefer latest-input semantics over draining every pending upload

CPU Input Conversion Can Be Worse Than Input Copy

When DeckLink input only exposed UYVY8 on the test machine, an initial CPU UYVY-to-BGRA conversion in the input callback cost roughly one full frame interval on sampled runs and reduced input cadence dramatically.

Moving the input edge to raw UYVY8 capture changed the ownership:

  • DeckLink callback copies raw supported input bytes into InputFrameMailbox
  • the mailbox keeps latest-frame semantics and uses a contiguous copy when row strides match
  • the render thread uploads/decodes UYVY8 into the shader-visible gVideoInput texture
  • runtime shaders continue to see decoded input, not packed capture bytes

Lesson:

  • keep input callbacks as capture/copy edges
  • keep GL decode/upload in the render-owned path
  • measure input copy, upload, and decode separately
  • do not hide expensive format conversion inside the DeckLink callback
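
A sketch of the latest-frame mailbox semantics. InputFrameMailbox is the app's name; this minimal version assumes matching row strides, so one contiguous copy:

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <vector>

class InputFrameMailbox {
public:
    // Capture edge (DeckLink input callback thread): copy raw UYVY8 bytes.
    void publish(const uint8_t* data, size_t byteCount) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.assign(data, data + byteCount);  // overwrite: latest wins
        fresh_ = true;
    }

    // Render thread: take the newest frame, if any; older frames are gone.
    std::optional<std::vector<uint8_t>> takeLatest() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!fresh_) return std::nullopt;
        fresh_ = false;
        return std::move(pending_);  // caller decodes/uploads to gVideoInput
    }

private:
    std::mutex mutex_;
    std::vector<uint8_t> pending_;
    bool fresh_ = false;
};
```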

Preview And Screenshot Must Stay Secondary

Preview is useful, but DeckLink output is the real-time path.

Screenshot and preview share GL resources and can block or queue work on the same render thread.

Lesson:

  • preview should be skipped when output is under pressure
  • screenshot capture should be treated as disruptive unless proven otherwise
  • forced preview/screenshot should be visible in telemetry

Telemetry Lessons

The useful telemetry has been the kind that separates domains:

  • output render queue wait
  • render/draw time
  • readback queue time
  • readback fence/map/copy time
  • app ready/completed queue depth
  • system-memory free/rendering/completed/scheduled counts
  • actual DeckLink buffered-frame count
  • DeckLink schedule-call time/failures
  • late/drop completion counts

Lesson:

  • averages are not enough
  • timing spikes matter more than steady low values
  • count ownership states, not just queue depth
  • keep experiment logs short and evidence-based

Current Architectural Direction

The current direction is still sound:

Render cadence loop
  • renders at the selected output cadence
  • writes latest-N completed system-memory frames
  • never sprints to refill DeckLink

Frame store
  • owns free / rendering / completed / scheduled slots
  • recycles unscheduled completed frames when needed
  • protects scheduled frames until completion

DeckLink playout scheduler
  • consumes completed frames
  • tops up the actual device buffer
  • never renders

Completion callback
  • releases scheduled slots
  • records the completion result
  • wakes the scheduler

Rewrite Lesson

A full rewrite is not obviously the right next move.

The current repo now contains:

  • working runtime/control architecture
  • useful phase docs
  • non-GL tests around key state machines
  • real telemetry
  • a clearer understanding of DeckLink and OpenGL timing

The better next step is likely a contained "V2 spine" inside the current app:

  • harden the render cadence loop
  • harden the frame store
  • separate DeckLink scheduling
  • demote preview/screenshot/input upload below output cadence
  • delete old compatibility branches as they become unnecessary

A full rewrite becomes attractive only if the current GL ownership model cannot be made deterministic without excessive surgery, or if the project switches rendering API.

Practical Rules Going Forward

  • One timing authority per domain.
  • Render cadence is time-driven, not completion-driven.
  • DeckLink scheduling is device-buffer-driven, not render-driven.
  • Completion callbacks release and report; they do not render.
  • System-memory completed frames are latest-N cache entries.
  • Scheduled frames are protected until DeckLink completion.
  • Startup uses real rendered warmup/preroll.
  • Black fallback is degraded/error behavior, not steady-state behavior.
  • Output render has priority over preview, screenshot, and bulk input upload.
  • Measure before adding recovery branches.