
Aiden bf23cd880a


2026-05-12 00:35:01 +10:00


Phase 7.6: System-Memory Playout Buffer Design

Status

In progress.

Implemented so far:

BGRA8 SystemOutputFramePool with non-GL tests
render/readback production now writes into app-owned system-memory slots
DeckLink output scheduling can wrap system-memory slots with CreateVideoFrameWithBuffer()
DeckLink completion callbacks release scheduled system-memory slots
ready-queue discard paths release owned frames instead of leaking slots
telemetry scaffolding exposes free, ready, and scheduled system-memory frame counts
async PBO readback is now a deeper pipeline by default and ordinary misses no longer flush queued readbacks
the output producer now honors requested burst production when the ready queue is below target instead of producing only one frame per wake

Still to verify/tune on hardware:

sustained DeckLink buffer depth
frame age at schedule/completion
repeat/underrun policy behavior under real stalls
whether deeper async readback reduces sawtooth buffer drain
whether BGRA8 bandwidth is sufficient before considering v210
whether burst filling keeps readyQueue.depth above zero and reduces the remaining short stutters

Phase 7.5 isolated the current playout timing problem around output readback and DeckLink scheduling pressure. The fast-transfer path from the DeckLink OpenGL sample is not available on the current test GPU, so the next direction is to make the normal path behave more like broadcast playout systems: render ahead, read back into system-memory frame buffers, and let DeckLink consume already-complete frames.

This phase is not a move away from rendering every frame. It is a move away from making DeckLink wait for each frame to be rendered and read back at the moment it needs to be scheduled.

SDK Finding: RGBA8 Is Not Required

DeckLink output frames do not have to be RGBA8/BGRA8.

The SDK accepts a BMDPixelFormat when creating output frames. The available formats include:

bmdFormat8BitYUV
bmdFormat10BitYUV
bmdFormat10BitYUVA
bmdFormat8BitARGB
bmdFormat8BitBGRA
bmdFormat10BitRGB
bmdFormat12BitRGB

The SDK samples also use non-RGBA output paths:

FilePlayback converts unsupported source frames to bmdFormat10BitYUV
PlaybackStills uses bmdFormat10BitYUV
InputLoopThrough handles bmdFormat10BitYUV and related formats
SignalGen exposes 8-bit YUV, 10-bit YUV, 8-bit RGB, and 10-bit RGB choices
OpenGLOutput uses BGRA for that sample path, but that is not a DeckLink API requirement

The current app already has partial support for this direction:

DeckLinkSession probes support for bmdFormat10BitYUV and bmdFormat10BitYUVA
VideoIOPixelFormat::V210 maps to bmdFormat10BitYUV
VideoIOPixelFormat::Yuva10 maps to bmdFormat10BitYUVA
RenderEngine has a 10-bit output packing path
row-byte calculation already distinguishes bmdFormat10BitYUV, bmdFormat10BitYUVA, and BGRA-style formats

Packing on the GPU before readback therefore remains available as a later step if bandwidth proves to be the remaining bottleneck. For the first Phase 7.6 implementation, keep BGRA8 as the active output format and focus on the larger architectural problem: DeckLink should schedule from completed system-memory frames instead of waiting on the current render/readback operation.

Goal

Create a buffered system-memory playout path:

render every output frame
keep BGRA8 as the first output/readback format
read back into reusable CPU/system-memory frame slots
keep a small queue of completed frames ahead of DeckLink
schedule DeckLink from completed frames rather than from in-progress rendering
preserve telemetry so every experiment can be compared against Phase 7.5

Non-Goals

Do not reintroduce NVIDIA DVP or AMD pinned-memory as a required path.
Do not hide dropped, repeated, or late frames.
Do not make cached-output playback the default production behavior.
Do not add a large latency buffer without making that latency explicit.
Do not rewrite shader/effect evaluation unless profiling proves it is the bottleneck.
Do not make v210/YUV packing part of the first implementation unless BGRA8 buffering is proven insufficient.

Architecture

Current Problem Shape

The current path is still too close to:

DeckLink needs a frame.
App renders or finalizes a frame.
App reads back from GL.
App schedules the frame.

That can work only if every step reliably fits inside the frame budget. When readback stalls or scheduling is delayed, DeckLink sees a shallow buffer and playback freezes.

Target Shape

The target path is:

Render producer prepares future frames.
Producer acquires a free system-memory frame slot.
GPU output is read back as BGRA8 into that slot.
Completed slots enter a ready queue.
DeckLink scheduler consumes ready slots at output cadence.
Completion callbacks release slots back to the pool.

This gives the scheduler a small cushion without sacrificing rendered frames.

Pixel Format Strategy

First Target: BGRA8

Use BGRA8 as the first serious output target.

Reasons:

it is the path closest to the current renderer
it avoids introducing color-space packing risk while the buffering architecture is still being proven
it keeps alpha/keying behavior easier to reason about
it lets Phase 7.6 isolate scheduling/readback ownership from pixel-format conversion

Known byte cost at 1920x1080:

BGRA8: about 8.29 MB per frame

That is larger than v210, but the immediate hypothesis is that the freezes come from scheduling coupling and readback stalls, not only raw byte count. Prove or disprove that with the system-memory queue first.

Later Target: 10-bit YUV / v210

Keep v210 available as a later optimization.

Reasons to revisit it:

it is a native DeckLink output format
it can reduce 1920x1080 readback size from about 8.29 MB per BGRA8 frame to about 5.53 MB per v210 frame
it may better match final video I/O expectations for fill-only output

Do this only after the BGRA8 system-memory queue is measured. If BGRA8 buffering keeps DeckLink healthy, v210 becomes a quality/bandwidth refinement rather than a rescue path.
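The per-frame figures above can be checked with a small calculation. This is a minimal sketch: `SketchPixelFormat`, `RowBytes`, and `FrameBytes` are hypothetical names for illustration, not the app's existing row-byte code. It assumes the usual v210 layout, where 6 pixels pack into 16 bytes and each row is padded to a multiple of 48 pixels (128 bytes).

```cpp
#include <cstddef>

// Hypothetical enum for this sketch only; the app's own type is VideoIOPixelFormat.
enum class SketchPixelFormat { Bgra8, V210 };

// Row stride in bytes.
// BGRA8: 4 bytes per pixel.
// v210:  rows padded to a multiple of 48 pixels, 128 bytes per 48-pixel group.
constexpr std::size_t RowBytes(SketchPixelFormat format, std::size_t width) {
    return format == SketchPixelFormat::V210
        ? ((width + 47) / 48) * 128
        : width * 4;
}

// Total readback size for one frame.
constexpr std::size_t FrameBytes(SketchPixelFormat format,
                                 std::size_t width,
                                 std::size_t height) {
    return RowBytes(format, width) * height;
}

// 1920x1080 figures quoted in this document.
static_assert(FrameBytes(SketchPixelFormat::Bgra8, 1920, 1080) == 8294400,
              "BGRA8 is about 8.29 MB per frame");
static_assert(FrameBytes(SketchPixelFormat::V210, 1920, 1080) == 5529600,
              "v210 is about 5.53 MB per frame");
```

At 1920x1080 the v210 frame is two thirds the size of the BGRA8 frame, which is why it stays on the list as a bandwidth refinement rather than a first step.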

Alpha / Keying

For the first implementation, BGRA8 remains the default target.

For alpha/key workflows, bmdFormat10BitYUVA may be needed, or key/fill may need to remain split depending on the device mode and keyer configuration. Phase 7.6 should make this explicit rather than assuming one format fits both.

Proposed Components

`SystemOutputFramePool`

Owns reusable CPU-side frame slots.

Responsibilities:

allocate a fixed number of output slots
expose free slots to the render/readback producer
expose completed slots to the DeckLink scheduler
track slot generation, frame id, frame time, and pixel format
prevent reuse while DeckLink still owns or may read the frame
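The responsibilities above amount to a four-state slot life cycle: free, being written, ready, and scheduled. The following is a minimal sketch of that life cycle, not the app's actual `SystemOutputFramePool`; the class name `SketchFramePool`, its method names, and the index-based handles are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Hypothetical sketch of the pool's slot life cycle; not the app's real class.
class SketchFramePool {
public:
    enum class State { Free, Writing, Ready, Scheduled };

    SketchFramePool(std::size_t slotCount, std::size_t bytesPerFrame)
        : slots_(slotCount) {
        for (Slot& slot : slots_) slot.bytes.resize(bytesPerFrame);
    }

    // Producer side: claim a free slot to read back into, if one exists.
    std::optional<std::size_t> AcquireFree() {
        std::lock_guard<std::mutex> lock(mutex_);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (slots_[i].state == State::Free) {
                slots_[i].state = State::Writing;
                ++slots_[i].generation;  // bumped on every reuse
                return i;
            }
        }
        return std::nullopt;
    }

    // Producer side: readback finished, so the slot joins the ready queue.
    void PublishReady(std::size_t index, std::uint64_t frameId) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(slots_[index].state == State::Writing);
        slots_[index].state = State::Ready;
        slots_[index].frameId = frameId;
        ready_.push_back(index);
    }

    // Scheduler side: take the oldest completed slot. It stays owned by the
    // scheduler (and DeckLink) until Release() is called.
    std::optional<std::size_t> TakeReadyForSchedule() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (ready_.empty()) return std::nullopt;
        const std::size_t index = ready_.front();
        ready_.pop_front();
        slots_[index].state = State::Scheduled;
        return index;
    }

    // Completion-callback side: DeckLink is done reading the bytes.
    void Release(std::size_t index) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(slots_[index].state == State::Scheduled);
        slots_[index].state = State::Free;
    }

    // Gauge for telemetry: how many slots are currently in a given state.
    std::size_t Count(State state) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::size_t n = 0;
        for (const Slot& slot : slots_) n += (slot.state == state) ? 1 : 0;
        return n;
    }

private:
    struct Slot {
        std::vector<std::uint8_t> bytes;
        State state = State::Free;
        std::uint64_t generation = 0;
        std::uint64_t frameId = 0;
    };

    mutable std::mutex mutex_;
    std::vector<Slot> slots_;
    std::deque<std::size_t> ready_;
};
```

Because a slot only returns to `Free` through `Release()`, a scheduled slot cannot be handed back to the producer while DeckLink may still read it. The sketch has no OpenGL or DeckLink dependency, so it can be exercised in plain unit tests.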

`OutputFrameSlot`

Represents one CPU/system-memory playout frame.

Likely contents:

pointer to writable frame bytes
row bytes
width and height
BMDPixelFormat or app-level equivalent
frame number / stream time
timing metadata
completion state
optional DeckLink frame wrapper
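One possible shape for the slot, following the list above. Every name here (`OutputFrameSlotSketch`, `SlotPixelFormat`, `SlotState`, `MakeBgra8Slot`) is hypothetical, and the DeckLink wrapper is left as an opaque pointer because this sketch does not include the SDK headers.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the app-level pixel format and completion state.
enum class SlotPixelFormat { Bgra8, V210, Yuva10 };
enum class SlotState { Free, Writing, Ready, Scheduled };

// One CPU/system-memory playout frame (illustrative field names).
struct OutputFrameSlotSketch {
    std::vector<std::uint8_t> bytes;      // writable frame bytes, owned by the slot
    std::size_t rowBytes = 0;             // stride of one row in bytes
    std::uint32_t width = 0;
    std::uint32_t height = 0;
    SlotPixelFormat pixelFormat = SlotPixelFormat::Bgra8;
    std::uint64_t frameNumber = 0;        // producer-assigned frame id
    std::int64_t streamTime = 0;          // in output timescale units
    std::chrono::steady_clock::time_point readbackCompletedAt{};  // timing metadata
    SlotState state = SlotState::Free;    // completion state
    void* deckLinkFrame = nullptr;        // optional wrapper, opaque in this sketch
};

// Allocates a zeroed BGRA8 slot with a tightly packed 4-bytes-per-pixel stride.
inline OutputFrameSlotSketch MakeBgra8Slot(std::uint32_t width, std::uint32_t height) {
    OutputFrameSlotSketch slot;
    slot.width = width;
    slot.height = height;
    slot.pixelFormat = SlotPixelFormat::Bgra8;
    slot.rowBytes = static_cast<std::size_t>(width) * 4;
    slot.bytes.assign(slot.rowBytes * height, 0);
    return slot;
}
```

Recording the readback completion time in the slot is what makes frame age at schedule and at completion measurable later.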

`DeckLinkOutputFrameAdapter`

Bridges app-owned memory to DeckLink output frames.

Options to evaluate:

create DeckLink frames with app-owned buffers where supported by the SDK
keep DeckLink-created frames in the pool and write directly into their bytes
wrap app memory behind a small IDeckLinkVideoFrame implementation only if needed

The simplest production path should avoid an extra CPU copy between app memory and DeckLink memory.

`OutputFrameProducer`

Runs on or is driven by the render thread.

Responsibilities:

acquire a free system frame slot
render the next frame
read back BGRA8 into the slot
publish the slot to the ready queue
record readback timings
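The Status section notes that the producer now produces in bursts when the ready queue is below target. A minimal sketch of how the burst size per wake could be chosen is shown here. The struct, the function name, and the rule that an at-target queue produces nothing are assumptions for illustration, not a description of the app's actual producer.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical inputs sampled when the producer wakes.
struct ProducerWakeInputs {
    std::size_t readyDepth;        // completed frames waiting in the ready queue
    std::size_t targetReadyDepth;  // configured target for the ready queue
    std::size_t freeSlots;         // system-memory slots available to write into
    std::size_t maxBurst;          // cap on frames rendered in a single wake
};

// Returns how many frames to render and read back during this wake.
// The deficit against the target is limited by free slots and the burst cap,
// so the producer never waits on a slot DeckLink still owns.
inline std::size_t FramesToProduceThisWake(const ProducerWakeInputs& in) {
    if (in.readyDepth >= in.targetReadyDepth) return 0;
    const std::size_t deficit = in.targetReadyDepth - in.readyDepth;
    return std::min({deficit, in.freeSlots, in.maxBurst});
}
```

With an empty ready queue, a target of 3, and enough free slots, one wake yields three frames rather than one, which is the behavior intended to keep the ready queue depth above zero.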

`DeckLinkPlayoutScheduler`

Consumes completed system frames.

Responsibilities:

keep DeckLink scheduled ahead by the configured target depth
schedule from the ready queue
repeat/drop according to explicit policy when the queue is empty or too deep
release frame slots after DeckLink completion callbacks
report buffer depth and scheduling lead
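The top-up behavior can be sketched without DeckLink by treating frames as ids. The names below (`ScheduleDecision`, `SchedulerCounters`, `TopUpSchedule`) are hypothetical. The rule of scheduling at most one repeat per call on underflow is an assumption chosen to keep the sketch from filling the output queue with stale frames; the real policy is still to be decided on hardware.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// One scheduling action for the caller to carry out (hypothetical type).
struct ScheduleDecision {
    std::uint64_t frameId;
    bool isRepeat;
};

struct SchedulerCounters {
    std::uint64_t underruns = 0;
    std::uint64_t repeats = 0;
};

// Decides what to schedule so the DeckLink queue reaches targetDepth.
// Fresh frames come from readyQueue in order. If the queue runs dry, the last
// scheduled frame is repeated once and the call stops; with nothing to repeat
// (startup), the underrun is counted and the caller decides what to do.
inline std::vector<ScheduleDecision> TopUpSchedule(
    std::deque<std::uint64_t>& readyQueue,
    std::size_t scheduledDepth,
    std::size_t targetDepth,
    std::optional<std::uint64_t>& lastScheduled,
    SchedulerCounters& counters) {
    std::vector<ScheduleDecision> decisions;
    while (scheduledDepth + decisions.size() < targetDepth) {
        if (!readyQueue.empty()) {
            const std::uint64_t id = readyQueue.front();
            readyQueue.pop_front();
            lastScheduled = id;
            decisions.push_back({id, false});
            continue;
        }
        ++counters.underruns;  // never hidden: every empty-queue hit is counted
        if (lastScheduled.has_value()) {
            ++counters.repeats;
            decisions.push_back({*lastScheduled, true});
        }
        break;
    }
    return decisions;
}
```

Note that repeating a frame whose system-memory slot is still scheduled means the slot must not be released until every scheduled use of it has completed; the sketch leaves that bookkeeping to the caller.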

Migration Plan

Step 1: Make Output Pixel Format Explicit Everywhere

Current format selection exists, but Phase 7.6 should make it impossible to confuse render texture format, readback format, and DeckLink scheduled format.

Deliverables:

log selected DeckLink output pixel format at startup
expose readback bytes per frame in telemetry
expose whether the frame was BGRA, v210, or YUVA
make BGRA8 the default and first supported system-buffer path

Step 2: Introduce the BGRA8 System Frame Pool

Add a fixed-size pool of BGRA8 system-memory output slots.

Initial target depth:

3 ready/scheduled frames minimum
5 frames as the practical DeckLink-health target
configurable for experiments

The pool should be testable without OpenGL or DeckLink hardware.

Step 3: Read Back BGRA8 Into Pool Slots

Move readback output away from transient buffers and into acquired frame slots.

The producer must never block DeckLink scheduling while waiting for a free slot if a safe repeat/drop policy can keep playback alive.

Step 4: Schedule From Completed Slots

Change DeckLink scheduling to consume completed system frames.

DeckLink callbacks should become the point where slots are returned to the pool.

This is the main behavioral change: scheduling no longer waits for the active render/readback operation.

Step 5: Add Playout Policies

Make underflow and overflow behavior explicit.

Possible policies:

repeat last completed frame on underflow
schedule black on startup only
drop oldest completed frame if the producer gets too far ahead
preserve most recent frame for live-control responsiveness

The default should favor stable output cadence and visible telemetry over silent correctness guesses.
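The overflow side of these policies, "drop oldest completed frame if the producer gets too far ahead", can be sketched as a trim over the ready queue. The function name and the use of plain frame ids are assumptions for illustration. Dropped entries are returned to the caller so their slots can be released to the pool and counted, rather than leaked or hidden.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Trims readyQueue to at most maxReadyDepth entries by dropping the oldest.
// Returns the dropped ids, oldest first, so the caller can release their slots
// and add the count to a drop counter.
inline std::vector<std::uint64_t> DropOldestBeyond(
    std::deque<std::uint64_t>& readyQueue,
    std::size_t maxReadyDepth) {
    std::vector<std::uint64_t> dropped;
    while (readyQueue.size() > maxReadyDepth) {
        dropped.push_back(readyQueue.front());
        readyQueue.pop_front();
    }
    return dropped;
}
```

Dropping from the front keeps the most recent frames, which matches the "preserve most recent frame for live-control responsiveness" option at the cost of skipping rendered frames.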

Step 6: Tune Buffer Depth and Latency

Measure:

render time
readback time
CPU copy time, if any
ready queue depth
scheduled queue depth
frame age at schedule time
frame age at completion callback
repeats, drops, and underruns

Then choose a default buffer depth that keeps DeckLink healthy without adding unnecessary latency.
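To make the latency cost of a given depth explicit, the added delay is simply depth times frame duration. A small sketch follows; the function name is hypothetical, and the frame rates in the comments (59.94 and 25 fps) are example values, since this document does not state the output mode. Frame duration is expressed as a rational duration/timescale pair, in the style DeckLink display modes use.

```cpp
#include <cstdint>

// Latency added by holding depthFrames completed frames ahead of output.
// One frame lasts frameDuration / timeScale seconds,
// for example 1001 / 60000 for 59.94 fps or 1000 / 25000 for 25 fps.
constexpr double BufferLatencyMs(std::uint32_t depthFrames,
                                 std::int64_t frameDuration,
                                 std::int64_t timeScale) {
    return depthFrames * 1000.0 * static_cast<double>(frameDuration)
           / static_cast<double>(timeScale);
}

// Example: 3 frames at 25 fps hold back 120 ms of output.
static_assert(BufferLatencyMs(3, 1000, 25000) == 120.0, "3 frames at 25 fps");
```

Under the same assumptions, the 5-frame health target at 59.94 fps works out to roughly 83.4 ms, and the 3-frame minimum to roughly 50 ms.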

Step 7: Optional v210 Experiment

Only after BGRA8 buffering has been measured, add a runtime option that forces:

GPU pack to v210
readback of packed v210 bytes
DeckLink scheduling as bmdFormat10BitYUV

This should be compared against the completed BGRA8 system-memory path, not against the older coupled path.

Telemetry

Keep the Phase 7.5 counters and add:

outputPixelFormat
outputReadbackBytes
outputPackMode
systemFramePoolFree
systemFramePoolReady
systemFramePoolScheduled
systemFrameAgeAtScheduleMs
systemFrameAgeAtCompletionMs
systemFrameUnderruns
systemFrameRepeats
systemFrameDrops
deckLinkScheduleLeadFrames
deckLinkScheduleLeadMs

Telemetry scaffolding can land before the frame pool itself. Until SystemOutputFramePool exists, these fields should remain producer-owned gauges/counters with default zero values in HealthTelemetry; they should not be inferred from the existing render-ready queue or DeckLink pool because those are adjacent concepts, not the final free/ready/scheduled system-memory slot model.
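A minimal sketch of what producer-owned gauges and counters with zero defaults could look like. The struct and helper names are hypothetical and this is not the layout of the app's `HealthTelemetry`; only the field names are taken from the list above. Atomics are used on the assumption that the render thread, the scheduler, and DeckLink callbacks write from different threads.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical telemetry block: gauges are overwritten, counters only grow.
struct SystemFrameTelemetrySketch {
    // Gauges: current slot counts per state.
    std::atomic<std::uint32_t> systemFramePoolFree{0};
    std::atomic<std::uint32_t> systemFramePoolReady{0};
    std::atomic<std::uint32_t> systemFramePoolScheduled{0};
    // Gauges: most recent frame ages in milliseconds.
    std::atomic<double> systemFrameAgeAtScheduleMs{0.0};
    std::atomic<double> systemFrameAgeAtCompletionMs{0.0};
    // Counters: monotonically increasing event totals.
    std::atomic<std::uint64_t> systemFrameUnderruns{0};
    std::atomic<std::uint64_t> systemFrameRepeats{0};
    std::atomic<std::uint64_t> systemFrameDrops{0};
};

// Called by the pool owner whenever slot states change.
inline void PublishPoolGauges(SystemFrameTelemetrySketch& t,
                              std::uint32_t freeCount,
                              std::uint32_t readyCount,
                              std::uint32_t scheduledCount) {
    t.systemFramePoolFree.store(freeCount, std::memory_order_relaxed);
    t.systemFramePoolReady.store(readyCount, std::memory_order_relaxed);
    t.systemFramePoolScheduled.store(scheduledCount, std::memory_order_relaxed);
}

// Called by the scheduler when it had to repeat a frame on an empty queue.
inline void RecordUnderrunWithRepeat(SystemFrameTelemetrySketch& t) {
    t.systemFrameUnderruns.fetch_add(1, std::memory_order_relaxed);
    t.systemFrameRepeats.fetch_add(1, std::memory_order_relaxed);
}
```

Keeping the writes in the component that owns the state avoids inferring these values from adjacent queues.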

Existing counters that should remain useful:

render frame time
async queue time
readback timing
output queue depth
displayed late count
dropped count
DeckLink buffered frame count

Tests

Add non-GL tests for:

frame pool acquire/publish/consume/release
slots are not reused while scheduled
underflow repeats the last completed frame when configured
overflow drops according to policy
row-byte and byte-size calculation for BGRA8 first, with v210 and YUVA covered when those modes are enabled
scheduler consumes only completed frames
completion callback releases the expected slot

Hardware/manual tests:

BGRA8 system-buffered output works
DeckLink buffer depth stays healthy
no black-frame startup longer than configured preroll
shutdown drains or releases scheduled slots safely

Risks

DeckLink frame ownership rules may force one extra copy if app-owned buffers are not accepted in the exact path we use.
Buffering improves cadence but adds latency.
If GPU readback itself remains slower than real time, buffering only delays the underflow.
v210 remains a future optimization and may still carry color-space/keying risk when introduced.

Exit Criteria

Phase 7.6 is complete when:

DeckLink output format is explicit and logged
BGRA8 system-memory output slots are the default playout path
completed system-memory frames are queued ahead of DeckLink scheduling
DeckLink callbacks release/recycle frame slots
ready/scheduled buffer depth is visible in telemetry
underflow/repeat/drop behavior is explicit and tested
the app can sustain a healthy DeckLink buffer without using cached-output playback
