Phase 7.5 Readback Experiment Log

This log tracks short readback experiments during the proactive playout timing work.

How To Run

The default debugger launch keeps the current production path:

  • Debug LoopThroughWithOpenGLCompositing
  • VST_OUTPUT_READBACK_MODE unset
  • mode: async_pbo

Comparison modes are still available:

  • VST_OUTPUT_READBACK_MODE=async_pbo
      • uses the older PBO/fence readback path

The experiment launches are:

  • Debug LoopThroughWithOpenGLCompositing - sync readback experiment
      • VST_OUTPUT_READBACK_MODE=sync
      • uses direct synchronous glReadPixels() every output frame
  • Debug LoopThroughWithOpenGLCompositing - cached output experiment
      • VST_OUTPUT_READBACK_MODE=cached_only
      • uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
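
For reference, a minimal sketch of how the mode switch could be read at startup; the enum and function names are hypothetical, and only the VST_OUTPUT_READBACK_MODE variable and its values come from this log.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical mode enum; names chosen for this sketch only.
enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

// Read VST_OUTPUT_READBACK_MODE and fall back to the production
// async PBO path when the variable is unset or unrecognised.
ReadbackMode selectReadbackMode()
{
    const char* raw = std::getenv("VST_OUTPUT_READBACK_MODE");
    const std::string value = raw ? raw : "";

    if (value == "sync")
        return ReadbackMode::Sync;
    if (value == "cached_only")
        return ReadbackMode::CachedOnly;
    return ReadbackMode::AsyncPbo; // default, equivalent to "async_pbo"
}
```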

Experiment 3: fast_transfer

Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

  • DeckLink output frames are now created with CreateVideoFrameWithBuffer().
  • Output frame buffers are owned by PinnedMemoryAllocator.
  • VideoIOOutputFrame carries a texture-transfer callback.
  • The test branch changed the default render readback path to try VideoFrameTransfer::GPUtoCPU against the output texture for BGRA output.
  • If fast transfer is unavailable or fails, the code falls back to cached output if present, then synchronous readback as a safety fallback.
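
A rough sketch of that fallback order, assuming hypothetical helper functions for each path; it is illustrative, not the actual implementation.

```cpp
#include <GL/gl.h>

// Hypothetical helpers for each path; declarations only, for illustration.
bool tryFastTransfer(GLuint outputTexture, void* cpuBuffer); // pinned/DVP transfer
bool copyCachedFrame(void* cpuBuffer);                       // reuse last cached CPU frame
bool syncReadback(GLuint outputTexture, void* cpuBuffer);    // blocking glReadPixels

// Fallback order described above: fast transfer first, then the cached
// output frame if one exists, then synchronous readback as the safety net.
bool readbackOutputFrame(GLuint outputTexture, void* cpuBuffer)
{
    if (tryFastTransfer(outputTexture, cpuBuffer))
        return true;
    if (copyCachedFrame(cpuBuffer))
        return true;
    return syncReadback(outputTexture, cpuBuffer);
}
```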

Question:

Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?

Result:

  • The test machine reported GL_VENDOR=NVIDIA Corporation and GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2.
  • The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on GL_RENDERER containing Quadro.
  • GL_AMD_pinned_memory was also unavailable.
  • The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
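
For reference, the gate the SDK sample applies can be reproduced with plain OpenGL string queries; the surrounding function is a sketch, and only the vendor/renderer strings and the Quadro check come from this log.

```cpp
#include <GL/gl.h>
#include <cstring>

// Mirror of the SDK sample's gate: NVIDIA DVP is only attempted when
// GL_RENDERER contains "Quadro". On the test machine GL_RENDERER was
// "NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2", so this returns false even
// though GL_VENDOR is NVIDIA.
bool rendererAllowsNvidiaDvp()
{
    const char* vendor   = reinterpret_cast<const char*>(glGetString(GL_VENDOR));
    const char* renderer = reinterpret_cast<const char*>(glGetString(GL_RENDERER));
    if (vendor == nullptr || renderer == nullptr)
        return false;

    return std::strstr(vendor, "NVIDIA") != nullptr
        && std::strstr(renderer, "Quadro") != nullptr;
}
```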

Baseline: async_pbo

Date: 2026-05-11

The observations below were collected while the app was running, after the async queue split counters were added.

Summary:

  • ready queue was pinned at 0 or briefly 1
  • underrun, zero-depth, late, and dropped counts increased continuously
  • renderRequestMs usually sat around 16-25 ms, with occasional larger spikes
  • asyncQueueMs was mostly explained by asyncQueueReadPixelsMs
  • PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read:

The main repeated cost is issuing glReadPixels(..., nullptr) into the PBO. glBufferData, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
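
For context, a minimal sketch of the async PBO readback pattern this baseline measures, collapsed into a single call; the real path keeps a deeper ring and defers the map, and the RGBA format shown here is illustrative.

```cpp
#include <GL/glew.h>   // assumes an extension loader and a current GL context
#include <cstring>

// One async PBO readback collapsed into a single call: orphan the buffer,
// issue glReadPixels into the PBO, fence the transfer, then map and copy.
// On the sampled machine the glReadPixels call itself is where ~9-10 ms per
// frame was spent; orphaning, the fence, the map, and the CPU copy were cheap.
void asyncPboReadback(GLuint pbo, int width, int height, void* cpuDst)
{
    const GLsizeiptr bytes = GLsizeiptr(width) * GLsizeiptr(height) * 4;

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, bytes, nullptr, GL_STREAM_READ); // orphan

    // Format shown here is illustrative; Experiment 4 below changes the real
    // path to read BGRA8 from a matching pack framebuffer.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(20000000)); // 20 ms cap
    glDeleteSync(fence);

    if (void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY))
    {
        std::memcpy(cpuDst, src, size_t(bytes));
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```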

Experiment 1: sync

Status: sampled

Question:

Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?

Expected interpretation:

  • If syncReadMs is lower than asyncQueueReadPixelsMs and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
  • If syncReadMs is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.
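
For reference, the path under test amounts to one blocking glReadPixels per output frame, roughly as sketched below; the size and format arguments are illustrative.

```cpp
#include <GL/gl.h>

// Direct synchronous readback: with no pixel-pack buffer bound, the call
// blocks until rendering has finished and the pixels are in cpuDst.
void syncReadbackFrame(int width, int height, void* cpuDst)
{
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, cpuDst);
}
```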

Results:

Date: 2026-05-11

Summary:

  • ready queue remained pinned at 0
  • underrun, zero-depth, late, and dropped counts continued increasing
  • asyncQueueMs and async readback counters were 0, confirming the experiment mode was active
  • direct syncReadMs was generally worse than the baseline PBO asyncQueueReadPixelsMs

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read:

Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.

Experiment 2: cached_only

Status: sampled

Question:

If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

  • If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
  • If ready depth still stays near 0, the bottleneck is elsewhere in scheduling, frame acquisition, queueing, or DeckLink handoff.
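
A sketch of the behaviour being tested, assuming a hypothetical frame cache; only the single bootstrap readback followed by cached copies is taken from this log.

```cpp
#include <GL/gl.h>
#include <cstring>
#include <vector>

// Cached-only behaviour: one blocking readback to fill the cache, then every
// later output frame is a plain CPU copy of that same image, with no further
// GPU readback. Not visually correct for live motion.
class CachedOnlyOutput
{
public:
    void fillOutputFrame(int width, int height, void* cpuDst)
    {
        const size_t bytes = size_t(width) * size_t(height) * 4;
        if (cache_.empty())
        {
            // Bootstrap readback (the run's syncFallbackCount stays at 1).
            cache_.resize(bytes);
            glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, cache_.data());
        }
        // Ongoing frames (cachedFallbackCount keeps increasing).
        std::memcpy(cpuDst, cache_.data(), bytes);
    }

private:
    std::vector<unsigned char> cache_;
};
```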

Results:

Date: 2026-05-11

User-visible result:

  • DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

  • renderRequestMs dropped to roughly 1-3 ms.
  • cachedCopyMs was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
  • asyncQueueMs, asyncQueueReadPixelsMs, syncReadMs, fence wait, map, and async copy were 0 after bootstrap.
  • syncFallbackCount stayed at 1, confirming one bootstrap readback.
  • cachedFallbackCount increased continuously, confirming ongoing frames were served from cached CPU memory.
  • late and dropped counts were 0 during the sampled run.
  • internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read:

Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.

The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
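
One standard way to avoid a blocking readback on every output frame is a deferred-map PBO ring, sketched below; the depth and names are illustrative, not the project's implementation.

```cpp
#include <GL/glew.h>
#include <array>
#include <cstring>

// Deferred-map PBO ring: each output frame issues a readback into one slot
// and maps the slot written two frames earlier, so no frame waits on its
// own transfer.
class PboRing
{
public:
    static constexpr int kDepth = 3;

    void init(int width, int height)
    {
        bytes_ = GLsizeiptr(width) * GLsizeiptr(height) * 4;
        glGenBuffers(kDepth, pbos_.data());
        for (GLuint pbo : pbos_)
        {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
            glBufferData(GL_PIXEL_PACK_BUFFER, bytes_, nullptr, GL_STREAM_READ);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    // Returns true once an older frame has been copied into cpuDst.
    bool issueAndCollect(int width, int height, void* cpuDst)
    {
        // Non-blocking issue into the current slot.
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos_[head_]);
        glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, nullptr);

        bool copied = false;
        if (issued_ >= kDepth - 1)
        {
            // Oldest pending slot; usually complete by now, and glMapBuffer
            // blocks only if it is not.
            const int tail = (head_ + 1) % kDepth;
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos_[tail]);
            if (void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY))
            {
                std::memcpy(cpuDst, src, size_t(bytes_));
                glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
                copied = true;
            }
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

        head_ = (head_ + 1) % kDepth;
        ++issued_;
        return copied;
    }

private:
    std::array<GLuint, kDepth> pbos_{};
    GLsizeiptr bytes_ = 0;
    int head_ = 0;
    long issued_ = 0;
};
```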

Experiment 4: BGRA8 pack framebuffer async readback

Status: sampled

Date: 2026-05-11

Change:

  • The output path now packs/blits the final output into a BGRA8-compatible framebuffer before readback.
  • Async readback reads from the pack framebuffer using GL_BGRA / GL_UNSIGNED_INT_8_8_8_8_REV.
  • The deeper async PBO ring remains active.
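
A sketch of the pack-and-read step described above; framebuffer creation is omitted and the function assumes packFbo already holds a BGRA8 color attachment of the output size.

```cpp
#include <GL/glew.h>

// Blit the rendered output into the BGRA8 pack framebuffer, then issue the
// async readback from it using the format/type pair named above.
void readbackViaPackFramebuffer(GLuint srcFbo, GLuint packFbo, GLuint pbo,
                                int width, int height)
{
    // Copy the final output into the BGRA8-compatible framebuffer.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, srcFbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, packFbo);
    glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);

    // Async read into the PBO ring: no CPU pointer, so the call can return
    // without draining the GPU when the formats match the pack framebuffer.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, packFbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```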

Question:

Does making the GPU output/readback format match the DeckLink BGRA8 scheduling format reduce the driver-side glReadPixels stall?

User-visible result:

  • Long pauses appear to be gone.
  • Playback still stutters, but the stutters look limited to a few frames rather than multi-second freezes.

Telemetry summary:

  • Throughput recovered to roughly real time in the sampled window.
  • Over 5 seconds, the app pushed and popped 305 output frames.
  • asyncQueueReadPixelsMs dropped from the earlier 8-14 ms range to roughly 0.05-0.13 ms in the representative samples.
  • renderMs usually sat around 2-5 ms in the sampled burst.
  • Late and dropped frame counts did not increase during the 5 second delta sample.
  • The ready queue still repeatedly touched 0 and accumulated underruns, which matches the remaining short stutters.

Representative samples:

| readyDepth | renderMs | smoothedRenderMs | drawMs | mapMs | copyMs | asyncQueueReadPixelsMs | queueWaitMs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4.855 | 9.494 | 0.570 | 0.234 | 0.822 | 0.128 | 0.026 |
| 0 | 1.957 | 9.041 | 0.468 | 0.139 | 0.604 | 0.048 | 0.016 |
| 0 | 3.366 | 5.879 | 0.513 | 1.166 | 0.692 | 0.129 | 0.022 |
| 0 | 5.208 | 6.492 | 2.209 | 1.358 | 0.714 | 0.090 | 0.061 |
| 0 | 2.957 | 8.852 | 0.537 | 1.041 | 0.547 | 0.087 | 0.040 |

Five-second delta:

| pushed | popped | ready underruns | zero-depth samples | late delta | dropped delta | scheduled lead |
| --- | --- | --- | --- | --- | --- | --- |
| 305 | 305 | 129 | 671 | 0 | 0 | 20 |

Read:

The main readback stall appears to have been the previous format/path combination, not unavoidable BGRA8 bandwidth. The remaining problem now looks like cadence and buffering: the producer can average real-time throughput again, but the ready queue still runs empty often enough to create visible short stutters.