video-shader-toys/docs/PHASE_7_5_READBACK_EXPERIMENT_LOG.md
Aiden a434a88108, "Performance chasing", 2026-05-11 23:10:45 +10:00
# Phase 7.5 Readback Experiment Log

This log tracks short readback experiments during the proactive playout timing work.

## How To Run

The default debugger launch keeps the current production path:

- Debug LoopThroughWithOpenGLCompositing
- `VST_OUTPUT_READBACK_MODE` unset
- mode: `async_pbo`

Comparison modes are still available:

- `VST_OUTPUT_READBACK_MODE=async_pbo`
  - uses the older PBO/fence readback path

The experiment launches are:

- Debug LoopThroughWithOpenGLCompositing - sync readback experiment
  - `VST_OUTPUT_READBACK_MODE=sync`
  - uses direct synchronous `glReadPixels()` on every output frame
- Debug LoopThroughWithOpenGLCompositing - cached output experiment
  - `VST_OUTPUT_READBACK_MODE=cached_only`
  - uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
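A sketch of how the mode switch might be parsed from the environment. The mode names match the launches above, but the enum and function here are illustrative assumptions, not the app's actual code:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical parsing of VST_OUTPUT_READBACK_MODE (illustrative only).
enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

ReadbackMode readbackModeFromEnv() {
    const char* v = std::getenv("VST_OUTPUT_READBACK_MODE");
    if (v == nullptr) return ReadbackMode::AsyncPbo;  // unset: production path
    if (std::strcmp(v, "sync") == 0) return ReadbackMode::Sync;
    if (std::strcmp(v, "cached_only") == 0) return ReadbackMode::CachedOnly;
    return ReadbackMode::AsyncPbo;                    // unknown values fall back
}
```

Treating unset and unknown values as `async_pbo` keeps the default debugger launch on the production path without any launch-config changes.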

## Experiment 3: fast_transfer

Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`.
- `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to cached output if present, then to synchronous readback as a final safety net.
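The fallback order above can be sketched as a small chain. The three callables are hypothetical stand-ins for the app's real transfer paths; the name and signature are illustrative:

```cpp
#include <functional>

// Sketch of the fast_transfer experiment's fallback order (illustrative).
bool readOutputFrame(const std::function<bool()>& tryFastTransfer,
                     const std::function<bool()>& copyCachedFrame,
                     const std::function<bool()>& syncReadback) {
    if (tryFastTransfer()) return true;   // VideoFrameTransfer::GPUtoCPU path
    if (copyCachedFrame()) return true;   // cached output frame, if one exists
    return syncReadback();                // synchronous readback safety net
}
```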

Question:

Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?

Result:

- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing "Quadro".
- `GL_AMD_pinned_memory` was also unavailable.
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
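The gate can be mirrored as a plain substring check on the GL strings. This is a sketch of the check as described in this log, not the SDK sample's exact code:

```cpp
#include <cstring>

// DVP is only attempted when the renderer advertises a Quadro part
// (per the SDK sample's gate as described above; illustrative check).
bool dvpEligible(const char* vendor, const char* renderer) {
    return vendor && renderer &&
           std::strstr(vendor, "NVIDIA") != nullptr &&
           std::strstr(renderer, "Quadro") != nullptr;
}
```

On this machine the RTX 4060 Ti renderer string fails the check, which is why the sample's DVP path never engages.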

## Baseline: async_pbo

Date: 2026-05-11

Observed while the app was running after adding the async queue split counters.

Summary:

- ready queue was pinned at 0, or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- renderRequestMs usually sat around 16-25 ms, with occasional larger spikes
- asyncQueueMs was mostly explained by asyncQueueReadPixelsMs
- PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
|---|---|---|---|---|---|---|---|---|
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read:

The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
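As a sanity check on the claim that asyncQueueMs is almost entirely the glReadPixels issue, the sampled rows can be tested directly (values copied from the table above):

```cpp
// Per-row share of asyncQueueMs accounted for by asyncQueueReadPixelsMs,
// using the four baseline samples above.
struct Sample { double asyncQueueMs, asyncQueueReadPixelsMs; };

double minReadPixelsShare() {
    const Sample samples[] = {
        {9.018, 9.001}, {8.611, 8.598}, {10.323, 10.311}, {9.416, 9.403},
    };
    double minShare = 1.0;
    for (const Sample& s : samples) {
        double share = s.asyncQueueReadPixelsMs / s.asyncQueueMs;
        if (share < minShare) minShare = share;
    }
    return minShare;  // above 0.99 in every sampled row
}
```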

## Experiment 1: sync

Status: sampled

Question:

Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?

Expected interpretation:

- If syncReadMs is lower than asyncQueueReadPixelsMs and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If syncReadMs is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.

Results:

Date: 2026-05-11

Summary:

- ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts continued increasing
- asyncQueueMs and the async readback counters were 0, confirming the experiment mode was active
- direct syncReadMs was generally worse than the baseline PBO asyncQueueReadPixelsMs

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
|---|---|---|---|---|---|---|
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read:

Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.
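Comparing the sampled averages directly (values copied from the two tables), the synchronous path averaged roughly 22 ms per read against roughly 9.3 ms for the baseline PBO read, more than twice as slow:

```cpp
#include <cstddef>

// Mean of n doubles; used to compare the two sampled runs above.
double average(const double* v, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += v[i];
    return sum / static_cast<double>(n);
}

// syncReadMs from the sync experiment, asyncQueueReadPixelsMs from baseline.
const double syncReadMs[] = {23.122, 25.538, 23.608, 15.278, 21.933};
const double pboReadPixelsMs[] = {9.001, 8.598, 10.311, 9.403};
```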

## Experiment 2: cached_only

Status: sampled

Question:

If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere: in scheduling, frame acquisition, queueing, or DeckLink handoff.

Results:

Date: 2026-05-11

User-visible result:

- DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

- renderRequestMs dropped to roughly 1-3 ms.
- cachedCopyMs was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- asyncQueueMs, asyncQueueReadPixelsMs, syncReadMs, fence wait, map, and async copy were all 0 after bootstrap.
- syncFallbackCount stayed at 1, confirming one bootstrap readback.
- cachedFallbackCount increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read:

Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.

The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
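One direction consistent with that recommendation is a multi-buffered readback ring: each frame's glReadPixels is issued into one slot while the slot filled several frames earlier is mapped, so no frame blocks on its own transfer. An index-only sketch, illustrative rather than the app's code (real slots would hold GL PBO handles):

```cpp
// N-deep readback ring indices. On frame f, issue into issueSlot(f) and
// map mapSlot(f), whose transfer was started kDepth - 1 frames earlier
// and has had that long to complete without stalling the producer.
struct ReadbackRing {
    static constexpr long kDepth = 3;  // frames of transfer latency tolerated
    long issueSlot(long frame) const { return frame % kDepth; }
    long mapSlot(long frame) const { return (frame - (kDepth - 1)) % kDepth; }
    bool ready(long frame) const { return frame >= kDepth - 1; }  // bootstrap guard
};
```

The trade is kDepth - 1 frames of extra output latency for a readback that never blocks, which matches the cached_only finding that the per-frame blocking read, not the copy, is the collapse's direct cause.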