# Phase 7.5 Readback Experiment Log
This log tracks short readback experiments during the proactive playout timing work.
## How To Run
The default debugger launch keeps the current production path:
- `Debug LoopThroughWithOpenGLCompositing` - `VST_OUTPUT_READBACK_MODE` unset - mode: `async_pbo`
Comparison modes are still available:
- `VST_OUTPUT_READBACK_MODE=async_pbo` - uses the older PBO/fence readback path
The experiment launches are:
- `Debug LoopThroughWithOpenGLCompositing - sync readback experiment` - `VST_OUTPUT_READBACK_MODE=sync` - uses direct synchronous `glReadPixels()` every output frame
- `Debug LoopThroughWithOpenGLCompositing - cached output experiment` - `VST_OUTPUT_READBACK_MODE=cached_only` - uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback
The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
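For reference, a minimal sketch of how a launcher could map `VST_OUTPUT_READBACK_MODE` to an internal mode. The enum and function names here are hypothetical, not the project's actual identifiers:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical enum mirroring the three documented modes.
enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

ReadbackMode selectReadbackMode()
{
    const char* mode = std::getenv("VST_OUTPUT_READBACK_MODE");
    if (mode == nullptr)
        return ReadbackMode::AsyncPbo;        // unset: production path
    if (std::strcmp(mode, "sync") == 0)
        return ReadbackMode::Sync;            // sync readback experiment
    if (std::strcmp(mode, "cached_only") == 0)
        return ReadbackMode::CachedOnly;      // cached output experiment
    return ReadbackMode::AsyncPbo;            // "async_pbo" and anything else
}
```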
## Experiment 3: fast_transfer
Status: removed from active code after hardware sample
Date: 2026-05-11
Change:
- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`. `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to cached output if present, then to synchronous readback as a safety fallback.
Question:
Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?
Result:
- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing `Quadro`, and `GL_AMD_pinned_memory` was also unavailable, so neither fast-transfer backend applied (see the sketch below).
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
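For context, a sketch of the kind of capability gate the SDK sample applies, assuming the gating described above; the helper name is illustrative, the real sample code differs, and `glGetString(GL_EXTENSIONS)` assumes a compatibility-profile context:

```cpp
#include <GL/gl.h>
#include <cstring>

// True only when one of the two fast-transfer backends is plausible:
// NVIDIA DVP gated on a Quadro renderer string, or the AMD path via
// the GL_AMD_pinned_memory extension.
bool fastTransferPlausible()
{
    const char* renderer =
        reinterpret_cast<const char*>(glGetString(GL_RENDERER));
    const char* extensions =
        reinterpret_cast<const char*>(glGetString(GL_EXTENSIONS));

    const bool dvp    = renderer && std::strstr(renderer, "Quadro");
    const bool pinned = extensions &&
                        std::strstr(extensions, "GL_AMD_pinned_memory");
    return dvp || pinned;   // an RTX 4060 Ti fails both checks
}
```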
## Baseline: async_pbo
Date: 2026-05-11
Observed while the app was running after adding the async queue split counters.
Summary:
- ready queue was pinned at 0 or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- `renderRequestMs` usually sat around 16-25 ms, with occasional larger spikes
- `asyncQueueMs` was mostly explained by `asyncQueueReadPixelsMs`
- PBO allocation/orphaning was effectively 0 ms
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
|---|---|---|---|---|---|---|---|---|
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |
Initial read:
The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
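To make the counter mapping concrete, a condensed sketch of the async_pbo stages, assuming a loaded GL 3.2+ context (e.g. via GLEW); names and structure are illustrative, not the project's actual code:

```cpp
#include <GL/glew.h>
#include <cstring>

GLuint pbo;   // created once with glGenBuffers(1, &pbo)

void asyncPboReadback(int width, int height, void* dst)
{
    const size_t bytes = size_t(width) * height * 4;   // BGRA

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, bytes, nullptr,
                 GL_STREAM_READ);                      // orphan: ~0 ms observed

    // asyncQueueReadPixelsMs: enqueue the GPU->PBO transfer, writing at
    // offset 0 in the bound PBO. This is the ~9-10 ms cost in the samples.
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);

    // Fence creation and wait (small in the baseline samples).
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                     GLuint64(1000000000));            // 1 s cap, in ns
    glDeleteSync(fence);

    // mapMs / copyMs: map the PBO and copy into the output frame buffer.
    if (void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
        std::memcpy(dst, src, bytes);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```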
## Experiment 1: sync
Status: sampled
Question:
Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?
Expected interpretation:
- If `syncReadMs` is lower than `asyncQueueReadPixelsMs` and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If `syncReadMs` is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.
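For contrast with the PBO sketch above, a sketch of what the sync mode measures as `syncReadMs`: one blocking `glReadPixels` straight into client memory, with no PBO bound. The timing helper and names are illustrative:

```cpp
#include <GL/glew.h>
#include <chrono>

// Returns the blocking readback time in milliseconds.
double syncReadback(int width, int height, void* dst)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);   // ensure a client-memory read

    const auto t0 = std::chrono::steady_clock::now();
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, dst);
    const auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```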
Results:
Date: 2026-05-11
Summary:
- ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts continued increasing
- `asyncQueueMs` and the async readback counters were 0, confirming the experiment mode was active
- direct `syncReadMs` was generally worse than the baseline PBO `asyncQueueReadPixelsMs`
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
|---|---|---|---|---|---|---|
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |
Read:
Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.
## Experiment 2: cached_only
Status: sampled
Question:
If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?
Expected interpretation:
- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere: scheduling, frame acquisition, queueing, or the DeckLink handoff.
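A sketch of the cached_only flow under test, assuming the single-bootstrap behaviour described in How To Run; the struct and names are illustrative:

```cpp
#include <GL/glew.h>
#include <cstring>
#include <vector>

struct CachedOutput {
    std::vector<unsigned char> pixels;   // last real readback, BGRA
    bool bootstrapped = false;
};

void produceOutputFrame(CachedOutput& cache, int width, int height, void* dst)
{
    const size_t bytes = size_t(width) * height * 4;

    if (!cache.bootstrapped) {
        // The single bootstrap readback (syncFallbackCount stays at 1).
        cache.pixels.resize(bytes);
        glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE,
                     cache.pixels.data());
        cache.bootstrapped = true;
    }

    // Every later frame: no GPU readback, just a CPU copy (cachedCopyMs,
    // with cachedFallbackCount incrementing per frame).
    std::memcpy(dst, cache.pixels.data(), bytes);
}
```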
Results:
Date: 2026-05-11
User-visible result:
- DeckLink reported a healthy 5-frame buffer.
Telemetry summary:
- `renderRequestMs` dropped to roughly 1-3 ms.
- `cachedCopyMs` was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- `asyncQueueMs`, `asyncQueueReadPixelsMs`, `syncReadMs`, fence wait, map, and async copy were 0 after bootstrap.
- `syncFallbackCount` stayed at 1, confirming one bootstrap readback.
- `cachedFallbackCount` increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |
Read:
Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.
The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
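One candidate shape for that fix, offered as an assumption rather than the chosen design: a small PBO ring where each frame enqueues its readback but maps the PBO issued `kRingDepth - 1` frames earlier, so no frame blocks on its own transfer. This trades `kRingDepth - 1` frames of latency for throughput, and it only helps if the sampled ~9 ms is driver wait rather than unavoidable submission cost. Names are illustrative:

```cpp
#include <GL/glew.h>
#include <cstdint>
#include <cstring>

constexpr int kRingDepth = 3;
GLuint ring[kRingDepth];   // created once with glGenBuffers(kRingDepth, ring)
uint64_t frameIndex = 0;

// Enqueues a readback of the current frame and, once the ring is primed,
// returns the pixels of the frame issued kRingDepth-1 frames ago in 'dst'.
bool ringReadback(int width, int height, void* dst)
{
    const size_t bytes = size_t(width) * height * 4;   // BGRA

    // Enqueue this frame's transfer into its ring slot (orphan first).
    glBindBuffer(GL_PIXEL_PACK_BUFFER, ring[frameIndex % kRingDepth]);
    glBufferData(GL_PIXEL_PACK_BUFFER, bytes, nullptr, GL_STREAM_READ);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);

    bool haveFrame = false;
    if (frameIndex + 1 >= kRingDepth) {
        // Map the oldest in-flight PBO; its transfer has had kRingDepth-1
        // frames to complete, so the map should rarely stall.
        glBindBuffer(GL_PIXEL_PACK_BUFFER,
                     ring[(frameIndex + 1) % kRingDepth]);
        if (void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
            std::memcpy(dst, src, bytes);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
            haveFrame = true;
        }
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frameIndex;
    return haveFrame;
}
```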