Phase 7.5 Readback Experiment Log

This log tracks short readback experiments during the proactive playout timing work.

How To Run

The default debugger launch keeps the current production path:

  • Debug LoopThroughWithOpenGLCompositing
  • VST_OUTPUT_READBACK_MODE unset
  • mode: async_pbo

Comparison modes are still available:

  • VST_OUTPUT_READBACK_MODE=async_pbo
      • uses the older PBO/fence readback path

The experiment launches are:

  • Debug LoopThroughWithOpenGLCompositing - sync readback experiment
      • VST_OUTPUT_READBACK_MODE=sync
      • uses direct synchronous glReadPixels() every output frame
  • Debug LoopThroughWithOpenGLCompositing - cached output experiment
      • VST_OUTPUT_READBACK_MODE=cached_only
      • uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
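
For reference, a minimal sketch of how the mode switch could be read at startup; the enum and function names are hypothetical, and only the VST_OUTPUT_READBACK_MODE variable and its values come from this log.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical mode enum; names chosen for this sketch only.
enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

// Read VST_OUTPUT_READBACK_MODE and fall back to the production
// async PBO path when the variable is unset or unrecognised.
ReadbackMode selectReadbackMode()
{
    const char* raw = std::getenv("VST_OUTPUT_READBACK_MODE");
    const std::string value = raw ? raw : "";

    if (value == "sync")
        return ReadbackMode::Sync;
    if (value == "cached_only")
        return ReadbackMode::CachedOnly;
    return ReadbackMode::AsyncPbo; // default, equivalent to "async_pbo"
}
```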

Experiment 3: fast_transfer

Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

  • DeckLink output frames are now created with CreateVideoFrameWithBuffer().
  • Output frame buffers are owned by PinnedMemoryAllocator.
  • VideoIOOutputFrame carries a texture-transfer callback.
  • The test branch changed the default render readback path to try VideoFrameTransfer::GPUtoCPU against the output texture for BGRA output.
  • If fast transfer is unavailable or fails, the code falls back to cached output if present, then synchronous readback as a safety fallback.
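
A rough sketch of that fallback order, assuming hypothetical helper functions for each path; it is illustrative, not the actual implementation.

```cpp
#include <GL/gl.h>

// Hypothetical helpers for each path; declarations only, for illustration.
bool tryFastTransfer(GLuint outputTexture, void* cpuBuffer); // pinned/DVP transfer
bool copyCachedFrame(void* cpuBuffer);                       // reuse last cached CPU frame
bool syncReadback(GLuint outputTexture, void* cpuBuffer);    // blocking glReadPixels

// Fallback order described above: fast transfer first, then the cached
// output frame if one exists, then synchronous readback as the safety net.
bool readbackOutputFrame(GLuint outputTexture, void* cpuBuffer)
{
    if (tryFastTransfer(outputTexture, cpuBuffer))
        return true;
    if (copyCachedFrame(cpuBuffer))
        return true;
    return syncReadback(outputTexture, cpuBuffer);
}
```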

Question:

Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?

Result:

  • The test machine reported GL_VENDOR=NVIDIA Corporation and GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2.
  • The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on GL_RENDERER containing Quadro.
  • GL_AMD_pinned_memory was also unavailable.
  • The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
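
For reference, the gate the SDK sample applies can be reproduced with plain OpenGL string queries; the surrounding function is a sketch, and only the vendor/renderer strings and the Quadro check come from this log.

```cpp
#include <GL/gl.h>
#include <cstring>

// Mirror of the SDK sample's gate: NVIDIA DVP is only attempted when
// GL_RENDERER contains "Quadro". On the test machine GL_RENDERER was
// "NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2", so this returns false even
// though GL_VENDOR is NVIDIA.
bool rendererAllowsNvidiaDvp()
{
    const char* vendor   = reinterpret_cast<const char*>(glGetString(GL_VENDOR));
    const char* renderer = reinterpret_cast<const char*>(glGetString(GL_RENDERER));
    if (vendor == nullptr || renderer == nullptr)
        return false;

    return std::strstr(vendor, "NVIDIA") != nullptr
        && std::strstr(renderer, "Quadro") != nullptr;
}
```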

Baseline: async_pbo

Date: 2026-05-11

The observations below were collected while the app was running, after the async queue split counters were added.

Summary:

  • ready queue was pinned at 0 or briefly 1
  • underrun, zero-depth, late, and dropped counts increased continuously
  • renderRequestMs usually sat around 16-25 ms, with occasional larger spikes
  • asyncQueueMs was mostly explained by asyncQueueReadPixelsMs
  • PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read:

The main repeated cost is issuing glReadPixels(..., nullptr) into the PBO. glBufferData, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
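
For context, a minimal sketch of the async PBO readback pattern this baseline measures, collapsed into a single call; the real path keeps a deeper ring and defers the map, and the RGBA format shown here is illustrative.

```cpp
#include <GL/glew.h>   // assumes an extension loader and a current GL context
#include <cstring>

// One async PBO readback collapsed into a single call: orphan the buffer,
// issue glReadPixels into the PBO, fence the transfer, then map and copy.
// On the sampled machine the glReadPixels call itself is where ~9-10 ms per
// frame was spent; orphaning, the fence, the map, and the CPU copy were cheap.
void asyncPboReadback(GLuint pbo, int width, int height, void* cpuDst)
{
    const GLsizeiptr bytes = GLsizeiptr(width) * GLsizeiptr(height) * 4;

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, bytes, nullptr, GL_STREAM_READ); // orphan

    // Format shown here is illustrative; Experiment 4 below changes the real
    // path to read BGRA8 from a matching pack framebuffer.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(20000000)); // 20 ms cap
    glDeleteSync(fence);

    if (void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY))
    {
        std::memcpy(cpuDst, src, size_t(bytes));
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```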

Experiment 1: sync

Status: sampled

Question:

Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?

Expected interpretation:

  • If syncReadMs is lower than asyncQueueReadPixelsMs and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
  • If syncReadMs is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.
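
For reference, the path under test amounts to one blocking glReadPixels per output frame, roughly as sketched below; the size and format arguments are illustrative.

```cpp
#include <GL/gl.h>

// Direct synchronous readback: with no pixel-pack buffer bound, the call
// blocks until rendering has finished and the pixels are in cpuDst.
void syncReadbackFrame(int width, int height, void* cpuDst)
{
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, cpuDst);
}
```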

Results:

Date: 2026-05-11

Summary:

  • ready queue remained pinned at 0
  • underrun, zero-depth, late, and dropped counts continued increasing
  • asyncQueueMs and async readback counters were 0, confirming the experiment mode was active
  • direct syncReadMs was generally worse than the baseline PBO asyncQueueReadPixelsMs

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read:

Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.

Experiment 2: cached_only

Status: sampled

Question:

If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

  • If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
  • If ready depth still stays near 0, the bottleneck is elsewhere in scheduling, frame acquisition, queueing, or DeckLink handoff.
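
A sketch of the behaviour being tested, assuming a hypothetical frame cache; only the single bootstrap readback followed by cached copies is taken from this log.

```cpp
#include <GL/gl.h>
#include <cstring>
#include <vector>

// Cached-only behaviour: one blocking readback to fill the cache, then every
// later output frame is a plain CPU copy of that same image, with no further
// GPU readback. Not visually correct for live motion.
class CachedOnlyOutput
{
public:
    void fillOutputFrame(int width, int height, void* cpuDst)
    {
        const size_t bytes = size_t(width) * size_t(height) * 4;
        if (cache_.empty())
        {
            // Bootstrap readback (the run's syncFallbackCount stays at 1).
            cache_.resize(bytes);
            glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, cache_.data());
        }
        // Ongoing frames (cachedFallbackCount keeps increasing).
        std::memcpy(cpuDst, cache_.data(), bytes);
    }

private:
    std::vector<unsigned char> cache_;
};
```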

Results:

Date: 2026-05-11

User-visible result:

  • DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

  • renderRequestMs dropped to roughly 1-3 ms.
  • cachedCopyMs was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
  • asyncQueueMs, asyncQueueReadPixelsMs, syncReadMs, fence wait, map, and async copy were 0 after bootstrap.
  • syncFallbackCount stayed at 1, confirming one bootstrap readback.
  • cachedFallbackCount increased continuously, confirming ongoing frames were served from cached CPU memory.
  • late and dropped counts were 0 during the sampled run.
  • internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read:

Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.

The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
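
One standard way to avoid a blocking readback on every output frame is a deferred-map PBO ring, sketched below; the depth and names are illustrative, not the project's implementation.

```cpp
#include <GL/glew.h>
#include <array>
#include <cstring>

// Deferred-map PBO ring: each output frame issues a readback into one slot
// and maps the slot written two frames earlier, so no frame waits on its
// own transfer.
class PboRing
{
public:
    static constexpr int kDepth = 3;

    void init(int width, int height)
    {
        bytes_ = GLsizeiptr(width) * GLsizeiptr(height) * 4;
        glGenBuffers(kDepth, pbos_.data());
        for (GLuint pbo : pbos_)
        {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
            glBufferData(GL_PIXEL_PACK_BUFFER, bytes_, nullptr, GL_STREAM_READ);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    // Returns true once an older frame has been copied into cpuDst.
    bool issueAndCollect(int width, int height, void* cpuDst)
    {
        // Non-blocking issue into the current slot.
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos_[head_]);
        glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, nullptr);

        bool copied = false;
        if (issued_ >= kDepth - 1)
        {
            // Oldest pending slot; usually complete by now, and glMapBuffer
            // blocks only if it is not.
            const int tail = (head_ + 1) % kDepth;
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos_[tail]);
            if (void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY))
            {
                std::memcpy(cpuDst, src, size_t(bytes_));
                glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
                copied = true;
            }
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

        head_ = (head_ + 1) % kDepth;
        ++issued_;
        return copied;
    }

private:
    std::array<GLuint, kDepth> pbos_{};
    GLsizeiptr bytes_ = 0;
    int head_ = 0;
    long issued_ = 0;
};
```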

Experiment 4: BGRA8 pack framebuffer async readback

Status: sampled

Date: 2026-05-11

Change:

  • The output path now packs/blits the final output into a BGRA8-compatible framebuffer before readback.
  • Async readback reads from the pack framebuffer using GL_BGRA / GL_UNSIGNED_INT_8_8_8_8_REV.
  • The deeper async PBO ring remains active.
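
A sketch of the pack-and-read step described above; framebuffer creation is omitted and the function assumes packFbo already holds a BGRA8 color attachment of the output size.

```cpp
#include <GL/glew.h>

// Blit the rendered output into the BGRA8 pack framebuffer, then issue the
// async readback from it using the format/type pair named above.
void readbackViaPackFramebuffer(GLuint srcFbo, GLuint packFbo, GLuint pbo,
                                int width, int height)
{
    // Copy the final output into the BGRA8-compatible framebuffer.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, srcFbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, packFbo);
    glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);

    // Async read into the PBO ring: no CPU pointer, so the call can return
    // without draining the GPU when the formats match the pack framebuffer.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, packFbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```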

Question:

Does making the GPU output/readback format match the DeckLink BGRA8 scheduling format reduce the driver-side glReadPixels stall?

User-visible result:

  • Long pauses appear to be gone.
  • Playback still stutters, but the stutters look limited to a few frames rather than multi-second freezes.

Telemetry summary:

  • Throughput recovered to roughly real time in the sampled window.
  • Over 5 seconds, the app pushed and popped 305 output frames.
  • asyncQueueReadPixelsMs dropped from the earlier 8-14 ms range to roughly 0.05-0.13 ms in the representative samples.
  • renderMs usually sat around 2-5 ms in the sampled burst.
  • Late and dropped frame counts did not increase during the 5 second delta sample.
  • The ready queue still repeatedly touched 0 and accumulated underruns, which matches the remaining short stutters.

Representative samples:

| readyDepth | renderMs | smoothedRenderMs | drawMs | mapMs | copyMs | asyncQueueReadPixelsMs | queueWaitMs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4.855 | 9.494 | 0.570 | 0.234 | 0.822 | 0.128 | 0.026 |
| 0 | 1.957 | 9.041 | 0.468 | 0.139 | 0.604 | 0.048 | 0.016 |
| 0 | 3.366 | 5.879 | 0.513 | 1.166 | 0.692 | 0.129 | 0.022 |
| 0 | 5.208 | 6.492 | 2.209 | 1.358 | 0.714 | 0.090 | 0.061 |
| 0 | 2.957 | 8.852 | 0.537 | 1.041 | 0.547 | 0.087 | 0.040 |

Five-second delta:

| pushed | popped | ready underruns | zero-depth samples | late delta | dropped delta | scheduled lead |
| --- | --- | --- | --- | --- | --- | --- |
| 305 | 305 | 129 | 671 | 0 | 0 | 20 |

Read:

The main readback stall appears to have been the previous format/path combination, not unavoidable BGRA8 bandwidth. The remaining problem now looks like cadence and buffering: the producer can average real-time throughput again, but the ready queue still runs empty often enough to create visible short stutters.