video-shader-toys/docs/PHASE_7_5_READBACK_EXPERIMENT_LOG.md
Aiden a434a88108, "Performance chasing", 2026-05-11 23:10:45 +10:00
# Phase 7.5 Readback Experiment Log

This log tracks short readback experiments during the proactive playout timing work.

## How To Run

The default debugger launch keeps the current production path:

- Debug LoopThroughWithOpenGLCompositing
- `VST_OUTPUT_READBACK_MODE` unset
- mode: `async_pbo`

Comparison modes are still available:

- `VST_OUTPUT_READBACK_MODE=async_pbo`
  - uses the older PBO/fence readback path

The experiment launches are:

- Debug LoopThroughWithOpenGLCompositing - sync readback experiment
  - `VST_OUTPUT_READBACK_MODE=sync`
  - uses direct synchronous `glReadPixels()` on every output frame
- Debug LoopThroughWithOpenGLCompositing - cached output experiment
  - `VST_OUTPUT_READBACK_MODE=cached_only`
  - uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
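A sketch of how the mode switch might be parsed from the environment. The mode names match the launches above, but the enum and function here are illustrative assumptions, not the app's actual code:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical parsing of VST_OUTPUT_READBACK_MODE (illustrative only).
enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

ReadbackMode readbackModeFromEnv() {
    const char* v = std::getenv("VST_OUTPUT_READBACK_MODE");
    if (v == nullptr) return ReadbackMode::AsyncPbo;  // unset: production path
    if (std::strcmp(v, "sync") == 0) return ReadbackMode::Sync;
    if (std::strcmp(v, "cached_only") == 0) return ReadbackMode::CachedOnly;
    return ReadbackMode::AsyncPbo;                    // unknown values fall back
}
```

Treating unset and unknown values as `async_pbo` keeps the default debugger launch on the production path without any launch-config changes.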

## Experiment 3: fast_transfer

Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`.
- `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to cached output if present, then to synchronous readback as a final safety net.
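The fallback order above can be sketched as a small chain. The three callables are hypothetical stand-ins for the app's real transfer paths; the name and signature are illustrative:

```cpp
#include <functional>

// Sketch of the fast_transfer experiment's fallback order (illustrative).
bool readOutputFrame(const std::function<bool()>& tryFastTransfer,
                     const std::function<bool()>& copyCachedFrame,
                     const std::function<bool()>& syncReadback) {
    if (tryFastTransfer()) return true;   // VideoFrameTransfer::GPUtoCPU path
    if (copyCachedFrame()) return true;   // cached output frame, if one exists
    return syncReadback();                // synchronous readback safety net
}
```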

Question:

Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?

Result:

- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing "Quadro".
- `GL_AMD_pinned_memory` was also unavailable.
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
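The gate can be mirrored as a plain substring check on the GL strings. This is a sketch of the check as described in this log, not the SDK sample's exact code:

```cpp
#include <cstring>

// DVP is only attempted when the renderer advertises a Quadro part
// (per the SDK sample's gate as described above; illustrative check).
bool dvpEligible(const char* vendor, const char* renderer) {
    return vendor && renderer &&
           std::strstr(vendor, "NVIDIA") != nullptr &&
           std::strstr(renderer, "Quadro") != nullptr;
}
```

On this machine the RTX 4060 Ti renderer string fails the check, which is why the sample's DVP path never engages.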

## Baseline: async_pbo

Date: 2026-05-11

Observed while the app was running after adding the async queue split counters.

Summary:

- ready queue was pinned at 0, or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- renderRequestMs usually sat around 16-25 ms, with occasional larger spikes
- asyncQueueMs was mostly explained by asyncQueueReadPixelsMs
- PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
|---|---|---|---|---|---|---|---|---|
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read:

The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
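As a sanity check on the claim that asyncQueueMs is almost entirely the glReadPixels issue, the sampled rows can be tested directly (values copied from the table above):

```cpp
// Per-row share of asyncQueueMs accounted for by asyncQueueReadPixelsMs,
// using the four baseline samples above.
struct Sample { double asyncQueueMs, asyncQueueReadPixelsMs; };

double minReadPixelsShare() {
    const Sample samples[] = {
        {9.018, 9.001}, {8.611, 8.598}, {10.323, 10.311}, {9.416, 9.403},
    };
    double minShare = 1.0;
    for (const Sample& s : samples) {
        double share = s.asyncQueueReadPixelsMs / s.asyncQueueMs;
        if (share < minShare) minShare = share;
    }
    return minShare;  // above 0.99 in every sampled row
}
```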

## Experiment 1: sync

Status: sampled

Question:

Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?

Expected interpretation:

- If syncReadMs is lower than asyncQueueReadPixelsMs and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If syncReadMs is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.

Results:

Date: 2026-05-11

Summary:

- ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts continued increasing
- asyncQueueMs and the async readback counters were 0, confirming the experiment mode was active
- direct syncReadMs was generally worse than the baseline PBO asyncQueueReadPixelsMs

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
|---|---|---|---|---|---|---|
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read:

Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.
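Comparing the sampled averages directly (values copied from the two tables), the synchronous path averaged roughly 22 ms per read against roughly 9.3 ms for the baseline PBO read, more than twice as slow:

```cpp
#include <cstddef>

// Mean of n doubles; used to compare the two sampled runs above.
double average(const double* v, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += v[i];
    return sum / static_cast<double>(n);
}

// syncReadMs from the sync experiment, asyncQueueReadPixelsMs from baseline.
const double syncReadMs[] = {23.122, 25.538, 23.608, 15.278, 21.933};
const double pboReadPixelsMs[] = {9.001, 8.598, 10.311, 9.403};
```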

## Experiment 2: cached_only

Status: sampled

Question:

If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere: in scheduling, frame acquisition, queueing, or DeckLink handoff.

Results:

Date: 2026-05-11

User-visible result:

- DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

- renderRequestMs dropped to roughly 1-3 ms.
- cachedCopyMs was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- asyncQueueMs, asyncQueueReadPixelsMs, syncReadMs, fence wait, map, and async copy were all 0 after bootstrap.
- syncFallbackCount stayed at 1, confirming one bootstrap readback.
- cachedFallbackCount increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read:

Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.

The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
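One direction consistent with that recommendation is a multi-buffered readback ring: each frame's glReadPixels is issued into one slot while the slot filled several frames earlier is mapped, so no frame blocks on its own transfer. An index-only sketch, illustrative rather than the app's code (real slots would hold GL PBO handles):

```cpp
// N-deep readback ring indices. On frame f, issue into issueSlot(f) and
// map mapSlot(f), whose transfer was started kDepth - 1 frames earlier
// and has had that long to complete without stalling the producer.
struct ReadbackRing {
    static constexpr long kDepth = 3;  // frames of transfer latency tolerated
    long issueSlot(long frame) const { return frame % kDepth; }
    long mapSlot(long frame) const { return (frame - (kDepth - 1)) % kDepth; }
    bool ready(long frame) const { return frame >= kDepth - 1; }  // bootstrap guard
};
```

The trade is kDepth - 1 frames of extra output latency for a readback that never blocks, which matches the cached_only finding that the per-frame blocking read, not the copy, is the collapse's direct cause.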