# Phase 7.5 Readback Experiment Log
This log tracks short readback experiments during the proactive playout timing work.
## How To Run

The default debugger launch keeps the current production path:

- `Debug LoopThroughWithOpenGLCompositing`
  - `VST_OUTPUT_READBACK_MODE` unset
  - mode: `async_pbo`

Comparison modes are still available:

- `VST_OUTPUT_READBACK_MODE=async_pbo`
  - uses the older PBO/fence readback path

The experiment launches are:

- `Debug LoopThroughWithOpenGLCompositing - sync readback experiment`
  - `VST_OUTPUT_READBACK_MODE=sync`
  - uses a direct synchronous `glReadPixels()` on every output frame
- `Debug LoopThroughWithOpenGLCompositing - cached output experiment`
  - `VST_OUTPUT_READBACK_MODE=cached_only`
  - uses one bootstrap synchronous readback, then copies the cached output frame with no ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
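As a minimal sketch of how such a mode switch can be read, the selection logic might look like this; `ReadbackMode` and `readbackModeFromEnv` are illustrative names, not the app's actual types:

```cpp
#include <string>

enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

// Map the VST_OUTPUT_READBACK_MODE value to a readback mode, defaulting to
// the production async_pbo path when the variable is unset or unrecognized.
ReadbackMode readbackModeFromEnv(const char* value) {
    if (value == nullptr) return ReadbackMode::AsyncPbo;  // unset -> default
    const std::string v(value);
    if (v == "sync") return ReadbackMode::Sync;
    if (v == "cached_only") return ReadbackMode::CachedOnly;
    return ReadbackMode::AsyncPbo;  // "async_pbo" and anything else
}
```

In the app the argument would come from `std::getenv("VST_OUTPUT_READBACK_MODE")`; passing the pointer in keeps the helper testable.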
## Experiment 3: fast_transfer
Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`.
- `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to the cached output frame if present, then to synchronous readback as a safety net.

Question:

Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually invalid cached-only shortcut?

Result:

- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing `Quadro`, so DVP was unavailable here.
- `GL_AMD_pinned_memory` was also unavailable.
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
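
The gate that blocked fast transfer here amounts to a string check on the renderer plus an extension probe; `fastTransferAvailable` is a hypothetical helper mirroring the SDK sample's behavior, not code from this repository:

```cpp
#include <string>

// The SDK sample enables NVIDIA DVP only when GL_RENDERER contains "Quadro";
// otherwise a pinned-memory path would need GL_AMD_pinned_memory.
bool fastTransferAvailable(const std::string& renderer, bool hasAmdPinnedMemory) {
    if (renderer.find("Quadro") != std::string::npos) {
        return true;  // NVIDIA DVP path
    }
    return hasAmdPinnedMemory;  // AMD pinned-memory fallback
}
```

On the sampled machine both checks fail (`GeForce` renderer, no `GL_AMD_pinned_memory`), which is why the branch was removed.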
## Baseline: async_pbo
Date: 2026-05-11

Observed while the app was running, after the async queue split counters were added.

Summary:

- the ready queue was pinned at 0, or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- `renderRequestMs` usually sat around 16-25 ms, with occasional larger spikes
- `asyncQueueMs` was mostly explained by `asyncQueueReadPixelsMs`
- PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read:

The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
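The per-stage millisecond counters above (`drawMs`, `mapMs`, `copyMs`, and so on) can be read as wall-clock accumulators around each instrumented stage. A hedged sketch of that instrumentation, with `StageTimer` as an invented name rather than the real telemetry code:

```cpp
#include <chrono>

// Hypothetical per-stage timer: each instrumented stage accumulates
// wall-clock milliseconds into its own slot, so a sample row in the tables
// above is just a snapshot of these accumulators.
struct StageTimer {
    double ms = 0.0;

    template <typename Fn>
    void measure(Fn&& fn) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start;
        ms += elapsed.count();  // accumulate, reset when a sample is emitted
    }
};
```

Wrapping the `glReadPixels` call in one of these is what isolates `asyncQueueReadPixelsMs` from the surrounding buffer and fence work.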
## Experiment 1: sync
Status: sampled

Question:

Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?

Expected interpretation:

- If `syncReadMs` is lower than `asyncQueueReadPixelsMs` and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If `syncReadMs` is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.

Results:

Date: 2026-05-11

Summary:

- the ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts kept increasing
- `asyncQueueMs` and the async readback counters were 0, confirming the experiment mode was active
- direct `syncReadMs` was generally worse than the baseline PBO `asyncQueueReadPixelsMs`

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read:

Direct synchronous readback does not recover the queue and was slower than the async PBO path in the sampled run. The bottleneck appears to be the GPU-to-CPU readback itself, not PBO orphaning or fence handling.
## Experiment 2: cached_only
Status: sampled

Question:

If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere: scheduling, frame acquisition, queueing, or the DeckLink handoff.
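
The cached_only decision path can be sketched as a single branch; `CachedOnlyState` and `nextFrame` are illustrative names, but the counter behavior mirrors the `syncFallbackCount` / `cachedFallbackCount` telemetry:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hedged sketch of cached_only: one bootstrap synchronous readback populates
// the cache, after which every output frame is served by a CPU copy with no
// further GPU readback.
struct CachedOnlyState {
    std::optional<std::vector<uint8_t>> cachedFrame;
    uint64_t syncFallbackCount = 0;    // bootstrap readbacks (should stay 1)
    uint64_t cachedFallbackCount = 0;  // frames served from the cached copy

    // readPixelsSync stands in for the real synchronous GPU readback.
    template <typename ReadFn>
    std::vector<uint8_t> nextFrame(ReadFn&& readPixelsSync) {
        if (!cachedFrame) {
            cachedFrame = readPixelsSync();  // the only GPU readback
            ++syncFallbackCount;
        } else {
            ++cachedFallbackCount;  // ongoing frames: CPU memory only
        }
        return *cachedFrame;
    }
};
```

A `syncFallbackCount` that stays at 1 while `cachedFallbackCount` climbs is exactly the signature the results below report.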
Results:

Date: 2026-05-11

User-visible result:

- DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

- `renderRequestMs` dropped to roughly 1-3 ms.
- `cachedCopyMs` was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- `asyncQueueMs`, `asyncQueueReadPixelsMs`, `syncReadMs`, fence wait, map, and async copy were all 0 after bootstrap.
- `syncFallbackCount` stayed at 1, confirming a single bootstrap readback.
- `cachedFallbackCount` increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1, even while DeckLink showed a healthy device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read:

Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.

The internal ready queue depth staying low while DeckLink reports a healthy device buffer suggests the ready queue acts as a short staging queue rather than as the full device playout buffer. The next fix should therefore prioritize removing the blocking readback from every output frame over merely increasing internal ready queue depth.
## Experiment 4: BGRA8 pack framebuffer async readback
Status: sampled

Date: 2026-05-11

Change:

- The output path now packs/blits the final output into a BGRA8-compatible framebuffer before readback.
- Async readback reads from the pack framebuffer using `GL_BGRA` / `GL_UNSIGNED_INT_8_8_8_8_REV`.
- The deeper async PBO ring remains active.
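
The matched readback pair can be pinned down as constants; this helper is purely illustrative, with the standard GL enum values inlined so the sketch is self-contained without GL headers:

```cpp
#include <cstdint>
#include <utility>

// Standard OpenGL enum values, defined locally for a header-free sketch.
constexpr uint32_t kGL_BGRA = 0x80E1;
constexpr uint32_t kGL_UNSIGNED_INT_8_8_8_8_REV = 0x8367;

// Hypothetical helper: the glReadPixels format/type pair that matches the
// DeckLink BGRA8 scheduling format, so the driver can transfer the pack
// framebuffer without a per-pixel conversion pass.
std::pair<uint32_t, uint32_t> readbackFormatForBgra8() {
    return {kGL_BGRA, kGL_UNSIGNED_INT_8_8_8_8_REV};
}
```

A mismatched pair (for example `GL_RGBA` / `GL_UNSIGNED_BYTE` against a BGRA target) is the kind of combination that can force the driver into a slow conversion inside `glReadPixels`.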
Question:

Does making the GPU output/readback format match the DeckLink BGRA8 scheduling format reduce the driver-side `glReadPixels` stall?

User-visible result:

- The long pauses appear to be gone.
- Playback still stutters, but the stutters look limited to a few frames rather than multi-second freezes.

Telemetry summary:

- Throughput recovered to roughly real time in the sampled window.
- Over 5 seconds, the app pushed and popped 305 output frames.
- `asyncQueueReadPixelsMs` dropped from the earlier 8-14 ms range to roughly 0.05-0.13 ms in the representative samples.
- `renderMs` usually sat around 2-5 ms in the sampled burst.
- Late and dropped frame counts did not increase during the 5-second delta sample.
- The ready queue still repeatedly touched 0 and accumulated underruns, which matches the remaining short stutters.

Representative samples:

| readyDepth | renderMs | smoothedRenderMs | drawMs | mapMs | copyMs | asyncQueueReadPixelsMs | queueWaitMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 4.855 | 9.494 | 0.570 | 0.234 | 0.822 | 0.128 | 0.026 |
| 0 | 1.957 | 9.041 | 0.468 | 0.139 | 0.604 | 0.048 | 0.016 |
| 0 | 3.366 | 5.879 | 0.513 | 1.166 | 0.692 | 0.129 | 0.022 |
| 0 | 5.208 | 6.492 | 2.209 | 1.358 | 0.714 | 0.090 | 0.061 |
| 0 | 2.957 | 8.852 | 0.537 | 1.041 | 0.547 | 0.087 | 0.040 |

Five-second delta:

| pushed | popped | ready underruns | zero-depth samples | late delta | dropped delta | scheduled lead |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 305 | 305 | 129 | 671 | 0 | 0 | 20 |

Read:

The main readback stall appears to have been the previous format/path combination, not unavoidable BGRA8 bandwidth. The remaining problem now looks like cadence and buffering: the producer can average real-time throughput again, but the ready queue still runs empty often enough to cause visible short stutters.
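The bookkeeping behind the five-second delta (pushed, popped, underruns, zero-depth samples) can be sketched as a small counter struct; all names here are illustrative, not the app's real queue:

```cpp
#include <cstdint>
#include <deque>

// Hedged sketch of the ready-queue accounting: pushes and pops move frames
// through a short staging queue, a pop against an empty queue counts as an
// underrun, and a periodic sampler records zero-depth observations.
struct ReadyQueueStats {
    std::deque<int> ready;  // frame ids stand in for real output frames
    uint64_t pushed = 0, popped = 0, underruns = 0, zeroDepthSamples = 0;

    void push(int frame) {
        ready.push_back(frame);
        ++pushed;
    }

    bool pop() {
        if (ready.empty()) {
            ++underruns;  // consumer wanted a frame the producer had not staged
            return false;
        }
        ready.pop_front();
        ++popped;
        return true;
    }

    void sample() {
        if (ready.empty()) ++zeroDepthSamples;
    }
};
```

Under this model, `pushed == popped` with nonzero underruns and many zero-depth samples is consistent with a producer that keeps up on average but repeatedly lets the staging queue drain to empty.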