# Phase 7.5 Readback Experiment Log

This log tracks short readback experiments during the proactive playout timing work.

## How To Run

The default debugger launch keeps the current production path:

- `Debug LoopThroughWithOpenGLCompositing`
- `VST_OUTPUT_READBACK_MODE` unset
- mode: `async_pbo`

Comparison modes are still available:

- `VST_OUTPUT_READBACK_MODE=async_pbo`
  - uses the older PBO/fence readback path

The experiment launches are:

- `Debug LoopThroughWithOpenGLCompositing - sync readback experiment`
  - `VST_OUTPUT_READBACK_MODE=sync`
  - uses direct synchronous `glReadPixels()` every output frame
- `Debug LoopThroughWithOpenGLCompositing - cached output experiment`
  - `VST_OUTPUT_READBACK_MODE=cached_only`
  - uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.

## Experiment 3: fast_transfer

Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`.
- `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to cached output if present, then to synchronous readback as a safety fallback.

Question: Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?

Result:

- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing `Quadro`.
- `GL_AMD_pinned_memory` was also unavailable.
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.

## Baseline: async_pbo

Date: 2026-05-11

Observed while the app was running after adding the async queue split counters.

Summary:

- ready queue was pinned at 0, or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- `renderRequestMs` usually sat around 16-25 ms, with occasional larger spikes
- `asyncQueueMs` was mostly explained by `asyncQueueReadPixelsMs`
- PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read: The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.

## Experiment 1: sync

Status: sampled

Question: Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?
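For reference, a minimal sketch of what the `sync` mode does per output frame: a direct blocking `glReadPixels()` from the output framebuffer into CPU memory, timed end to end. The function, parameter names, and GLEW loader here are assumptions for illustration, not the real code.

```cpp
#include <GL/glew.h>
#include <chrono>
#include <cstdint>
#include <vector>

// Sketch of the sync experiment path: read the rendered output straight into
// CPU memory, blocking until the GPU has finished. The blocking call is the
// cost the `syncReadMs` counter is meant to capture.
double syncReadbackMs(GLuint outputFbo, int width, int height,
                      std::vector<uint8_t>& cpuFrame)
{
    cpuFrame.resize(static_cast<size_t>(width) * height * 4);

    const auto start = std::chrono::steady_clock::now();

    glBindFramebuffer(GL_READ_FRAMEBUFFER, outputFbo);
    // With no PACK buffer bound, glReadPixels must wait for rendering to
    // finish and copy the pixels before it returns.
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, cpuFrame.data());
    glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);

    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```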
Expected interpretation:

- If `syncReadMs` is lower than `asyncQueueReadPixelsMs` and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If `syncReadMs` is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.

Results:

Date: 2026-05-11

Summary:

- ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts continued increasing
- `asyncQueueMs` and the async readback counters were 0, confirming the experiment mode was active
- direct `syncReadMs` was generally worse than the baseline PBO `asyncQueueReadPixelsMs`

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read: Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.

## Experiment 2: cached_only

Status: sampled

Question: If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere in scheduling, frame acquisition, queueing, or DeckLink handoff.

Results:

Date: 2026-05-11

User-visible result:

- DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

- `renderRequestMs` dropped to roughly 1-3 ms.
- `cachedCopyMs` was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- `asyncQueueMs`, `asyncQueueReadPixelsMs`, `syncReadMs`, fence wait, map, and async copy were 0 after bootstrap.
- `syncFallbackCount` stayed at 1, confirming one bootstrap readback.
- `cachedFallbackCount` increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1, even while DeckLink showed a healthy hardware/device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read: Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy. The internal ready queue depth staying low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
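To make that recommendation concrete, the shape to preserve is the one the PBO/fence path already attempts: issue `glReadPixels` into a pack buffer so the call returns immediately, and only map the buffer once its fence has signalled, typically a frame or two later. A minimal sketch under those assumptions (slot struct, ring size, and helper names are illustrative, not the real code):

```cpp
#include <GL/glew.h>
#include <cstdint>
#include <cstring>
#include <vector>

// One slot of a small PBO ring: the buffer the GPU packs into plus the fence
// that tells us when that pack has finished.
struct ReadbackSlot {
    GLuint pbo = 0;
    GLsync fence = nullptr;
};

// Issue a non-blocking readback into the slot's PBO. With a PACK buffer
// bound, glReadPixels returns immediately; the fence marks completion.
void issueReadback(ReadbackSlot& slot, GLuint outputFbo, int width, int height)
{
    glBindFramebuffer(GL_READ_FRAMEBUFFER, outputFbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, slot.pbo);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
    slot.fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Map and copy only if the fence has already signalled; otherwise skip and
// try again next tick, so the output path never blocks on the transfer.
bool tryCollect(ReadbackSlot& slot, std::vector<uint8_t>& cpuFrame, size_t bytes)
{
    if (!slot.fence)
        return false;
    GLenum status = glClientWaitSync(slot.fence, 0, 0 /* no wait */);
    if (status != GL_ALREADY_SIGNALED && status != GL_CONDITION_SATISFIED)
        return false;
    glDeleteSync(slot.fence);
    slot.fence = nullptr;

    glBindBuffer(GL_PIXEL_PACK_BUFFER, slot.pbo);
    const void* src = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, bytes, GL_MAP_READ_BIT);
    if (src) {
        cpuFrame.resize(bytes);
        std::memcpy(cpuFrame.data(), src, bytes);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return src != nullptr;
}
```

The baseline telemetry suggests the stall on this driver happens inside `glReadPixels` itself rather than in fence waiting or mapping, which is what Experiment 4 goes after.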
## Experiment 4: BGRA8 pack framebuffer async readback

Status: sampled

Date: 2026-05-11

Change:

- The output path now packs/blits the final output into a BGRA8-compatible framebuffer before readback.
- Async readback reads from the pack framebuffer using `GL_BGRA` / `GL_UNSIGNED_INT_8_8_8_8_REV`.
- The deeper async PBO ring remains active.

Question: Does making the GPU output/readback format match the DeckLink BGRA8 scheduling format reduce the driver-side `glReadPixels` stall?

User-visible result:

- Long pauses appear to be gone.
- Playback still stutters, but the stutters look limited to a few frames rather than multi-second freezes.

Telemetry summary:

- Throughput recovered to roughly real time in the sampled window.
- Over 5 seconds, the app pushed and popped 305 output frames.
- `asyncQueueReadPixelsMs` dropped from the earlier 8-14 ms range to roughly 0.05-0.13 ms in the representative samples.
- `renderMs` usually sat around 2-5 ms in the sampled burst.
- Late and dropped frame counts did not increase during the 5-second delta sample.
- The ready queue still repeatedly touched 0 and accumulated underruns, which matches the remaining short stutters.

Representative samples:

| readyDepth | renderMs | smoothedRenderMs | drawMs | mapMs | copyMs | asyncQueueReadPixelsMs | queueWaitMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 4.855 | 9.494 | 0.570 | 0.234 | 0.822 | 0.128 | 0.026 |
| 0 | 1.957 | 9.041 | 0.468 | 0.139 | 0.604 | 0.048 | 0.016 |
| 0 | 3.366 | 5.879 | 0.513 | 1.166 | 0.692 | 0.129 | 0.022 |
| 0 | 5.208 | 6.492 | 2.209 | 1.358 | 0.714 | 0.090 | 0.061 |
| 0 | 2.957 | 8.852 | 0.537 | 1.041 | 0.547 | 0.087 | 0.040 |

Five-second delta:

| pushed | popped | ready underruns | zero-depth samples | late delta | dropped delta | scheduled lead |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 305 | 305 | 129 | 671 | 0 | 0 | 20 |

Read: The main readback stall appears to have been the previous format/path combination, not unavoidable BGRA8 bandwidth. The remaining problem now looks like cadence and buffering: the producer can average real-time throughput again, but the ready queue still runs empty often enough to create visible short stutters.
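For reference, a minimal sketch of the pack-and-read step described in the Change list above: blit the final output into a BGRA8-compatible framebuffer, then issue the PBO readback with the matching `GL_BGRA` / `GL_UNSIGNED_INT_8_8_8_8_REV` pair so the driver does not have to convert formats inside `glReadPixels`. Function and parameter names are assumptions; in the real path the PBO belongs to the existing async ring.

```cpp
#include <GL/glew.h>

// Sketch of Experiment 4's pack step: blit the rendered output into a
// BGRA8-backed pack framebuffer, then read from it into a PACK PBO using a
// format/type pair that matches the DeckLink BGRA8 scheduling format.
void packAndIssueReadback(GLuint sourceFbo, GLuint packFbo, GLuint packPbo,
                          int width, int height)
{
    // Blit the final output into the BGRA8-compatible pack framebuffer.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, sourceFbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, packFbo);
    glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);

    // Read from the pack framebuffer into the PBO; with the formats matched,
    // the transfer stays asynchronous and the call returns quickly.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, packFbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, packPbo);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
}
```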