# Phase 7.5 Readback Experiment Log
This log tracks short readback experiments during the proactive playout timing work.
## How To Run

The default debugger launch keeps the current production path:

- `Debug LoopThroughWithOpenGLCompositing`
  - `VST_OUTPUT_READBACK_MODE` unset
  - mode: `async_pbo`

Comparison modes are still available:

- `VST_OUTPUT_READBACK_MODE=async_pbo`
  - uses the older PBO/fence readback path

The experiment launches are:

- `Debug LoopThroughWithOpenGLCompositing - sync readback experiment`
  - `VST_OUTPUT_READBACK_MODE=sync`
  - uses a direct synchronous `glReadPixels()` on every output frame
- `Debug LoopThroughWithOpenGLCompositing - cached output experiment`
  - `VST_OUTPUT_READBACK_MODE=cached_only`
  - uses one bootstrap synchronous readback, then copies the cached output frame with no ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
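As a minimal sketch of how such a mode switch can be read, the selection logic might look like this; `ReadbackMode` and `readbackModeFromEnv` are illustrative names, not the app's actual types:

```cpp
#include <string>

enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

// Map the VST_OUTPUT_READBACK_MODE value to a readback mode, defaulting to
// the production async_pbo path when the variable is unset or unrecognized.
ReadbackMode readbackModeFromEnv(const char* value) {
    if (value == nullptr) return ReadbackMode::AsyncPbo;  // unset -> default
    const std::string v(value);
    if (v == "sync") return ReadbackMode::Sync;
    if (v == "cached_only") return ReadbackMode::CachedOnly;
    return ReadbackMode::AsyncPbo;  // "async_pbo" and anything else
}
```

In the app the argument would come from `std::getenv("VST_OUTPUT_READBACK_MODE")`; passing the pointer in keeps the helper testable.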
## Experiment 3: fast_transfer
Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`.
- `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to the cached output frame if present, then to synchronous readback as a safety net.

Question:

Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually invalid cached-only shortcut?

Result:

- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing `Quadro`, so DVP was unavailable here.
- `GL_AMD_pinned_memory` was also unavailable.
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
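
The gate that blocked fast transfer here amounts to a string check on the renderer plus an extension probe; `fastTransferAvailable` is a hypothetical helper mirroring the SDK sample's behavior, not code from this repository:

```cpp
#include <string>

// The SDK sample enables NVIDIA DVP only when GL_RENDERER contains "Quadro";
// otherwise a pinned-memory path would need GL_AMD_pinned_memory.
bool fastTransferAvailable(const std::string& renderer, bool hasAmdPinnedMemory) {
    if (renderer.find("Quadro") != std::string::npos) {
        return true;  // NVIDIA DVP path
    }
    return hasAmdPinnedMemory;  // AMD pinned-memory fallback
}
```

On the sampled machine both checks fail (`GeForce` renderer, no `GL_AMD_pinned_memory`), which is why the branch was removed.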
## Baseline: async_pbo
Date: 2026-05-11

Observed while the app was running, after the async queue split counters were added.

Summary:

- the ready queue was pinned at 0, or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- `renderRequestMs` usually sat around 16-25 ms, with occasional larger spikes
- `asyncQueueMs` was mostly explained by `asyncQueueReadPixelsMs`
- PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read:

The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
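The per-stage millisecond counters above (`drawMs`, `mapMs`, `copyMs`, and so on) can be read as wall-clock accumulators around each instrumented stage. A hedged sketch of that instrumentation, with `StageTimer` as an invented name rather than the real telemetry code:

```cpp
#include <chrono>

// Hypothetical per-stage timer: each instrumented stage accumulates
// wall-clock milliseconds into its own slot, so a sample row in the tables
// above is just a snapshot of these accumulators.
struct StageTimer {
    double ms = 0.0;

    template <typename Fn>
    void measure(Fn&& fn) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start;
        ms += elapsed.count();  // accumulate, reset when a sample is emitted
    }
};
```

Wrapping the `glReadPixels` call in one of these is what isolates `asyncQueueReadPixelsMs` from the surrounding buffer and fence work.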
## Experiment 1: sync
Status: sampled

Question:

Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?

Expected interpretation:

- If `syncReadMs` is lower than `asyncQueueReadPixelsMs` and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If `syncReadMs` is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.

Results:

Date: 2026-05-11

Summary:

- the ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts kept increasing
- `asyncQueueMs` and the async readback counters were 0, confirming the experiment mode was active
- direct `syncReadMs` was generally worse than the baseline PBO `asyncQueueReadPixelsMs`

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read:

Direct synchronous readback does not recover the queue and was slower than the async PBO path in the sampled run. The bottleneck appears to be the GPU-to-CPU readback itself, not PBO orphaning or fence handling.
## Experiment 2: cached_only
Status: sampled

Question:

If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere: scheduling, frame acquisition, queueing, or the DeckLink handoff.
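
The cached_only decision path can be sketched as a single branch; `CachedOnlyState` and `nextFrame` are illustrative names, but the counter behavior mirrors the `syncFallbackCount` / `cachedFallbackCount` telemetry:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hedged sketch of cached_only: one bootstrap synchronous readback populates
// the cache, after which every output frame is served by a CPU copy with no
// further GPU readback.
struct CachedOnlyState {
    std::optional<std::vector<uint8_t>> cachedFrame;
    uint64_t syncFallbackCount = 0;    // bootstrap readbacks (should stay 1)
    uint64_t cachedFallbackCount = 0;  // frames served from the cached copy

    // readPixelsSync stands in for the real synchronous GPU readback.
    template <typename ReadFn>
    std::vector<uint8_t> nextFrame(ReadFn&& readPixelsSync) {
        if (!cachedFrame) {
            cachedFrame = readPixelsSync();  // the only GPU readback
            ++syncFallbackCount;
        } else {
            ++cachedFallbackCount;  // ongoing frames: CPU memory only
        }
        return *cachedFrame;
    }
};
```

A `syncFallbackCount` that stays at 1 while `cachedFallbackCount` climbs is exactly the signature the results below report.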
Results:

Date: 2026-05-11

User-visible result:

- DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

- `renderRequestMs` dropped to roughly 1-3 ms.
- `cachedCopyMs` was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- `asyncQueueMs`, `asyncQueueReadPixelsMs`, `syncReadMs`, fence wait, map, and async copy were all 0 after bootstrap.
- `syncFallbackCount` stayed at 1, confirming a single bootstrap readback.
- `cachedFallbackCount` increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1, even while DeckLink showed a healthy device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read:

Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.

The internal ready queue depth staying low while DeckLink reports a healthy device buffer suggests the ready queue acts as a short staging queue rather than as the full device playout buffer. The next fix should therefore prioritize removing the blocking readback from every output frame over merely increasing internal ready queue depth.
## Experiment 4: BGRA8 pack framebuffer async readback
Status: sampled

Date: 2026-05-11

Change:

- The output path now packs/blits the final output into a BGRA8-compatible framebuffer before readback.
- Async readback reads from the pack framebuffer using `GL_BGRA` / `GL_UNSIGNED_INT_8_8_8_8_REV`.
- The deeper async PBO ring remains active.
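
The matched readback pair can be pinned down as constants; this helper is purely illustrative, with the standard GL enum values inlined so the sketch is self-contained without GL headers:

```cpp
#include <cstdint>
#include <utility>

// Standard OpenGL enum values, defined locally for a header-free sketch.
constexpr uint32_t kGL_BGRA = 0x80E1;
constexpr uint32_t kGL_UNSIGNED_INT_8_8_8_8_REV = 0x8367;

// Hypothetical helper: the glReadPixels format/type pair that matches the
// DeckLink BGRA8 scheduling format, so the driver can transfer the pack
// framebuffer without a per-pixel conversion pass.
std::pair<uint32_t, uint32_t> readbackFormatForBgra8() {
    return {kGL_BGRA, kGL_UNSIGNED_INT_8_8_8_8_REV};
}
```

A mismatched pair (for example `GL_RGBA` / `GL_UNSIGNED_BYTE` against a BGRA target) is the kind of combination that can force the driver into a slow conversion inside `glReadPixels`.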
Question:

Does making the GPU output/readback format match the DeckLink BGRA8 scheduling format reduce the driver-side `glReadPixels` stall?

User-visible result:

- The long pauses appear to be gone.
- Playback still stutters, but the stutters look limited to a few frames rather than multi-second freezes.

Telemetry summary:

- Throughput recovered to roughly real time in the sampled window.
- Over 5 seconds, the app pushed and popped 305 output frames.
- `asyncQueueReadPixelsMs` dropped from the earlier 8-14 ms range to roughly 0.05-0.13 ms in the representative samples.
- `renderMs` usually sat around 2-5 ms in the sampled burst.
- Late and dropped frame counts did not increase during the 5-second delta sample.
- The ready queue still repeatedly touched 0 and accumulated underruns, which matches the remaining short stutters.

Representative samples:

| readyDepth | renderMs | smoothedRenderMs | drawMs | mapMs | copyMs | asyncQueueReadPixelsMs | queueWaitMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 4.855 | 9.494 | 0.570 | 0.234 | 0.822 | 0.128 | 0.026 |
| 0 | 1.957 | 9.041 | 0.468 | 0.139 | 0.604 | 0.048 | 0.016 |
| 0 | 3.366 | 5.879 | 0.513 | 1.166 | 0.692 | 0.129 | 0.022 |
| 0 | 5.208 | 6.492 | 2.209 | 1.358 | 0.714 | 0.090 | 0.061 |
| 0 | 2.957 | 8.852 | 0.537 | 1.041 | 0.547 | 0.087 | 0.040 |

Five-second delta:

| pushed | popped | ready underruns | zero-depth samples | late delta | dropped delta | scheduled lead |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 305 | 305 | 129 | 671 | 0 | 0 | 20 |

Read:

The main readback stall appears to have been the previous format/path combination, not unavoidable BGRA8 bandwidth. The remaining problem now looks like cadence and buffering: the producer can average real-time throughput again, but the ready queue still runs empty often enough to cause visible short stutters.
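The bookkeeping behind the five-second delta (pushed, popped, underruns, zero-depth samples) can be sketched as a small counter struct; all names here are illustrative, not the app's real queue:

```cpp
#include <cstdint>
#include <deque>

// Hedged sketch of the ready-queue accounting: pushes and pops move frames
// through a short staging queue, a pop against an empty queue counts as an
// underrun, and a periodic sampler records zero-depth observations.
struct ReadyQueueStats {
    std::deque<int> ready;  // frame ids stand in for real output frames
    uint64_t pushed = 0, popped = 0, underruns = 0, zeroDepthSamples = 0;

    void push(int frame) {
        ready.push_back(frame);
        ++pushed;
    }

    bool pop() {
        if (ready.empty()) {
            ++underruns;  // consumer wanted a frame the producer had not staged
            return false;
        }
        ready.pop_front();
        ++popped;
        return true;
    }

    void sample() {
        if (ready.empty()) ++zeroDepthSamples;
    }
};
```

Under this model, `pushed == popped` with nonzero underruns and many zero-depth samples is consistent with a producer that keeps up on average but repeatedly lets the staging queue drain to empty.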