Phase 7.5 Readback Experiment Log

This log tracks short readback experiments during the proactive playout timing work.

The later experiments point to a larger ownership change rather than more local fallback tweaks. The proposed follow-up design is PHASE_7_7_RENDER_CADENCE_PLAYOUT_DESIGN.md.

How To Run

The default debugger launch keeps the current production path:

  • Debug LoopThroughWithOpenGLCompositing
  • VST_OUTPUT_READBACK_MODE unset
  • mode: async_pbo

Comparison modes are still available:

  • VST_OUTPUT_READBACK_MODE=async_pbo
  • uses the older PBO/fence readback path

The experiment launches are:

  • Debug LoopThroughWithOpenGLCompositing - sync readback experiment
      • VST_OUTPUT_READBACK_MODE=sync
      • uses direct synchronous glReadPixels() every output frame

  • Debug LoopThroughWithOpenGLCompositing - cached output experiment
      • VST_OUTPUT_READBACK_MODE=cached_only
      • uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
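
For reference, mode selection is driven entirely by the VST_OUTPUT_READBACK_MODE environment variable. A minimal sketch of how such a switch can be read at startup is shown below; the enum and function names are illustrative, not the project's actual symbols.

```cpp
#include <cstdlib>
#include <cstring>

// Illustrative enum; the project's real symbol names may differ.
enum class OutputReadbackMode { AsyncPbo, Sync, CachedOnly };

// Resolve the experiment mode from VST_OUTPUT_READBACK_MODE.
// Unset or unrecognized values keep the production async_pbo path.
OutputReadbackMode ResolveOutputReadbackMode()
{
    const char* value = std::getenv("VST_OUTPUT_READBACK_MODE");
    if (value == nullptr)
        return OutputReadbackMode::AsyncPbo;   // default production path
    if (std::strcmp(value, "sync") == 0)
        return OutputReadbackMode::Sync;       // per-frame synchronous glReadPixels experiment
    if (std::strcmp(value, "cached_only") == 0)
        return OutputReadbackMode::CachedOnly; // one bootstrap readback, then cached copies
    return OutputReadbackMode::AsyncPbo;       // "async_pbo" and anything else
}
```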

Experiment 3: fast_transfer

Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

  • DeckLink output frames are now created with CreateVideoFrameWithBuffer().
  • Output frame buffers are owned by PinnedMemoryAllocator.
  • VideoIOOutputFrame carries a texture-transfer callback.
  • The test branch changed the default render readback path to try VideoFrameTransfer::GPUtoCPU against the output texture for BGRA output.
  • If fast transfer is unavailable or fails, the code falls back to cached output if present, then synchronous readback as a safety fallback.

Question:

Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?

Result:

  • The test machine reported GL_VENDOR=NVIDIA Corporation and GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2.
  • The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on GL_RENDERER containing Quadro.
  • GL_AMD_pinned_memory was also unavailable.
  • The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.

Baseline: async_pbo

Date: 2026-05-11

Observed while the app was running, after the async queue split counters were added.

Summary:

  • ready queue was pinned at 0 or briefly 1
  • underrun, zero-depth, late, and dropped counts increased continuously
  • renderRequestMs usually sat around 16-25 ms, with occasional larger spikes
  • asyncQueueMs was mostly explained by asyncQueueReadPixelsMs
  • PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
|---|---|---|---|---|---|---|---|---|
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read:

The main repeated cost is issuing glReadPixels(..., nullptr) into the PBO. glBufferData, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
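
For orientation, the async_pbo path follows the standard PBO readback pattern: orphan the pack buffer, issue glReadPixels with a null offset so the transfer targets the PBO, insert a fence, and map and copy on a later frame once the fence has signalled. The sketch below is that generic pattern, not the project's exact code; the readback format/type is left as a parameter because that choice is exactly what Experiment 4 later targets.

```cpp
#include <GL/glew.h>   // any loader works; a current OpenGL context is assumed
#include <cstring>

// One in-flight asynchronous readback: queued on frame N, harvested later.
struct PboReadback {
    GLuint     pbo   = 0;
    GLsync     fence = nullptr;
    GLsizeiptr size  = 0;
};

// Stage 1: queue the transfer (asyncQueueMs, dominated by asyncQueueReadPixelsMs).
void QueueReadback(PboReadback& rb, GLint w, GLint h, GLenum format, GLenum type)
{
    rb.size = static_cast<GLsizeiptr>(w) * h * 4;
    if (rb.pbo == 0)
        glGenBuffers(1, &rb.pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, rb.pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, rb.size, nullptr, GL_STREAM_READ); // orphan (asyncQueueBufferMs)
    // With a pack buffer bound, the data pointer is an offset into the PBO.
    glReadPixels(0, 0, w, h, format, type, nullptr);
    rb.fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

// Stage 2: harvest once the fence has signalled (mapMs and copyMs).
bool HarvestReadback(PboReadback& rb, void* cpuDestination)
{
    if (rb.fence == nullptr)
        return false;
    const GLenum status = glClientWaitSync(rb.fence, 0, 0); // poll, do not block
    if (status != GL_ALREADY_SIGNALED && status != GL_CONDITION_SATISFIED)
        return false;
    glDeleteSync(rb.fence);
    rb.fence = nullptr;
    glBindBuffer(GL_PIXEL_PACK_BUFFER, rb.pbo);
    void* mapped = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, rb.size, GL_MAP_READ_BIT);
    if (mapped != nullptr) {
        std::memcpy(cpuDestination, mapped, static_cast<size_t>(rb.size));
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return mapped != nullptr;
}
```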

Experiment 1: sync

Status: sampled

Question:

Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?

Expected interpretation:

  • If syncReadMs is lower than asyncQueueReadPixelsMs and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
  • If syncReadMs is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.
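
For reference, the sync mode amounts to a blocking glReadPixels straight into CPU memory on every output frame, as sketched below. The format and type shown are placeholders, and the destination is whatever CPU frame buffer the output path fills.

```cpp
#include <GL/glew.h>
#include <vector>

// Direct synchronous readback: blocks the render thread until the GPU has
// drained and the pixels have landed in CPU memory (measured as syncReadMs).
void ReadFrameSync(GLint w, GLint h, std::vector<unsigned char>& out)
{
    out.resize(static_cast<size_t>(w) * h * 4);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);   // no pack buffer: read straight to CPU
    glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, out.data());
}
```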

Results:

Date: 2026-05-11

Summary:

  • ready queue remained pinned at 0
  • underrun, zero-depth, late, and dropped counts continued increasing
  • asyncQueueMs and async readback counters were 0, confirming the experiment mode was active
  • direct syncReadMs was generally worse than the baseline PBO asyncQueueReadPixelsMs

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
|---|---|---|---|---|---|---|
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read:

Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.

Experiment 2: cached_only

Status: sampled

Question:

If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

  • If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
  • If ready depth still stays near 0, the bottleneck is elsewhere in scheduling, frame acquisition, queueing, or DeckLink handoff.
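
For reference, the cached-only serving path under test amounts to the sketch below: one bootstrap synchronous readback fills a CPU cache, and every later output frame is a plain copy from that cache with no further GPU readback. Names are illustrative.

```cpp
#include <GL/glew.h>
#include <cstring>
#include <vector>

// Cached-only experiment: read the GPU output once, then serve every later
// frame from the cached CPU copy (visually frozen, but readback-free).
class CachedOutputSource {
public:
    void FillOutputFrame(void* frameBytes, GLint w, GLint h)
    {
        const size_t bytes = static_cast<size_t>(w) * h * 4;
        if (mCache.empty()) {
            // Bootstrap: one synchronous readback (syncFallbackCount stays at 1).
            mCache.resize(bytes);
            glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, mCache.data());
        }
        // Steady state: cachedCopyMs is just this copy.
        std::memcpy(frameBytes, mCache.data(), bytes);
    }

private:
    std::vector<unsigned char> mCache;
};
```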

Results:

Date: 2026-05-11

User-visible result:

  • DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

  • renderRequestMs dropped to roughly 1-3 ms.
  • cachedCopyMs was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
  • asyncQueueMs, asyncQueueReadPixelsMs, syncReadMs, fence wait, map, and async copy were 0 after bootstrap.
  • syncFallbackCount stayed at 1, confirming one bootstrap readback.
  • cachedFallbackCount increased continuously, confirming ongoing frames were served from cached CPU memory.
  • late and dropped counts were 0 during the sampled run.
  • internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read:

Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.

The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.

Experiment 4: BGRA8 pack framebuffer async readback

Status: sampled

Date: 2026-05-11

Change:

  • The output path now packs/blits the final output into a BGRA8-compatible framebuffer before readback.
  • Async readback reads from the pack framebuffer using GL_BGRA / GL_UNSIGNED_INT_8_8_8_8_REV.
  • The deeper async PBO ring remains active.
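
A minimal sketch of the pack step described above, assuming the rendered output lives in a source framebuffer: blit into an 8-bit framebuffer whose layout matches the DeckLink BGRA frame, then read it back with GL_BGRA / GL_UNSIGNED_INT_8_8_8_8_REV so the driver does not have to convert formats inside glReadPixels. Setup details and names are illustrative.

```cpp
#include <GL/glew.h>

// Pack framebuffer backed by an 8-bit renderbuffer (GL_RGBA8 storage read back
// as GL_BGRA is the usual "BGRA8-compatible" pairing).
struct PackFbo { GLuint fbo = 0; GLuint rbo = 0; };

PackFbo CreatePackFbo(GLint w, GLint h)
{
    PackFbo p;
    glGenRenderbuffers(1, &p.rbo);
    glBindRenderbuffer(GL_RENDERBUFFER, p.rbo);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, w, h);
    glGenFramebuffers(1, &p.fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, p.fbo);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, p.rbo);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    return p;
}

void PackAndQueueReadback(GLuint sourceFbo, const PackFbo& pack, GLint w, GLint h, GLuint pbo)
{
    // Blit/pack the final output into the BGRA8-compatible framebuffer.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, sourceFbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, pack.fbo);
    glBlitFramebuffer(0, 0, w, h, 0, 0, w, h, GL_COLOR_BUFFER_BIT, GL_NEAREST);

    // Read from the pack framebuffer in the format DeckLink scheduling expects.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, pack.fbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
}
```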

Question:

Does making the GPU output/readback format match the DeckLink BGRA8 scheduling format reduce the driver-side glReadPixels stall?

User-visible result:

  • Long pauses appear to be gone.
  • Playback still stutters, but the stutters look limited to a few frames rather than multi-second freezes.

Telemetry summary:

  • Throughput recovered to roughly real time in the sampled window.
  • Over 5 seconds, the app pushed and popped 305 output frames.
  • asyncQueueReadPixelsMs dropped from the earlier 8-14 ms range to roughly 0.05-0.13 ms in the representative samples.
  • renderMs usually sat around 2-5 ms in the sampled burst.
  • Late and dropped frame counts did not increase during the 5 second delta sample.
  • The ready queue still repeatedly touched 0 and accumulated underruns, which matches the remaining short stutters.

Representative samples:

| readyDepth | renderMs | smoothedRenderMs | drawMs | mapMs | copyMs | asyncQueueReadPixelsMs | queueWaitMs |
|---|---|---|---|---|---|---|---|
| 0 | 4.855 | 9.494 | 0.570 | 0.234 | 0.822 | 0.128 | 0.026 |
| 0 | 1.957 | 9.041 | 0.468 | 0.139 | 0.604 | 0.048 | 0.016 |
| 0 | 3.366 | 5.879 | 0.513 | 1.166 | 0.692 | 0.129 | 0.022 |
| 0 | 5.208 | 6.492 | 2.209 | 1.358 | 0.714 | 0.090 | 0.061 |
| 0 | 2.957 | 8.852 | 0.537 | 1.041 | 0.547 | 0.087 | 0.040 |

Five-second delta:

| pushed | popped | ready underruns | zero-depth samples | late delta | dropped delta | scheduled lead |
|---|---|---|---|---|---|---|
| 305 | 305 | 129 | 671 | 0 | 0 | 20 |

Read:

The main readback stall appears to have been the previous format/path combination, not unavoidable BGRA8 bandwidth. The remaining problem now looks like cadence and buffering: the producer can average real-time throughput again, but the ready queue still runs empty often enough to create visible short stutters.

Experiment 5: producer burst-fill ready queue

Status: sampled

Date: 2026-05-12

Change:

  • The output producer now honors OutputProductionDecision::requestedFrames instead of always producing one frame per wake.
  • The producer no longer applies its wake-interval throttle while the ready queue is below target depth.
  • Completion fallback remains conservative; the background producer is responsible for building the cushion after immediate scheduling.
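
A sketch of the burst-fill behavior with stand-in types: requestedFrames from the production decision is honored in a loop, and the wake-interval sleep only applies once the ready queue has reached its target depth. Only OutputProductionDecision::requestedFrames and OutputProducerWakeInterval() come from the log; the rest is illustrative.

```cpp
#include <chrono>
#include <thread>

// Stand-ins for the real producer types (names are illustrative).
struct OutputProductionDecision { int requestedFrames = 0; };
struct ReadyQueue { int depth = 0; int targetDepth = 4; };

// Placeholder for render + readback + push-to-ready-queue.
static bool ProduceOneOutputFrame(ReadyQueue& q) { ++q.depth; return true; }
static std::chrono::milliseconds OutputProducerWakeInterval() { return std::chrono::milliseconds(8); }

// One producer wake: honor requestedFrames instead of producing a single frame,
// and skip the wake-interval throttle while the ready queue is below target depth.
void ProducerIteration(ReadyQueue& queue, const OutputProductionDecision& decision)
{
    for (int i = 0; i < decision.requestedFrames; ++i) {
        if (!ProduceOneOutputFrame(queue))
            break;                                        // stop the burst on failure
    }
    if (queue.depth >= queue.targetDepth)
        std::this_thread::sleep_for(OutputProducerWakeInterval());
    // Below target: return immediately so the caller loops again without delay.
}
```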

Question:

Now that BGRA8 readback is fast enough on average, can the producer maintain a small ready-frame cushion instead of hovering at zero?

Expected interpretation:

  • If short stutters reduce and readyQueue.depth spends more time above zero, the remaining issue was producer cadence/headroom.
  • If readyQueue.depth still remains pinned near zero, inspect render-thread contention next: preview present, input upload, runtime-event bursts, and live-state composition.
  • If render spikes increase, burst production may be overloading the shared render thread and should be tuned with a smaller target/depth policy.

Result:

  • User-visible playback looked about the same.
  • DeckLink reported a healthier 10-frame buffer.
  • The app ready queue now briefly reaches 1-3, but still often drains to 0.
  • No late, dropped, flushed, async-miss, or cached-fallback deltas were observed in the 8-second sample.
  • Readback remained fast.

Representative samples:

| readyDepth | renderMs | smoothedRenderMs | drawMs | mapMs | copyMs | asyncQueueReadPixelsMs | queueWaitMs |
|---|---|---|---|---|---|---|---|
| 0 | 4.756 | 10.135 | 0.502 | 0.186 | 0.603 | 0.088 | 0.032 |
| 1 | 5.135 | 6.968 | 0.730 | 1.269 | 0.772 | 0.088 | 0.073 |
| 1 | 3.578 | 6.821 | 0.702 | 1.247 | 0.618 | 0.097 | 0.103 |
| 1 | 6.733 | 7.996 | 0.537 | 0.952 | 0.694 | 0.082 | 1.218 |
| 0 | 5.276 | 16.782 | 0.550 | 0.119 | 0.766 | 0.090 | 0.016 |

Eight-second delta:

| pushed | popped | ready underruns | zero-depth samples | late delta | dropped delta | async misses | cached fallbacks | system scheduled |
|---|---|---|---|---|---|---|---|---|
| 477 | 478 | 109 | 291 | 0 | 0 | 0 | 0 | 12 |

Read:

Burst filling improved device-side buffering but did not remove visible cadence issues. The remaining stutter is less likely to be raw output readback or device starvation. Next candidates are render-thread interference and pacing jitter: preview present, input upload, runtime-event/live-state bursts, and occasional completion/render spikes.

Experiment 6: producer work-before-sleep pacing

Status: ready for hardware test

Date: 2026-05-12

Change:

  • The output producer now checks ready-queue pressure before waiting on the producer condition variable.
  • When production is requested, the producer renders immediately instead of first sleeping for OutputProducerWakeInterval().
  • The wake interval remains as the idle/no-work sleep path, not as a mandatory pre-production delay.
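
A sketch of the work-before-sleep pacing using a standard condition-variable pattern: check queue pressure first, render immediately when production is requested, and only fall into the wake-interval wait when there is nothing to do. All names and the 8 ms interval are illustrative.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative producer pacing state shared with whoever consumes ready frames.
struct ProducerState {
    std::mutex              mutex;
    std::condition_variable wake;
    int  readyDepth  = 0;
    int  targetDepth = 4;
    bool stop        = false;
};

static void RenderOneReadyFrame() { /* render + readback in the real app */ }

void ProducerLoop(ProducerState& s)
{
    std::unique_lock<std::mutex> lock(s.mutex);
    while (!s.stop) {
        if (s.readyDepth < s.targetDepth) {
            // Work-before-sleep: production is requested, so render now instead
            // of first sleeping for OutputProducerWakeInterval().
            lock.unlock();
            RenderOneReadyFrame();          // GPU work happens outside the lock
            lock.lock();
            ++s.readyDepth;                 // publish the new ready frame
            continue;
        }
        // Idle/no-work path only: the wake interval is a sleep, not a
        // mandatory pre-production delay.
        s.wake.wait_for(lock, std::chrono::milliseconds(8),
                        [&] { return s.stop || s.readyDepth < s.targetDepth; });
    }
}
```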

Question:

Does removing the unconditional pre-check sleep let the producer rebuild queue headroom more quickly after a shallow-queue or focus-related disturbance?

Expected interpretation:

  • If DeckLink buffer depth is steadier and ready-queue underruns slow, the pre-production sleep was part of the cadence loss.
  • If the result is unchanged, the next likely culprit is render-thread interference rather than producer wake timing.
  • If CPU usage rises while playback does not improve, the producer may need a more explicit event/pacing model instead of tighter polling.

Experiment 7: remove just-in-time render from completion path

Status: ready for hardware test

Date: 2026-05-12

Change:

  • DeckLink completion processing no longer renders an output frame synchronously when the ready queue is empty.
  • Completion now schedules an already-ready frame if one exists, otherwise it uses the explicit underrun fallback and wakes the producer.
  • The producer is now solely responsible for rendering ahead and keeping the ready queue fed.
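
A sketch of the slimmed-down completion handling under this change: schedule an already-ready frame if one exists, otherwise record the underrun, use the explicit fallback, and wake the producer, but never render from inside completion. The queue and helper names are stand-ins for the app's own types, not the actual implementation.

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Illustrative stand-ins for the app ready queue and DeckLink wrapper.
struct ReadyFrame { /* CPU pixels + timing metadata */ };
struct OutputPath {
    std::mutex mutex;
    std::queue<ReadyFrame> readyQueue;
    int underrunCount = 0;

    void ScheduleOnDeckLink(const ReadyFrame&) {}   // wraps IDeckLinkOutput::ScheduleVideoFrame
    void ScheduleUnderrunFallback() {}              // explicit underrun fallback frame
    void WakeProducer() {}                          // notify the producer thread
};

// Invoked from DeckLink completion processing (ScheduledFrameCompleted).
// No rendering happens here, so completion stays short and predictable.
void OnFrameCompleted(OutputPath& out)
{
    std::optional<ReadyFrame> next;
    {
        std::lock_guard<std::mutex> lock(out.mutex);
        if (!out.readyQueue.empty()) {
            next = out.readyQueue.front();
            out.readyQueue.pop();
        } else {
            ++out.underrunCount;                    // record the miss, do not render here
        }
    }
    if (next)
        out.ScheduleOnDeckLink(*next);              // hand an already-rendered frame to the device
    else
        out.ScheduleUnderrunFallback();             // explicit underrun fallback
    out.WakeProducer();                             // producer owns rebuilding the cushion
}
```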

Question:

Does removing completion-time rendering make output cadence more stable by keeping DeckLink completion handling short and predictable?

Expected interpretation:

  • If playback improves or completion pacing spikes shrink, just-in-time rendering in the completion path was harming cadence.
  • If underrun/fallback counts increase, the producer still is not maintaining enough ready headroom.
  • If visible output gets worse but telemetry is clearer, implement a real repeat-last-system-frame fallback instead of rendering from completion.

Experiment 8: four-frame DeckLink preroll

Status: ready for hardware test

Date: 2026-05-12

Change:

  • VideoPlayoutPolicy::targetPrerollFrames is reduced from 12 to 4.
  • The system-memory frame pool remains larger than the DeckLink preroll so the producer can still build app-side ready headroom.

Question:

Can a smaller DeckLink scheduled buffer stay stable now that BGRA8 readback is fast and the producer is responsible for render-ahead?

Expected interpretation:

  • If DeckLink holds around 4 frames and playback cadence is acceptable, a large 10-12 frame device buffer is not required.
  • If focus changes or render-thread jitter drain DeckLink below 4, the next work should prioritize real device-buffer telemetry and render-thread interference.
  • If black flicker continues, it is the explicit underrun fallback being exposed by the no-JIT completion path, not a lack of DeckLink preroll alone.

Experiment 9: no steady-state black fallback

Status: sampled

Date: 2026-05-12

Change:

  • Normal DeckLink completion processing no longer schedules a black fallback frame when the app ready queue is empty.
  • RenderOutputQueue::TryPop() still records the app-ready underrun.
  • The producer is woken and the existing DeckLink scheduled buffer is allowed to carry playback.
  • The four-frame DeckLink preroll experiment remains active.

Question:

Was the visible black flicker caused by treating an app-ready queue miss as immediate device starvation?

Expected interpretation:

  • If black flicker disappears while app-ready underruns still increase, the fallback was too aggressive and should stay out of the steady-state path.
  • If DeckLink buffer drains or late/dropped frames increase, we need real device-buffer telemetry and a controlled emergency policy.
  • If visible stutter remains without black, the next work is cadence attribution: preview present, input upload, render-thread priority, and actual DeckLink buffered-frame count.

Result:

  • Playback was smooth briefly, then froze once the DeckLink buffer reached 0.
  • The buffer did not refill.

Read:

Removing the black fallback exposed another completion-driven assumption. The producer could render into the app ready queue, but scheduling still happened only from completion processing. Once the scheduled DeckLink buffer reached 0, completions stopped, so no later trigger scheduled the producer's ready frames.

Experiment 10: producer-side scheduling

Status: sampled

Date: 2026-05-12

Change:

  • The producer now schedules the frames it produces instead of waiting for a future DeckLink completion to schedule them.
  • A dedicated output scheduling mutex serializes scheduling calls from the producer and completion worker.
  • The four-frame DeckLink preroll and no steady-state black fallback experiments remain active.
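
A sketch of producer-side scheduling under a dedicated scheduling mutex, assuming the stock IDeckLinkOutput::ScheduleVideoFrame call; the frame-timing bookkeeping and the 25 fps numbers are illustrative.

```cpp
#include <mutex>
#include <DeckLinkAPI.h>   // DeckLink SDK header; exact include path varies by platform

// Illustrative scheduling state shared by the producer and the completion worker.
struct OutputScheduling {
    std::mutex   mutex;                  // dedicated output scheduling mutex
    BMDTimeValue nextDisplayTime = 0;
    BMDTimeValue frameDuration   = 1000; // e.g. 1000/25000 for 25 fps
    BMDTimeScale timeScale       = 25000;
};

// Schedule one produced frame directly from the producer thread instead of
// waiting for a future DeckLink completion to do it.
bool ScheduleFromProducer(IDeckLinkOutput* output, IDeckLinkVideoFrame* frame,
                          OutputScheduling& sched)
{
    std::lock_guard<std::mutex> lock(sched.mutex);  // serialize with the completion worker
    const HRESULT hr = output->ScheduleVideoFrame(frame, sched.nextDisplayTime,
                                                  sched.frameDuration, sched.timeScale);
    if (hr != S_OK)
        return false;
    sched.nextDisplayTime += sched.frameDuration;   // advance the schedule timeline
    return true;
}
```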

Question:

Can the producer maintain the four-frame DeckLink buffer without relying on completion-time rendering or black fallback insertion?

Expected interpretation:

  • If the buffer refills and playback no longer freezes, producer-side scheduling is required for a real proactive playout model.
  • If black flicker is gone but stutter remains, focus on render-thread jitter and actual device-buffer telemetry.
  • If the buffer overfills or scheduling timing becomes odd, add real DeckLink buffered-frame telemetry and schedule only up to a measured target.

Result:

  • The DeckLink buffer stayed full.
  • Playback had a low-framerate look.
  • Over a 6-second sample, pushedDelta and poppedDelta were 310, but underrunDelta was also 310.
  • Late and dropped counts increased.
  • Synthetic scheduled lead grew very large, indicating producer-side scheduling was running too far ahead of the intended four-frame cushion.

Read:

Producer-side scheduling is required, but it must be capped by a real scheduling target. Scheduling every produced frame overfeeds the scheduler timeline and can produce odd cadence even when the DeckLink buffer appears full.

Experiment 11: cap producer scheduling to preroll target

Status: sampled

Date: 2026-05-12

Change:

  • The producer still renders proactively.
  • After production, it schedules ready frames only until the system-memory scheduled count reaches VideoPlayoutPolicy::targetPrerollFrames.
  • With the current experiment settings, that target remains four frames.
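
A sketch of the cap with illustrative names: after production, ready frames are drained into DeckLink scheduling only while the system-memory scheduled count is below targetPrerollFrames.

```cpp
#include <queue>

// Illustrative stand-ins; targetPrerollFrames mirrors VideoPlayoutPolicy.
struct Frame {};
struct PlayoutState {
    std::queue<Frame> readyQueue;     // app-side rendered frames
    int scheduledSystemFrames = 0;    // system-memory frames currently scheduled
    int targetPrerollFrames   = 4;    // current experiment setting

    bool ScheduleNext(const Frame&) { ++scheduledSystemFrames; return true; } // wraps ScheduleVideoFrame
};

// Top up DeckLink scheduling from already-ready frames, but never past the
// preroll target, so producer scheduling cannot run hundreds of frames ahead.
void TopUpScheduling(PlayoutState& s)
{
    while (s.scheduledSystemFrames < s.targetPrerollFrames && !s.readyQueue.empty()) {
        if (!s.ScheduleNext(s.readyQueue.front()))
            break;
        s.readyQueue.pop();
    }
}
```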

Question:

Can producer-side scheduling keep the four-frame buffer fed without running hundreds of frames ahead in scheduler time?

Expected interpretation:

  • If the low-framerate look disappears and the buffer stays around four, producer scheduling needed a cap.
  • If the buffer drains, the cap needs actual DeckLink GetBufferedVideoFrameCount() telemetry rather than system-memory scheduled-count approximation.
  • If stutter remains with sane lead, investigate render-thread interference next.

Result:

  • Playback still had the low-framerate look.
  • The system-memory scheduled count held at the four-frame target.
  • Synthetic scheduled lead still grew, with scheduled frame index advancing faster than completed frame index.

Read:

The cap was active, but completion and producer were both still scheduling ready frames. The result was still over-scheduling relative to completions, even though the system-memory scheduled count stayed at four.

Experiment 12: producer owns steady-state scheduling

Status: sampled

Date: 2026-05-12

Change:

  • Completion processing now releases completed frames, records telemetry, and wakes the producer only.
  • Completion no longer schedules from the ready queue during steady state.
  • Producer-side scheduling remains capped to targetPrerollFrames.

Question:

Does having a single steady-state scheduler stop the schedule timeline from running ahead and recover normal cadence?

Expected interpretation:

  • If scheduled lead stops growing and playback cadence improves, duplicate completion/producer scheduling was the low-framerate cause.
  • If the buffer drains, the producer wake/schedule loop is still not responsive enough.
  • If lead still grows, inspect VideoPlayoutScheduler catch-up accounting next.

Result:

  • Playback froze on startup.
  • Telemetry showed rendered ready frames in the app ready queue, but zero system-memory frames scheduled.

Read:

Removing completion-side scheduling exposed another producer-loop gap. The producer only scheduled immediately after producing frames. Once the ready queue reached its max depth, production stopped, and the already-ready frames were never handed to DeckLink.

Experiment 13: producer top-up scheduling before production

Status: pending hardware build

Date: 2026-05-12

Change:

  • The producer now attempts to top up DeckLink scheduling from already-ready frames before deciding whether to render more frames.
  • The producer also tops up after successful production.
  • Completion remains release/record/wake only.
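
A sketch of the revised producer iteration, reusing a top-up pass like the one sketched under experiment 11: schedule from already-ready frames first, then render while there is headroom, topping up again after each production, so the loop cannot stall with ready frames parked in the app queue. All names and counters are illustrative.

```cpp
// Illustrative producer iteration for the schedule-before-produce change.
struct ProducerContext {
    int readyDepth      = 0;   // app ready-queue depth
    int maxReadyDepth   = 4;   // ready-queue capacity
    int scheduledFrames = 0;   // system-memory frames currently scheduled
    int targetPreroll   = 4;   // DeckLink preroll target
};

static void TopUpScheduling(ProducerContext& c)
{
    // Move already-rendered frames into DeckLink scheduling up to the target.
    while (c.scheduledFrames < c.targetPreroll && c.readyDepth > 0) {
        --c.readyDepth;
        ++c.scheduledFrames;
    }
}

static bool ProduceOneFrame(ProducerContext& c)
{
    if (c.readyDepth >= c.maxReadyDepth)
        return false;             // ready queue full; nothing to render
    ++c.readyDepth;               // placeholder for render + readback
    return true;
}

void ProducerIteration(ProducerContext& c)
{
    TopUpScheduling(c);           // 1. schedule before deciding whether to render
    while (ProduceOneFrame(c))    // 2. render while there is ready-queue headroom
        TopUpScheduling(c);       // 3. schedule again after each produced frame
    // Completion stays release/record/wake only; all scheduling happens here.
}
```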

Question:

Can the producer own steady-state scheduling without freezing when ready frames already exist?

Expected interpretation:

  • If startup no longer freezes and the four-frame buffer stays stable, the producer needed an explicit schedule-before-produce pass.
  • If cadence is still wrong, the next target is scheduler timeline accounting or actual DeckLink buffered-frame telemetry.

Result:

  • Playback alternated between smooth playback and freezes.
  • The app ready queue was no longer starving; it held around 3-4 frames and had no new ready underruns in the sampled delta.
  • Late and dropped counts increased.
  • scheduledIndexDelta was much larger than completedIndexDelta, even with producer scheduling capped.

Read:

The proactive producer now feeds the app queue, but VideoPlayoutScheduler catch-up accounting still advances scheduled stream time on late/drop recovery. That creates timeline gaps and produces the smooth/freeze/smooth cadence.

Experiment 14: disable late/drop catch-up skipping

Status: pending hardware build

Date: 2026-05-12

Change:

  • VideoPlayoutPolicy::lateOrDropCatchUpFrames is set to 0.
  • Late/drop results should still be reported, but the scheduler should not advance mScheduledFrameIndex by extra catch-up frames.
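
A sketch of the accounting being disabled, following the log's description: on a late or dropped completion, the scheduler previously advanced mScheduledFrameIndex by extra catch-up frames, and setting lateOrDropCatchUpFrames to 0 removes that skip. The member names come from the log; the surrounding structure is assumed, not the actual VideoPlayoutScheduler code.

```cpp
// Illustrative completion-result accounting inside a playout scheduler.
struct PlayoutPolicyLike {
    int lateOrDropCatchUpFrames = 0;   // experiment 14: previously > 0, now 0
};

struct PlayoutSchedulerLike {
    long long mScheduledFrameIndex = 0;
    long long mCompletedFrameIndex = 0;
    PlayoutPolicyLike policy;

    void OnFrameCompleted(bool lateOrDropped)
    {
        ++mCompletedFrameIndex;
        if (lateOrDropped) {
            // Old behavior: skip ahead in schedule time to "catch up", which
            // opens a gap between the scheduled and completed indices.
            mScheduledFrameIndex += policy.lateOrDropCatchUpFrames;
        }
        // With lateOrDropCatchUpFrames == 0, late/drop results are still
        // reported, but scheduled stream time no longer jumps ahead.
    }
};
```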

Question:

Does removing schedule-time skipping stop the smooth/freeze cadence now that the producer owns steady-state scheduling?

Expected interpretation:

  • If scheduledIndexDelta closely matches actual scheduled/completed frame flow and playback smooths out, catch-up skipping was harmful in proactive mode.
  • If late/dropped counts still climb without catch-up, inspect actual DeckLink buffered-frame count and render-thread interference.