Performance chasing
All checks were successful
CI / React UI Build (push) Successful in 10s
CI / Native Windows Build And Tests (push) Successful in 2m51s
CI / Windows Release Package (push) Successful in 2m55s

Aiden
2026-05-11 23:10:45 +10:00
parent c5cead6003
commit a434a88108
18 changed files with 1115 additions and 82 deletions

View File

@@ -7,7 +7,7 @@ Phase 7 made backend lifecycle, playout policy, ready-frame queueing, late/drop
## Status
- Phase 7.5 design package: proposed.
- Phase 7.5 implementation: Step 2 in progress.
- Phase 7.5 implementation: Step 5 in progress.
- Current alignment: Phase 7 is complete. `RenderOutputQueue`, `VideoPlayoutPolicy`, `VideoPlayoutScheduler`, `VideoBackendLifecycle`, and backend playout telemetry exist. The backend worker fills the ready queue on completion demand, but render production is not yet proactively driven by queue pressure or video cadence.
Current footholds:
@@ -19,6 +19,9 @@ Current footholds:
- `HealthTelemetry::BackendPlayoutSnapshot` exposes queue depth, underruns, late/drop streaks, and recovery decisions.
- Step 1 adds baseline timing fields for ready-queue min/max/zero-depth samples and output render duration.
- Step 2 adds a pure `OutputProductionController` for queue-pressure production decisions.
- Step 3 adds a proactive output producer worker that keeps `RenderOutputQueue` warm after playback starts.
- Step 4 skips non-forced preview presentation while output ready-queue depth is below target.
- Step 5 makes async readback misses prefer cached output over synchronous readback after bootstrap.
## Timing Review Findings
@@ -199,15 +202,23 @@ Move from demand-filled output production to queue-pressure production.
Initial target:
- producer wakes when queue depth is below target
- producer requests render-thread output production until target depth is reached
- producer stops when backend stops or render thread shuts down
- completion worker mostly schedules from already-ready frames
- [x] producer wakes when queue depth is below target
- [x] producer requests render-thread output production until target depth is reached
- [x] producer stops when backend stops or render thread shuts down
- [x] completion worker mostly schedules from already-ready frames
Exit criteria:
- normal playback does not depend on completion processing to fill the queue from empty
- callback/completion pressure and render production pressure are separate
- [x] normal playback does not depend on completion processing to fill the queue from empty
- [x] callback/completion pressure and render production pressure are separate
Implementation notes:
- `VideoBackend` starts the completion worker before device start, then starts the output producer only after DeckLink start succeeds. This avoids fighting DeckLink preroll for the same output frame pool.
- `OutputProducerWorkerMain()` periodically wakes and uses `OutputProductionController` to decide whether to produce, wait, or throttle.
- Completion handling records pacing/recovery, updates producer pressure, schedules a ready frame, and wakes the producer to refill headroom.
- Completion handling keeps a one-frame synchronous fallback when the ready queue is unexpectedly empty, then falls back to black underrun behavior if that also fails.
- Producer shutdown is explicit and joined before video output teardown.
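The notes above can be pictured as a small timed loop around a pure depth check. The sketch below is a standalone illustration under assumed names: `OutputProductionController`, `OutputProducerWorkerMain()`, and `RenderOutputQueue` are real names from this document, but the members, the 4 ms wake interval, and the stand-in `Decide()` helper are assumptions, not the actual API.
```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

enum class ProductionAction { Produce, Wait };

struct OutputProducerSketch {
    std::atomic<bool> stopRequested{false};
    std::atomic<int> readyDepth{0};   // stand-in for RenderOutputQueue depth
    int targetDepth = 3;
    std::mutex wakeMutex;
    std::condition_variable wake;     // signalled by completion handling

    // Stand-in for OutputProductionController: produce while below target.
    ProductionAction Decide() const {
        return readyDepth.load() < targetDepth ? ProductionAction::Produce
                                               : ProductionAction::Wait;
    }

    // Stand-in for OutputProducerWorkerMain(): periodic wake plus explicit
    // wake-ups, then at most one production request per iteration.
    void WorkerMain() {
        while (!stopRequested.load()) {
            {
                std::unique_lock<std::mutex> lock(wakeMutex);
                wake.wait_for(lock, std::chrono::milliseconds(4));
            }
            if (stopRequested.load())
                break;
            if (Decide() == ProductionAction::Produce) {
                // Real code asks the render thread for one output frame and
                // pushes the result onto the ready queue; modelled as a bump.
                ++readyDepth;
            }
        }
    }
};
```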
### Step 4. Prioritize Playout Over Preview
@@ -215,15 +226,21 @@ Make preview explicitly subordinate to output playout deadlines.
Initial target:
- skip or delay preview when ready queue depth is below target
- [x] skip or delay preview when ready queue depth is below target
- count skipped previews
- record preview present cost separately from output render cost
Exit criteria:
- preview cannot drain output headroom invisibly
- [x] preview cannot drain output headroom invisibly
- runtime telemetry shows preview skips and preview present cost
Implementation notes:
- `OpenGLComposite::paintGL(false)` now skips preview presentation when `VideoBackend` reports that the ready queue is below the target depth.
- Forced preview paints are still allowed so resize/manual paint behavior remains intact.
- Preview skip counters and present-cost telemetry remain follow-up work for this step.
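As a rough illustration, the non-forced skip reduces to a single comparison; the helper below uses hypothetical names and is not the actual `OpenGLComposite::paintGL()` code.
```cpp
// Hedged sketch of the Step 4 preview gate; names are illustrative only.
bool ShouldPresentPreview(bool forcedPaint, int readyQueueDepth, int targetDepth)
{
    // Forced paints (resize, explicit repaint) are always presented.
    if (forcedPaint)
        return true;
    // Non-forced preview yields whenever output headroom is below target.
    return readyQueueDepth >= targetDepth;
}
```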
### Step 5. Make Readback Miss Policy Deadline-Aware
Avoid turning a late async readback fence into synchronous deadline pressure by default.
@@ -232,13 +249,20 @@ Initial target:
- count async readback misses
- count synchronous fallback uses
- allow policy to prefer stale/black output over synchronous fallback when queue pressure is high
- keep current fallback available while behavior is measured
- [x] allow policy to prefer stale/black output over synchronous fallback when queue pressure is high
- [x] keep current fallback available while behavior is measured
Exit criteria:
- readback fallback is an explicit policy decision
- late GPU fences do not automatically block the most timing-sensitive path
- [x] readback fallback is an explicit policy decision
- [x] late GPU fences do not automatically block the most timing-sensitive path
Implementation notes:
- `OpenGLRenderPipeline::ReadOutputFrame()` now uses synchronous readback only to bootstrap the first cached output frame.
- After cached output exists, an async readback miss copies the cached output frame into the DeckLink output frame instead of blocking on synchronous `glReadPixels`.
- Async readback queueing now skips when the next PBO slot is still in flight rather than deleting an in-flight fence and overwriting it.
- Miss/fallback counters remain follow-up telemetry work for this step.
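The notes above amount to an explicit source-selection policy per output frame. A hedged sketch of that decision follows; the enum and helper names are illustrative, not the actual `OpenGLRenderPipeline` API.
```cpp
// Sketch of the Step 5 readback-miss policy; names are illustrative only.
enum class ReadbackSource { AsyncPbo, CachedOutput, SyncBootstrap };

ReadbackSource ChooseReadbackSource(bool asyncReadbackReady, bool haveCachedOutput)
{
    if (asyncReadbackReady)
        return ReadbackSource::AsyncPbo;      // fence signalled: map and copy
    if (haveCachedOutput)
        return ReadbackSource::CachedOutput;  // prefer a stale frame over blocking
    return ReadbackSource::SyncBootstrap;     // only until the first cached frame exists
}
```
Once the first cached frame exists, `haveCachedOutput` stays true, so a missed fence costs one CPU copy (reported as `cachedCopyMs`) rather than a blocking `glReadPixels`.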
### Step 6. Tune Headroom Policy

View File

@@ -0,0 +1,165 @@
# Phase 7.5 Readback Experiment Log
This log tracks short readback experiments during the proactive playout timing work.
## How To Run
The default debugger launch keeps the current production path:
- `Debug LoopThroughWithOpenGLCompositing`
  - `VST_OUTPUT_READBACK_MODE` unset
  - mode: `async_pbo`
Comparison modes are still available:
- `VST_OUTPUT_READBACK_MODE=async_pbo`
  - uses the older PBO/fence readback path
The experiment launches are:
- `Debug LoopThroughWithOpenGLCompositing - sync readback experiment`
  - `VST_OUTPUT_READBACK_MODE=sync`
  - uses direct synchronous `glReadPixels()` on every output frame
- `Debug LoopThroughWithOpenGLCompositing - cached output experiment`
  - `VST_OUTPUT_READBACK_MODE=cached_only`
  - uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback
The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
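For reference, the mode switch is just an environment-variable lookup at startup; the parsing sketch below is an assumption about how `VST_OUTPUT_READBACK_MODE` might map to a mode, not the project's actual code.
```cpp
#include <cstdlib>
#include <string>

enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

// Hypothetical mapping from VST_OUTPUT_READBACK_MODE to a readback mode.
ReadbackMode ReadbackModeFromEnvironment()
{
    const char* raw = std::getenv("VST_OUTPUT_READBACK_MODE");
    const std::string value = raw ? raw : "";
    if (value == "sync")
        return ReadbackMode::Sync;
    if (value == "cached_only")
        return ReadbackMode::CachedOnly;
    // Unset or "async_pbo" keeps the default PBO/fence path.
    return ReadbackMode::AsyncPbo;
}
```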
## Experiment 3: fast_transfer
Status: removed from active code after hardware sample
Date: 2026-05-11
Change:
- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`.
- `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to cached output if present, then synchronous readback as a safety fallback.
Question:
Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?
Result:
- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing `Quadro`.
- `GL_AMD_pinned_memory` was also unavailable.
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
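For context, the gate that ruled out fast transfer on this machine reduces to two capability checks; the helper below is an illustrative reconstruction, not the SDK sample code.
```cpp
#include <string>

// Sketch of the fast-transfer capability gate. The extension list is passed
// in as one string for simplicity; real code would query it per-extension.
bool FastTransferLikelySupported(const std::string& glRenderer,
                                 const std::string& glExtensions)
{
    // The DeckLink SDK sample only enables NVIDIA DVP on Quadro renderers.
    const bool nvidiaDvp = glRenderer.find("Quadro") != std::string::npos;
    // AMD pinned memory requires GL_AMD_pinned_memory.
    const bool amdPinned =
        glExtensions.find("GL_AMD_pinned_memory") != std::string::npos;
    return nvidiaDvp || amdPinned;
}
```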
## Baseline: async_pbo
Date: 2026-05-11
Sampled while the app was running, after the async queue split counters were added.
Summary:
- ready queue was pinned at 0 or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- `renderRequestMs` usually sat around 16-25 ms, with occasional larger spikes
- `asyncQueueMs` was mostly explained by `asyncQueueReadPixelsMs`
- PBO allocation/orphaning was effectively 0 ms
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |
Initial read:
The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
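To make the hot spot concrete, the cost reported as `asyncQueueReadPixelsMs` corresponds to a call shaped roughly like the sketch below; buffer orphaning, fence creation, and error handling are omitted, and the GL loader header is whichever one the project already uses.
```cpp
#include <chrono>
#include <GL/glew.h>  // stand-in for the project's GL loader

// Times only the glReadPixels submission into an already-allocated PBO.
double QueueAsyncReadbackMs(GLuint pbo, int width, int height)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    const auto start = std::chrono::steady_clock::now();
    // A nullptr offset reads into the bound pack buffer, not CPU memory.
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
    const auto end = std::chrono::steady_clock::now();
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```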
## Experiment 1: sync
Status: sampled
Question:
Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?
Expected interpretation:
- If `syncReadMs` is lower than `asyncQueueReadPixelsMs` and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If `syncReadMs` is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.
Results:
Date: 2026-05-11
Summary:
- ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts continued increasing
- `asyncQueueMs` and async readback counters were 0, confirming the experiment mode was active
- direct `syncReadMs` was generally worse than the baseline PBO `asyncQueueReadPixelsMs`
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |
Read:
Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.
## Experiment 2: cached_only
Status: sampled
Question:
If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?
Expected interpretation:
- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere in scheduling, frame acquisition, queueing, or DeckLink handoff.
Results:
Date: 2026-05-11
User-visible result:
- DeckLink reported a healthy 5-frame buffer.
Telemetry summary:
- `renderRequestMs` dropped to roughly 1-3 ms.
- `cachedCopyMs` was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- `asyncQueueMs`, `asyncQueueReadPixelsMs`, `syncReadMs`, fence wait, map, and async copy were 0 after bootstrap.
- `syncFallbackCount` stayed at 1, confirming one bootstrap readback.
- `cachedFallbackCount` increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |
Read:
Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.
The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
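One plausible shape for "avoid a blocking readback on every output frame" is to consume readbacks a fixed number of frames behind where they are issued, so fences get extra time to signal before anything waits on them. The ring below is only an illustration of that idea under assumed names, not a decided design for this project.
```cpp
#include <array>
#include <cstddef>

struct PboSlot {
    unsigned int pbo = 0;       // GL buffer object name
    void* fence = nullptr;      // GLsync in real code
    bool pending = false;       // readback issued, not yet consumed
};

// Issue into slot `head`, consume slot `head - delay`; with delay >= 1 the
// consumed slot has had whole frames for its fence to signal.
template <std::size_t N>
struct ReadbackRing {
    static constexpr std::size_t delay = 2;
    static_assert(N > delay, "ring must be deeper than the consume delay");

    std::array<PboSlot, N> slots{};
    std::size_t head = 0;

    PboSlot& IssueSlot() { return slots[head % N]; }
    PboSlot& ConsumeSlot() { return slots[(head + N - delay) % N]; }
    void Advance() { ++head; }
};
```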

View File

@@ -363,6 +363,10 @@ components:
$ref: "#/components/schemas/VideoIOStatus"
performance:
$ref: "#/components/schemas/PerformanceStatus"
backendPlayout:
$ref: "#/components/schemas/BackendPlayoutStatus"
runtimeEvents:
$ref: "#/components/schemas/RuntimeEventStatus"
shaders:
type: array
items:
@@ -382,10 +386,16 @@ components:
          type: number
        oscPort:
          type: number
        oscBindAddress:
          type: string
        oscSmoothing:
          type: number
        autoReload:
          type: boolean
        maxTemporalHistoryFrames:
          type: number
        previewFps:
          type: number
        enableExternalKeying:
          type: boolean
        inputVideoFormat:
@@ -478,6 +488,175 @@ components:
          type: number
        flushedFrameCount:
          type: number
    BackendPlayoutStatus:
      type: object
      properties:
        lifecycleState:
          type: string
          example: running
        degraded:
          type: boolean
        statusMessage:
          type: string
        lateFrameCount:
          type: number
        droppedFrameCount:
          type: number
        flushedFrameCount:
          type: number
        readyQueue:
          $ref: "#/components/schemas/BackendReadyQueueStatus"
        outputRender:
          $ref: "#/components/schemas/BackendOutputRenderStatus"
        recovery:
          $ref: "#/components/schemas/BackendPlayoutRecoveryStatus"
    BackendReadyQueueStatus:
      type: object
      properties:
        depth:
          type: number
          description: Current number of ready output frames.
        capacity:
          type: number
          description: Maximum ready output frames currently allowed.
        minDepth:
          type: number
          description: Minimum observed ready queue depth since backend worker start.
        maxDepth:
          type: number
          description: Maximum observed ready queue depth since backend worker start.
        zeroDepthCount:
          type: number
          description: Number of observed samples where the ready queue was empty.
        pushedCount:
          type: number
        poppedCount:
          type: number
        droppedCount:
          type: number
        underrunCount:
          type: number
    BackendOutputRenderStatus:
      type: object
      properties:
        renderMs:
          type: number
          description: Most recent output render duration in milliseconds.
        smoothedRenderMs:
          type: number
          description: Smoothed output render duration in milliseconds.
        maxRenderMs:
          type: number
          description: Maximum observed output render duration in milliseconds.
        acquireFrameMs:
          type: number
          description: Time spent acquiring a writable backend output frame in milliseconds.
        renderRequestMs:
          type: number
          description: Time spent executing the render-thread output frame request in milliseconds.
        endAccessMs:
          type: number
          description: Time spent ending write access to the backend output frame in milliseconds.
        queueWaitMs:
          type: number
          description: Time the output render request spent waiting for the render thread in milliseconds.
        drawMs:
          type: number
          description: Time spent drawing, blitting, packing, and flushing the output frame in milliseconds.
        fenceWaitMs:
          type: number
          description: Time spent waiting for the async readback fence in milliseconds.
        mapMs:
          type: number
          description: Time spent mapping the async readback pixel buffer in milliseconds.
        readbackCopyMs:
          type: number
          description: Time spent copying async readback bytes into the backend output frame in milliseconds.
        cachedCopyMs:
          type: number
          description: Time spent copying the cached output frame when async readback is not ready in milliseconds.
        asyncQueueMs:
          type: number
          description: Time spent queueing the next async readback in milliseconds.
        asyncQueueBufferMs:
          type: number
          description: Time spent orphaning or allocating the async readback pixel buffer in milliseconds.
        asyncQueueSetupMs:
          type: number
          description: Time spent applying readback pixel-store, framebuffer, and pixel-pack-buffer state in milliseconds.
        asyncQueueReadPixelsMs:
          type: number
          description: Time spent issuing glReadPixels for the async readback in milliseconds.
        asyncQueueFenceMs:
          type: number
          description: Time spent creating the async readback fence in milliseconds.
        syncReadMs:
          type: number
          description: Time spent in bootstrap synchronous readback in milliseconds.
        asyncReadbackMissCount:
          type: number
          description: Count of output render requests where async readback was not ready.
        cachedFallbackCount:
          type: number
          description: Count of output render requests served from the cached output frame.
        syncFallbackCount:
          type: number
          description: Count of output render requests that used bootstrap synchronous readback.
    BackendPlayoutRecoveryStatus:
      type: object
      properties:
        completionResult:
          type: string
          enum: [Completed, DisplayedLate, Dropped, Flushed, Unknown]
        completedFrameIndex:
          type: number
        scheduledFrameIndex:
          type: number
        scheduledLeadFrames:
          type: number
        measuredLagFrames:
          type: number
        catchUpFrames:
          type: number
        lateStreak:
          type: number
        dropStreak:
          type: number
    RuntimeEventStatus:
      type: object
      properties:
        queue:
          $ref: "#/components/schemas/RuntimeEventQueueStatus"
        dispatch:
          $ref: "#/components/schemas/RuntimeEventDispatchStatus"
    RuntimeEventQueueStatus:
      type: object
      properties:
        name:
          type: string
        depth:
          type: number
        capacity:
          type: number
        droppedCount:
          type: number
        oldestEventAgeMs:
          type: number
    RuntimeEventDispatchStatus:
      type: object
      properties:
        dispatchCallCount:
          type: number
        dispatchedEventCount:
          type: number
        handlerInvocationCount:
          type: number
        handlerFailureCount:
          type: number
        lastDispatchDurationMs:
          type: number
        maxDispatchDurationMs:
          type: number
    ShaderSummary:
      type: object
      properties:
@@ -497,6 +676,8 @@ components:
          description: Error text for unavailable shader packages.
        temporal:
          $ref: "#/components/schemas/TemporalState"
        feedback:
          $ref: "#/components/schemas/FeedbackState"
    TemporalState:
      type: object
      properties:
@@ -509,6 +690,13 @@ components:
          type: number
        effectiveHistoryLength:
          type: number
    FeedbackState:
      type: object
      properties:
        enabled:
          type: boolean
        writePass:
          type: string
    LayerState:
      type: object
      properties: