Performance chasing
@@ -7,7 +7,7 @@ Phase 7 made backend lifecycle, playout policy, ready-frame queueing, late/drop
## Status

- Phase 7.5 design package: proposed.
- Phase 7.5 implementation: Step 2 in progress.
- Phase 7.5 implementation: Step 5 in progress.
- Current alignment: Phase 7 is complete. `RenderOutputQueue`, `VideoPlayoutPolicy`, `VideoPlayoutScheduler`, `VideoBackendLifecycle`, and backend playout telemetry exist. The backend worker fills the ready queue on completion demand, but render production is not yet proactively driven by queue pressure or video cadence.

Current footholds:
@@ -19,6 +19,9 @@ Current footholds:
- `HealthTelemetry::BackendPlayoutSnapshot` exposes queue depth, underruns, late/drop streaks, and recovery decisions.
- Step 1 adds baseline timing fields for ready-queue min/max/zero-depth samples and output render duration.
- Step 2 adds a pure `OutputProductionController` for queue-pressure production decisions.
- Step 3 adds a proactive output producer worker that keeps `RenderOutputQueue` warm after playback starts.
- Step 4 skips non-forced preview presentation while output ready-queue depth is below target.
- Step 5 makes async readback misses prefer cached output over synchronous readback after bootstrap.

## Timing Review Findings

@@ -199,15 +202,23 @@ Move from demand-filled output production to queue-pressure production.

Initial target:

- producer wakes when queue depth is below target
- producer requests render-thread output production until target depth is reached
- producer stops when backend stops or render thread shuts down
- completion worker mostly schedules from already-ready frames
- [x] producer wakes when queue depth is below target
- [x] producer requests render-thread output production until target depth is reached
- [x] producer stops when backend stops or render thread shuts down
- [x] completion worker mostly schedules from already-ready frames

Exit criteria:

- normal playback does not depend on completion processing to fill the queue from empty
- callback/completion pressure and render production pressure are separate
- [x] normal playback does not depend on completion processing to fill the queue from empty
- [x] callback/completion pressure and render production pressure are separate

Implementation notes:

- `VideoBackend` starts the completion worker before device start, then starts the output producer only after DeckLink start succeeds. This avoids fighting DeckLink preroll for the same output frame pool.
- `OutputProducerWorkerMain()` periodically wakes and uses `OutputProductionController` to decide whether to produce, wait, or throttle.
- Completion handling records pacing/recovery, updates producer pressure, schedules a ready frame, and wakes the producer to refill headroom.
- Completion handling keeps a one-frame synchronous fallback when the ready queue is unexpectedly empty, then falls back to black underrun behavior if that also fails.
- Producer shutdown is explicit and joined before video output teardown.
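
The produce/wait/throttle decision described in the notes above could look roughly like the following. This is only a sketch: `ProductionAction`, `QueuePressure`, and `DecideProduction` are hypothetical names, and the exact meaning of "wait" versus "throttle" here (at target versus at capacity) is an assumption, not the actual `OutputProductionController` interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical types for illustration; the real controller may differ.
enum class ProductionAction { Produce, Wait, Throttle };

struct QueuePressure {
    std::size_t readyDepth;   // frames currently in the ready queue
    std::size_t targetDepth;  // depth the producer tries to maintain
    std::size_t capacity;     // hard upper bound on the ready queue
};

// Pure decision function: no locks, clocks, or side effects, so it can be
// unit-tested without a render thread or a DeckLink device.
inline ProductionAction DecideProduction(const QueuePressure& pressure) {
    assert(pressure.targetDepth <= pressure.capacity);
    if (pressure.readyDepth >= pressure.capacity) {
        return ProductionAction::Throttle;  // assumed: queue full, back off
    }
    if (pressure.readyDepth < pressure.targetDepth) {
        return ProductionAction::Produce;   // below target, request a render
    }
    return ProductionAction::Wait;          // assumed: at target, sleep until woken
}
```

Keeping the decision free of side effects means the producer worker loop only has to act on the returned value, which is the property that makes a "pure" controller testable in isolation.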

### Step 4. Prioritize Playout Over Preview

@@ -215,15 +226,21 @@ Make preview explicitly subordinate to output playout deadlines.

Initial target:

- skip or delay preview when ready queue depth is below target
- [x] skip or delay preview when ready queue depth is below target
- count skipped previews
- record preview present cost separately from output render cost

Exit criteria:

- preview cannot drain output headroom invisibly
- [x] preview cannot drain output headroom invisibly
- runtime telemetry shows preview skips and preview present cost

Implementation notes:

- `OpenGLComposite::paintGL(false)` now skips preview presentation when `VideoBackend` reports that the ready queue is below the target depth.
- Forced preview paints are still allowed so resize/manual paint behavior remains intact.
- Preview skip counters and present-cost telemetry remain follow-up work for this step.
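
As a rough illustration of the skip rule in the notes above, the gate below lets forced paints through and skips non-forced paints while the ready queue is below target. `PreviewGate` and `ShouldPresent` are hypothetical names, and the skip counter is shown only as one possible shape for the follow-up telemetry; it is not part of the current implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration; not the actual OpenGLComposite code.
struct PreviewGate {
    std::uint64_t skippedPreviews = 0;  // possible follow-up counter (assumption)

    // Returns true when the preview should be presented this paint.
    bool ShouldPresent(bool forced, std::size_t readyDepth, std::size_t targetDepth) {
        if (forced) {
            return true;  // resize/manual paints are always allowed
        }
        if (readyDepth < targetDepth) {
            ++skippedPreviews;  // output headroom is low, so preview yields
            return false;
        }
        return true;
    }
};
```

Counting skips at the decision point would make the "preview cannot drain output headroom invisibly" criterion observable without timing the present call itself.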

### Step 5. Make Readback Miss Policy Deadline-Aware

Avoid turning a late async readback fence into synchronous deadline pressure by default.
@@ -232,13 +249,20 @@ Initial target:

- count async readback misses
- count synchronous fallback uses
- allow policy to prefer stale/black output over synchronous fallback when queue pressure is high
- keep current fallback available while behavior is measured
- [x] allow policy to prefer stale/black output over synchronous fallback when queue pressure is high
- [x] keep current fallback available while behavior is measured

Exit criteria:

- readback fallback is an explicit policy decision
- late GPU fences do not automatically block the most timing-sensitive path
- [x] readback fallback is an explicit policy decision
- [x] late GPU fences do not automatically block the most timing-sensitive path

Implementation notes:

- `OpenGLRenderPipeline::ReadOutputFrame()` now uses synchronous readback only to bootstrap the first cached output frame.
- After cached output exists, an async readback miss copies the cached output frame into the DeckLink output frame instead of blocking on synchronous `glReadPixels`.
- Async readback queueing now skips when the next PBO slot is still in flight rather than deleting an in-flight fence and overwriting it.
- Miss/fallback counters remain follow-up telemetry work for this step.
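
The miss policy in the notes above reduces to a three-way choice, sketched below under stated assumptions. `ReadbackSource`, `ReadbackCounters`, and `ChooseReadbackSource` are hypothetical names; the counters are included only to show where the follow-up miss/fallback telemetry could be incremented, and no OpenGL calls are made.

```cpp
#include <cstdint>

// Hypothetical types for illustration; not the actual ReadOutputFrame() code.
enum class ReadbackSource { AsyncReadback, CachedOutput, SyncBootstrap };

struct ReadbackCounters {
    std::uint64_t asyncMisses = 0;
    std::uint64_t cachedFallbacks = 0;
    std::uint64_t syncFallbacks = 0;
};

// Picks where this output frame's pixels come from.
inline ReadbackSource ChooseReadbackSource(bool asyncFrameReady,
                                           bool hasCachedOutput,
                                           ReadbackCounters& counters) {
    if (asyncFrameReady) {
        return ReadbackSource::AsyncReadback;  // normal path: fence signalled
    }
    ++counters.asyncMisses;
    if (hasCachedOutput) {
        ++counters.cachedFallbacks;            // reuse the last good frame
        return ReadbackSource::CachedOutput;
    }
    ++counters.syncFallbacks;                  // only before any frame is cached
    return ReadbackSource::SyncBootstrap;
}
```

With this shape, synchronous readback can only be selected while no cached frame exists, which matches the "bootstrap only" behavior described above.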

### Step 6. Tune Headroom Policy

165
docs/PHASE_7_5_READBACK_EXPERIMENT_LOG.md
Normal file
@@ -0,0 +1,165 @@
# Phase 7.5 Readback Experiment Log

This log tracks short readback experiments during the proactive playout timing work.

## How To Run

The default debugger launch keeps the current production path:

- `Debug LoopThroughWithOpenGLCompositing`
  - `VST_OUTPUT_READBACK_MODE` unset
  - mode: `async_pbo`

Comparison modes are still available:

- `VST_OUTPUT_READBACK_MODE=async_pbo`
  - uses the older PBO/fence readback path

The experiment launches are:

- `Debug LoopThroughWithOpenGLCompositing - sync readback experiment`
  - `VST_OUTPUT_READBACK_MODE=sync`
  - uses direct synchronous `glReadPixels()` every output frame

- `Debug LoopThroughWithOpenGLCompositing - cached output experiment`
  - `VST_OUTPUT_READBACK_MODE=cached_only`
  - uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback

The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
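
For orientation, mode selection from `VST_OUTPUT_READBACK_MODE` could be sketched as below. The enum and function names are hypothetical, and treating an unrecognized value the same as unset is an assumption; the log only documents the unset, `async_pbo`, `sync`, and `cached_only` cases.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical names for illustration; the application's own parser may differ.
enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

// Maps the environment variable's value to a mode. nullptr means "unset".
inline ReadbackMode ParseReadbackMode(const char* value) {
    if (value == nullptr) {
        return ReadbackMode::AsyncPbo;  // unset keeps the default path
    }
    const std::string mode(value);
    if (mode == "sync") {
        return ReadbackMode::Sync;
    }
    if (mode == "cached_only") {
        return ReadbackMode::CachedOnly;
    }
    // "async_pbo" lands here; unknown values also fall back (assumption).
    return ReadbackMode::AsyncPbo;
}

inline ReadbackMode ReadbackModeFromEnvironment() {
    return ParseReadbackMode(std::getenv("VST_OUTPUT_READBACK_MODE"));
}
```

Splitting the parser from the `std::getenv` call keeps the mapping testable without mutating the process environment.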

## Experiment 3: fast_transfer

Status: removed from active code after hardware sample

Date: 2026-05-11

Change:

- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`.
- `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to cached output if present, then synchronous readback as a safety fallback.

Question:

Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?

Result:

- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing `Quadro`.
- `GL_AMD_pinned_memory` was also unavailable.
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.

## Baseline: async_pbo

Date: 2026-05-11

Observed while the app was running after adding the async queue split counters.

Summary:

- ready queue was pinned at 0 or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- `renderRequestMs` usually sat around 16-25 ms, with occasional larger spikes
- `asyncQueueMs` was mostly explained by `asyncQueueReadPixelsMs`
- PBO allocation/orphaning was effectively 0 ms

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |

Initial read:

The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
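
To make the "mostly explained by" reading concrete, the small check below computes the share of `asyncQueueMs` taken by `asyncQueueReadPixelsMs` for the four representative samples in the table. The struct and function names are hypothetical; the numbers are copied from the table, and in every sampled row the `glReadPixels` call accounts for more than 99% of the async queueing time.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper types; values are the four table rows above.
struct AsyncQueueSample {
    double asyncQueueMs;
    double readPixelsMs;
};

constexpr std::array<AsyncQueueSample, 4> kBaselineSamples{{
    {9.018, 9.001},
    {8.611, 8.598},
    {10.323, 10.311},
    {9.416, 9.403},
}};

// Smallest readPixels/asyncQueue ratio across the sampled rows.
inline double MinReadPixelsShare() {
    double minShare = 1.0;
    for (const AsyncQueueSample& sample : kBaselineSamples) {
        assert(sample.asyncQueueMs > 0.0);  // guard the division
        const double share = sample.readPixelsMs / sample.asyncQueueMs;
        if (share < minShare) {
            minShare = share;
        }
    }
    return minShare;
}
```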

## Experiment 1: sync

Status: sampled

Question:

Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?

Expected interpretation:

- If `syncReadMs` is lower than `asyncQueueReadPixelsMs` and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If `syncReadMs` is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.

Results:

Date: 2026-05-11

Summary:

- ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts continued increasing
- `asyncQueueMs` and async readback counters were 0, confirming the experiment mode was active
- direct `syncReadMs` was generally worse than the baseline PBO `asyncQueueReadPixelsMs`

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |

Read:

Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.

## Experiment 2: cached_only

Status: sampled

Question:

If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?

Expected interpretation:

- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere in scheduling, frame acquisition, queueing, or DeckLink handoff.

Results:

Date: 2026-05-11

User-visible result:

- DeckLink reported a healthy 5-frame buffer.

Telemetry summary:

- `renderRequestMs` dropped to roughly 1-3 ms.
- `cachedCopyMs` was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- `asyncQueueMs`, `asyncQueueReadPixelsMs`, `syncReadMs`, fence wait, map, and async copy were 0 after bootstrap.
- `syncFallbackCount` stayed at 1, confirming one bootstrap readback.
- `cachedFallbackCount` increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.

Representative samples:

| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |

Read:

Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.

The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
@@ -363,6 +363,10 @@ components:
          $ref: "#/components/schemas/VideoIOStatus"
        performance:
          $ref: "#/components/schemas/PerformanceStatus"
        backendPlayout:
          $ref: "#/components/schemas/BackendPlayoutStatus"
        runtimeEvents:
          $ref: "#/components/schemas/RuntimeEventStatus"
        shaders:
          type: array
          items:
@@ -382,10 +386,16 @@ components:
          type: number
        oscPort:
          type: number
        oscBindAddress:
          type: string
        oscSmoothing:
          type: number
        autoReload:
          type: boolean
        maxTemporalHistoryFrames:
          type: number
        previewFps:
          type: number
        enableExternalKeying:
          type: boolean
        inputVideoFormat:
@@ -478,6 +488,175 @@ components:
          type: number
        flushedFrameCount:
          type: number
    BackendPlayoutStatus:
      type: object
      properties:
        lifecycleState:
          type: string
          example: running
        degraded:
          type: boolean
        statusMessage:
          type: string
        lateFrameCount:
          type: number
        droppedFrameCount:
          type: number
        flushedFrameCount:
          type: number
        readyQueue:
          $ref: "#/components/schemas/BackendReadyQueueStatus"
        outputRender:
          $ref: "#/components/schemas/BackendOutputRenderStatus"
        recovery:
          $ref: "#/components/schemas/BackendPlayoutRecoveryStatus"
    BackendReadyQueueStatus:
      type: object
      properties:
        depth:
          type: number
          description: Current number of ready output frames.
        capacity:
          type: number
          description: Maximum ready output frames currently allowed.
        minDepth:
          type: number
          description: Minimum observed ready queue depth since backend worker start.
        maxDepth:
          type: number
          description: Maximum observed ready queue depth since backend worker start.
        zeroDepthCount:
          type: number
          description: Number of observed samples where the ready queue was empty.
        pushedCount:
          type: number
        poppedCount:
          type: number
        droppedCount:
          type: number
        underrunCount:
          type: number
    BackendOutputRenderStatus:
      type: object
      properties:
        renderMs:
          type: number
          description: Most recent output render duration in milliseconds.
        smoothedRenderMs:
          type: number
          description: Smoothed output render duration in milliseconds.
        maxRenderMs:
          type: number
          description: Maximum observed output render duration in milliseconds.
        acquireFrameMs:
          type: number
          description: Time spent acquiring a writable backend output frame in milliseconds.
        renderRequestMs:
          type: number
          description: Time spent executing the render-thread output frame request in milliseconds.
        endAccessMs:
          type: number
          description: Time spent ending write access to the backend output frame in milliseconds.
        queueWaitMs:
          type: number
          description: Time the output render request spent waiting for the render thread in milliseconds.
        drawMs:
          type: number
          description: Time spent drawing, blitting, packing, and flushing the output frame in milliseconds.
        fenceWaitMs:
          type: number
          description: Time spent waiting for the async readback fence in milliseconds.
        mapMs:
          type: number
          description: Time spent mapping the async readback pixel buffer in milliseconds.
        readbackCopyMs:
          type: number
          description: Time spent copying async readback bytes into the backend output frame in milliseconds.
        cachedCopyMs:
          type: number
          description: Time spent copying the cached output frame when async readback is not ready in milliseconds.
        asyncQueueMs:
          type: number
          description: Time spent queueing the next async readback in milliseconds.
        asyncQueueBufferMs:
          type: number
          description: Time spent orphaning or allocating the async readback pixel buffer in milliseconds.
        asyncQueueSetupMs:
          type: number
          description: Time spent applying readback pixel-store, framebuffer, and pixel-pack-buffer state in milliseconds.
        asyncQueueReadPixelsMs:
          type: number
          description: Time spent issuing glReadPixels for the async readback in milliseconds.
        asyncQueueFenceMs:
          type: number
          description: Time spent creating the async readback fence in milliseconds.
        syncReadMs:
          type: number
          description: Time spent in bootstrap synchronous readback in milliseconds.
        asyncReadbackMissCount:
          type: number
          description: Count of output render requests where async readback was not ready.
        cachedFallbackCount:
          type: number
          description: Count of output render requests served from the cached output frame.
        syncFallbackCount:
          type: number
          description: Count of output render requests that used bootstrap synchronous readback.
    BackendPlayoutRecoveryStatus:
      type: object
      properties:
        completionResult:
          type: string
          enum: [Completed, DisplayedLate, Dropped, Flushed, Unknown]
        completedFrameIndex:
          type: number
        scheduledFrameIndex:
          type: number
        scheduledLeadFrames:
          type: number
        measuredLagFrames:
          type: number
        catchUpFrames:
          type: number
        lateStreak:
          type: number
        dropStreak:
          type: number
    RuntimeEventStatus:
      type: object
      properties:
        queue:
          $ref: "#/components/schemas/RuntimeEventQueueStatus"
        dispatch:
          $ref: "#/components/schemas/RuntimeEventDispatchStatus"
    RuntimeEventQueueStatus:
      type: object
      properties:
        name:
          type: string
        depth:
          type: number
        capacity:
          type: number
        droppedCount:
          type: number
        oldestEventAgeMs:
          type: number
    RuntimeEventDispatchStatus:
      type: object
      properties:
        dispatchCallCount:
          type: number
        dispatchedEventCount:
          type: number
        handlerInvocationCount:
          type: number
        handlerFailureCount:
          type: number
        lastDispatchDurationMs:
          type: number
        maxDispatchDurationMs:
          type: number
    ShaderSummary:
      type: object
      properties:
@@ -497,6 +676,8 @@ components:
          description: Error text for unavailable shader packages.
        temporal:
          $ref: "#/components/schemas/TemporalState"
        feedback:
          $ref: "#/components/schemas/FeedbackState"
    TemporalState:
      type: object
      properties:
@@ -509,6 +690,13 @@ components:
          type: number
        effectiveHistoryLength:
          type: number
    FeedbackState:
      type: object
      properties:
        enabled:
          type: boolean
        writePass:
          type: string
    LayerState:
      type: object
      properties: