Performance chasing
All checks were successful
CI / React UI Build (push) Successful in 10s
CI / Native Windows Build And Tests (push) Successful in 2m51s
CI / Windows Release Package (push) Successful in 2m55s

Aiden
2026-05-11 23:10:45 +10:00
parent c5cead6003
commit a434a88108
18 changed files with 1115 additions and 82 deletions

View File

@@ -7,7 +7,7 @@ Phase 7 made backend lifecycle, playout policy, ready-frame queueing, late/drop
## Status
- Phase 7.5 design package: proposed.
- Phase 7.5 implementation: Step 2 in progress.
- Phase 7.5 implementation: Step 5 in progress.
- Current alignment: Phase 7 is complete. `RenderOutputQueue`, `VideoPlayoutPolicy`, `VideoPlayoutScheduler`, `VideoBackendLifecycle`, and backend playout telemetry exist. The backend worker fills the ready queue on completion demand, but render production is not yet proactively driven by queue pressure or video cadence.
Current footholds:
@@ -19,6 +19,9 @@ Current footholds:
- `HealthTelemetry::BackendPlayoutSnapshot` exposes queue depth, underruns, late/drop streaks, and recovery decisions.
- Step 1 adds baseline timing fields for ready-queue min/max/zero-depth samples and output render duration.
- Step 2 adds a pure `OutputProductionController` for queue-pressure production decisions.
- Step 3 adds a proactive output producer worker that keeps `RenderOutputQueue` warm after playback starts.
- Step 4 skips non-forced preview presentation while output ready-queue depth is below target.
- Step 5 makes async readback misses prefer cached output over synchronous readback after bootstrap.
## Timing Review Findings
@@ -199,15 +202,23 @@ Move from demand-filled output production to queue-pressure production.
Initial target:
- producer wakes when queue depth is below target
- producer requests render-thread output production until target depth is reached
- producer stops when backend stops or render thread shuts down
- completion worker mostly schedules from already-ready frames
- [x] producer wakes when queue depth is below target
- [x] producer requests render-thread output production until target depth is reached
- [x] producer stops when backend stops or render thread shuts down
- [x] completion worker mostly schedules from already-ready frames
Exit criteria:
- normal playback does not depend on completion processing to fill the queue from empty
- callback/completion pressure and render production pressure are separate
- [x] normal playback does not depend on completion processing to fill the queue from empty
- [x] callback/completion pressure and render production pressure are separate
Implementation notes:
- `VideoBackend` starts the completion worker before device start, then starts the output producer only after DeckLink start succeeds. This avoids fighting DeckLink preroll for the same output frame pool.
- `OutputProducerWorkerMain()` periodically wakes and uses `OutputProductionController` to decide whether to produce, wait, or throttle.
- Completion handling records pacing/recovery, updates producer pressure, schedules a ready frame, and wakes the producer to refill headroom.
- Completion handling keeps a one-frame synchronous fallback when the ready queue is unexpectedly empty, then falls back to black underrun behavior if that also fails.
- Producer shutdown is explicit and joined before video output teardown.
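The notes above can be pictured as a small timed loop around a pure depth check. The sketch below is a standalone illustration under assumed names: `OutputProductionController`, `OutputProducerWorkerMain()`, and `RenderOutputQueue` are real names from this document, but the members, the 4 ms wake interval, and the stand-in `Decide()` helper are assumptions, not the actual API.
```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

enum class ProductionAction { Produce, Wait };

struct OutputProducerSketch {
    std::atomic<bool> stopRequested{false};
    std::atomic<int> readyDepth{0};   // stand-in for RenderOutputQueue depth
    int targetDepth = 3;
    std::mutex wakeMutex;
    std::condition_variable wake;     // signalled by completion handling

    // Stand-in for OutputProductionController: produce while below target.
    ProductionAction Decide() const {
        return readyDepth.load() < targetDepth ? ProductionAction::Produce
                                               : ProductionAction::Wait;
    }

    // Stand-in for OutputProducerWorkerMain(): periodic wake plus explicit
    // wake-ups, then at most one production request per iteration.
    void WorkerMain() {
        while (!stopRequested.load()) {
            {
                std::unique_lock<std::mutex> lock(wakeMutex);
                wake.wait_for(lock, std::chrono::milliseconds(4));
            }
            if (stopRequested.load())
                break;
            if (Decide() == ProductionAction::Produce) {
                // Real code asks the render thread for one output frame and
                // pushes the result onto the ready queue; modelled as a bump.
                ++readyDepth;
            }
        }
    }
};
```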
### Step 4. Prioritize Playout Over Preview
@@ -215,15 +226,21 @@ Make preview explicitly subordinate to output playout deadlines.
Initial target:
- skip or delay preview when ready queue depth is below target
- [x] skip or delay preview when ready queue depth is below target
- count skipped previews
- record preview present cost separately from output render cost
Exit criteria:
- preview cannot drain output headroom invisibly
- [x] preview cannot drain output headroom invisibly
- runtime telemetry shows preview skips and preview present cost
Implementation notes:
- `OpenGLComposite::paintGL(false)` now skips preview presentation when `VideoBackend` reports that the ready queue is below the target depth.
- Forced preview paints are still allowed so resize/manual paint behavior remains intact.
- Preview skip counters and present-cost telemetry remain follow-up work for this step.
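As a rough illustration, the non-forced skip reduces to a single comparison; the helper below uses hypothetical names and is not the actual `OpenGLComposite::paintGL()` code.
```cpp
// Hedged sketch of the Step 4 preview gate; names are illustrative only.
bool ShouldPresentPreview(bool forcedPaint, int readyQueueDepth, int targetDepth)
{
    // Forced paints (resize, explicit repaint) are always presented.
    if (forcedPaint)
        return true;
    // Non-forced preview yields whenever output headroom is below target.
    return readyQueueDepth >= targetDepth;
}
```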
### Step 5. Make Readback Miss Policy Deadline-Aware
Avoid turning a late async readback fence into synchronous deadline pressure by default.
@@ -232,13 +249,20 @@ Initial target:
- count async readback misses
- count synchronous fallback uses
- allow policy to prefer stale/black output over synchronous fallback when queue pressure is high
- keep current fallback available while behavior is measured
- [x] allow policy to prefer stale/black output over synchronous fallback when queue pressure is high
- [x] keep current fallback available while behavior is measured
Exit criteria:
- readback fallback is an explicit policy decision
- late GPU fences do not automatically block the most timing-sensitive path
- [x] readback fallback is an explicit policy decision
- [x] late GPU fences do not automatically block the most timing-sensitive path
Implementation notes:
- `OpenGLRenderPipeline::ReadOutputFrame()` now uses synchronous readback only to bootstrap the first cached output frame.
- After cached output exists, an async readback miss copies the cached output frame into the DeckLink output frame instead of blocking on synchronous `glReadPixels`.
- Async readback queueing now skips when the next PBO slot is still in flight rather than deleting an in-flight fence and overwriting it.
- Miss/fallback counters remain follow-up telemetry work for this step.
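The notes above amount to an explicit source-selection policy per output frame. A hedged sketch of that decision follows; the enum and helper names are illustrative, not the actual `OpenGLRenderPipeline` API.
```cpp
// Sketch of the Step 5 readback-miss policy; names are illustrative only.
enum class ReadbackSource { AsyncPbo, CachedOutput, SyncBootstrap };

ReadbackSource ChooseReadbackSource(bool asyncReadbackReady, bool haveCachedOutput)
{
    if (asyncReadbackReady)
        return ReadbackSource::AsyncPbo;      // fence signalled: map and copy
    if (haveCachedOutput)
        return ReadbackSource::CachedOutput;  // prefer a stale frame over blocking
    return ReadbackSource::SyncBootstrap;     // only until the first cached frame exists
}
```
Once the first cached frame exists, `haveCachedOutput` stays true, so a missed fence costs one CPU copy (reported as `cachedCopyMs`) rather than a blocking `glReadPixels`.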
### Step 6. Tune Headroom Policy

View File

@@ -0,0 +1,165 @@
# Phase 7.5 Readback Experiment Log
This log tracks short readback experiments during the proactive playout timing work.
## How To Run
The default debugger launch keeps the current production path:
- `Debug LoopThroughWithOpenGLCompositing`
  - `VST_OUTPUT_READBACK_MODE` unset
  - mode: `async_pbo`
Comparison modes are still available:
- `VST_OUTPUT_READBACK_MODE=async_pbo`
  - uses the older PBO/fence readback path
The experiment launches are:
- `Debug LoopThroughWithOpenGLCompositing - sync readback experiment`
  - `VST_OUTPUT_READBACK_MODE=sync`
  - uses direct synchronous `glReadPixels()` on every output frame
- `Debug LoopThroughWithOpenGLCompositing - cached output experiment`
  - `VST_OUTPUT_READBACK_MODE=cached_only`
  - uses one bootstrap synchronous readback, then copies the cached output frame without ongoing GPU readback
The cached-output experiment is not visually correct for live motion. It exists to test whether removing ongoing GPU readback lets the producer fill the ready queue again.
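For reference, the mode switch is just an environment-variable lookup at startup; the parsing sketch below is an assumption about how `VST_OUTPUT_READBACK_MODE` might map to a mode, not the project's actual code.
```cpp
#include <cstdlib>
#include <string>

enum class ReadbackMode { AsyncPbo, Sync, CachedOnly };

// Hypothetical mapping from VST_OUTPUT_READBACK_MODE to a readback mode.
ReadbackMode ReadbackModeFromEnvironment()
{
    const char* raw = std::getenv("VST_OUTPUT_READBACK_MODE");
    const std::string value = raw ? raw : "";
    if (value == "sync")
        return ReadbackMode::Sync;
    if (value == "cached_only")
        return ReadbackMode::CachedOnly;
    // Unset or "async_pbo" keeps the default PBO/fence path.
    return ReadbackMode::AsyncPbo;
}
```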
## Experiment 3: fast_transfer
Status: removed from active code after hardware sample
Date: 2026-05-11
Change:
- DeckLink output frames are now created with `CreateVideoFrameWithBuffer()`.
- Output frame buffers are owned by `PinnedMemoryAllocator`.
- `VideoIOOutputFrame` carries a texture-transfer callback.
- The test branch changed the default render readback path to try `VideoFrameTransfer::GPUtoCPU` against the output texture for BGRA output.
- If fast transfer is unavailable or fails, the code falls back to cached output if present, then synchronous readback as a safety fallback.
Question:
Can SDK-style pinned/DVP transfer recover real rendered output timing without the visually-invalid cached-only shortcut?
Result:
- The test machine reported `GL_VENDOR=NVIDIA Corporation` and `GL_RENDERER=NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2`.
- The DeckLink SDK OpenGL fast-transfer sample gates NVIDIA DVP on `GL_RENDERER` containing `Quadro`.
- `GL_AMD_pinned_memory` was also unavailable.
- The fast-transfer path was removed from active code to avoid carrying unsupported DVP dependencies while we investigate CPU-frame buffering and render-ahead.
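For context, the gate that ruled out fast transfer on this machine reduces to two capability checks; the helper below is an illustrative reconstruction, not the SDK sample code.
```cpp
#include <string>

// Sketch of the fast-transfer capability gate. The extension list is passed
// in as one string for simplicity; real code would query it per-extension.
bool FastTransferLikelySupported(const std::string& glRenderer,
                                 const std::string& glExtensions)
{
    // The DeckLink SDK sample only enables NVIDIA DVP on Quadro renderers.
    const bool nvidiaDvp = glRenderer.find("Quadro") != std::string::npos;
    // AMD pinned memory requires GL_AMD_pinned_memory.
    const bool amdPinned =
        glExtensions.find("GL_AMD_pinned_memory") != std::string::npos;
    return nvidiaDvp || amdPinned;
}
```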
## Baseline: async_pbo
Date: 2026-05-11
Sampled while the app was running, after the async queue split counters were added.
Summary:
- ready queue was pinned at 0 or briefly 1
- underrun, zero-depth, late, and dropped counts increased continuously
- `renderRequestMs` usually sat around 16-25 ms, with occasional larger spikes
- `asyncQueueMs` was mostly explained by `asyncQueueReadPixelsMs`
- PBO allocation/orphaning was effectively 0 ms
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | mapMs | copyMs | asyncQueueMs | asyncQueueBufferMs | asyncQueueReadPixelsMs |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 24.915 | 3.018 | 0.510 | 0.923 | 0.768 | 9.018 | 0.000 | 9.001 |
| 0 | 16.226 | 3.066 | 0.518 | 1.202 | 0.812 | 8.611 | 0.000 | 8.598 |
| 0 | 12.134 | 3.796 | 3.579 | 1.378 | 0.690 | 10.323 | 0.000 | 10.311 |
| 0 | 17.496 | 2.817 | 0.523 | 1.267 | 1.160 | 9.416 | 0.000 | 9.403 |
Initial read:
The main repeated cost is issuing `glReadPixels(..., nullptr)` into the PBO. `glBufferData`, setup, fence creation, fence wait, map, and CPU copy are not large enough to explain the underruns.
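To make the hot spot concrete, the cost reported as `asyncQueueReadPixelsMs` corresponds to a call shaped roughly like the sketch below; buffer orphaning, fence creation, and error handling are omitted, and the GL loader header is whichever one the project already uses.
```cpp
#include <chrono>
#include <GL/glew.h>  // stand-in for the project's GL loader

// Times only the glReadPixels submission into an already-allocated PBO.
double QueueAsyncReadbackMs(GLuint pbo, int width, int height)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    const auto start = std::chrono::steady_clock::now();
    // A nullptr offset reads into the bound pack buffer, not CPU memory.
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
    const auto end = std::chrono::steady_clock::now();
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```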
## Experiment 1: sync
Status: sampled
Question:
Does the direct synchronous readback path perform better or worse than the current PBO path on this machine and DeckLink format?
Expected interpretation:
- If `syncReadMs` is lower than `asyncQueueReadPixelsMs` and the ready queue improves, the current PBO path is the wrong strategy for this driver/format.
- If `syncReadMs` is also high and the ready queue remains empty, any GPU-to-CPU readback in this path is too expensive for the current producer cadence.
Results:
Date: 2026-05-11
Summary:
- ready queue remained pinned at 0
- underrun, zero-depth, late, and dropped counts continued increasing
- `asyncQueueMs` and async readback counters were 0, confirming the experiment mode was active
- direct `syncReadMs` was generally worse than the baseline PBO `asyncQueueReadPixelsMs`
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | syncReadMs | asyncQueueMs | syncFallbackCount |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 32.467 | 5.764 | 1.389 | 23.122 | 0.000 | 680 |
| 0 | 29.722 | 2.603 | 0.512 | 25.538 | 0.000 | 697 |
| 0 | 37.844 | 7.716 | 0.518 | 23.608 | 0.000 | 706 |
| 0 | 22.304 | 3.089 | 1.843 | 15.278 | 0.000 | 723 |
| 0 | 27.196 | 4.015 | 0.500 | 21.933 | 0.000 | 736 |
Read:
Direct synchronous readback does not recover the queue and is slower than the async PBO path on the sampled run. The bottleneck appears to be GPU-to-CPU readback itself, not PBO orphaning or fence handling.
## Experiment 2: cached_only
Status: sampled
Question:
If ongoing GPU readback is removed after bootstrap, can the producer keep the ready queue above 0?
Expected interpretation:
- If ready depth rises and underruns slow or stop, readback is the primary bottleneck.
- If ready depth still stays near 0, the bottleneck is elsewhere in scheduling, frame acquisition, queueing, or DeckLink handoff.
Results:
Date: 2026-05-11
User-visible result:
- DeckLink reported a healthy 5-frame buffer.
Telemetry summary:
- `renderRequestMs` dropped to roughly 1-3 ms.
- `cachedCopyMs` was usually around 0.8-1.0 ms, with one sampled low value around 0.37 ms.
- `asyncQueueMs`, `asyncQueueReadPixelsMs`, `syncReadMs`, fence wait, map, and async copy were 0 after bootstrap.
- `syncFallbackCount` stayed at 1, confirming one bootstrap readback.
- `cachedFallbackCount` increased continuously, confirming ongoing frames were served from cached CPU memory.
- late and dropped counts were 0 during the sampled run.
- internal ready queue depth still reported mostly 0-1 even while DeckLink showed a healthy hardware/device buffer.
Representative samples:
| readyDepth | renderRequestMs | queueWaitMs | drawMs | cachedCopyMs | asyncQueueMs | syncReadMs | late | dropped |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 1.446 | 0.018 | 0.518 | 0.864 | 0.000 | 0.000 | 0 | 0 |
| 0 | 2.586 | 1.089 | 0.514 | 0.829 | 0.000 | 0.000 | 0 | 0 |
| 0 | 1.481 | 2.378 | 0.502 | 0.911 | 0.000 | 0.000 | 0 | 0 |
| 0 | 0.892 | 0.013 | 0.468 | 0.371 | 0.000 | 0.000 | 0 | 0 |
| 1 | 1.398 | 0.019 | 0.483 | 0.819 | 0.000 | 0.000 | 0 | 0 |
Read:
Removing ongoing GPU readback recovers output timing immediately. The direct cause of the Phase 7.5 playback collapse is the per-frame GPU-to-CPU readback cost, not DeckLink frame acquisition, output frame end-access, PBO allocation, fence waiting, or CPU copy.
The internal ready queue depth still being low while DeckLink reports a healthy device buffer suggests the ready queue is acting as a short staging queue rather than the full device playout buffer. For the next fix, prioritize avoiding a blocking readback on every output frame instead of only increasing internal ready queue depth.
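One plausible shape for "avoid a blocking readback on every output frame" is to consume readbacks a fixed number of frames behind where they are issued, so fences get extra time to signal before anything waits on them. The ring below is only an illustration of that idea under assumed names, not a decided design for this project.
```cpp
#include <array>
#include <cstddef>

struct PboSlot {
    unsigned int pbo = 0;       // GL buffer object name
    void* fence = nullptr;      // GLsync in real code
    bool pending = false;       // readback issued, not yet consumed
};

// Issue into slot `head`, consume slot `head - delay`; with delay >= 1 the
// consumed slot has had whole frames for its fence to signal.
template <std::size_t N>
struct ReadbackRing {
    static constexpr std::size_t delay = 2;
    static_assert(N > delay, "ring must be deeper than the consume delay");

    std::array<PboSlot, N> slots{};
    std::size_t head = 0;

    PboSlot& IssueSlot() { return slots[head % N]; }
    PboSlot& ConsumeSlot() { return slots[(head + N - delay) % N]; }
    void Advance() { ++head; }
};
```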

View File

@@ -363,6 +363,10 @@ components:
$ref: "#/components/schemas/VideoIOStatus"
performance:
$ref: "#/components/schemas/PerformanceStatus"
backendPlayout:
$ref: "#/components/schemas/BackendPlayoutStatus"
runtimeEvents:
$ref: "#/components/schemas/RuntimeEventStatus"
shaders:
type: array
items:
@@ -382,10 +386,16 @@ components:
          type: number
        oscPort:
          type: number
        oscBindAddress:
          type: string
        oscSmoothing:
          type: number
        autoReload:
          type: boolean
        maxTemporalHistoryFrames:
          type: number
        previewFps:
          type: number
        enableExternalKeying:
          type: boolean
        inputVideoFormat:
@@ -478,6 +488,175 @@ components:
          type: number
        flushedFrameCount:
          type: number
    BackendPlayoutStatus:
      type: object
      properties:
        lifecycleState:
          type: string
          example: running
        degraded:
          type: boolean
        statusMessage:
          type: string
        lateFrameCount:
          type: number
        droppedFrameCount:
          type: number
        flushedFrameCount:
          type: number
        readyQueue:
          $ref: "#/components/schemas/BackendReadyQueueStatus"
        outputRender:
          $ref: "#/components/schemas/BackendOutputRenderStatus"
        recovery:
          $ref: "#/components/schemas/BackendPlayoutRecoveryStatus"
    BackendReadyQueueStatus:
      type: object
      properties:
        depth:
          type: number
          description: Current number of ready output frames.
        capacity:
          type: number
          description: Maximum ready output frames currently allowed.
        minDepth:
          type: number
          description: Minimum observed ready queue depth since backend worker start.
        maxDepth:
          type: number
          description: Maximum observed ready queue depth since backend worker start.
        zeroDepthCount:
          type: number
          description: Number of observed samples where the ready queue was empty.
        pushedCount:
          type: number
        poppedCount:
          type: number
        droppedCount:
          type: number
        underrunCount:
          type: number
    BackendOutputRenderStatus:
      type: object
      properties:
        renderMs:
          type: number
          description: Most recent output render duration in milliseconds.
        smoothedRenderMs:
          type: number
          description: Smoothed output render duration in milliseconds.
        maxRenderMs:
          type: number
          description: Maximum observed output render duration in milliseconds.
        acquireFrameMs:
          type: number
          description: Time spent acquiring a writable backend output frame in milliseconds.
        renderRequestMs:
          type: number
          description: Time spent executing the render-thread output frame request in milliseconds.
        endAccessMs:
          type: number
          description: Time spent ending write access to the backend output frame in milliseconds.
        queueWaitMs:
          type: number
          description: Time the output render request spent waiting for the render thread in milliseconds.
        drawMs:
          type: number
          description: Time spent drawing, blitting, packing, and flushing the output frame in milliseconds.
        fenceWaitMs:
          type: number
          description: Time spent waiting for the async readback fence in milliseconds.
        mapMs:
          type: number
          description: Time spent mapping the async readback pixel buffer in milliseconds.
        readbackCopyMs:
          type: number
          description: Time spent copying async readback bytes into the backend output frame in milliseconds.
        cachedCopyMs:
          type: number
          description: Time spent copying the cached output frame when async readback is not ready in milliseconds.
        asyncQueueMs:
          type: number
          description: Time spent queueing the next async readback in milliseconds.
        asyncQueueBufferMs:
          type: number
          description: Time spent orphaning or allocating the async readback pixel buffer in milliseconds.
        asyncQueueSetupMs:
          type: number
          description: Time spent applying readback pixel-store, framebuffer, and pixel-pack-buffer state in milliseconds.
        asyncQueueReadPixelsMs:
          type: number
          description: Time spent issuing glReadPixels for the async readback in milliseconds.
        asyncQueueFenceMs:
          type: number
          description: Time spent creating the async readback fence in milliseconds.
        syncReadMs:
          type: number
          description: Time spent in bootstrap synchronous readback in milliseconds.
        asyncReadbackMissCount:
          type: number
          description: Count of output render requests where async readback was not ready.
        cachedFallbackCount:
          type: number
          description: Count of output render requests served from the cached output frame.
        syncFallbackCount:
          type: number
          description: Count of output render requests that used bootstrap synchronous readback.
    BackendPlayoutRecoveryStatus:
      type: object
      properties:
        completionResult:
          type: string
          enum: [Completed, DisplayedLate, Dropped, Flushed, Unknown]
        completedFrameIndex:
          type: number
        scheduledFrameIndex:
          type: number
        scheduledLeadFrames:
          type: number
        measuredLagFrames:
          type: number
        catchUpFrames:
          type: number
        lateStreak:
          type: number
        dropStreak:
          type: number
    RuntimeEventStatus:
      type: object
      properties:
        queue:
          $ref: "#/components/schemas/RuntimeEventQueueStatus"
        dispatch:
          $ref: "#/components/schemas/RuntimeEventDispatchStatus"
    RuntimeEventQueueStatus:
      type: object
      properties:
        name:
          type: string
        depth:
          type: number
        capacity:
          type: number
        droppedCount:
          type: number
        oldestEventAgeMs:
          type: number
    RuntimeEventDispatchStatus:
      type: object
      properties:
        dispatchCallCount:
          type: number
        dispatchedEventCount:
          type: number
        handlerInvocationCount:
          type: number
        handlerFailureCount:
          type: number
        lastDispatchDurationMs:
          type: number
        maxDispatchDurationMs:
          type: number
    ShaderSummary:
      type: object
      properties:
@@ -497,6 +676,8 @@ components:
          description: Error text for unavailable shader packages.
        temporal:
          $ref: "#/components/schemas/TemporalState"
        feedback:
          $ref: "#/components/schemas/FeedbackState"
    TemporalState:
      type: object
      properties:
@@ -509,6 +690,13 @@ components:
          type: number
        effectiveHistoryLength:
          type: number
    FeedbackState:
      type: object
      properties:
        enabled:
          type: boolean
        writePass:
          type: string
    LayerState:
      type: object
      properties: