# Phase 6 Design: Background Persistence

This document expands Phase 6 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.

Phases 1-5 separate durable state, coordination policy, render-facing snapshots, render-thread ownership, and live-state layering. Phase 6 should make disk persistence a background snapshot-writing concern instead of a synchronous side effect of mutations.

## Status

- Phase 6 design package: complete.
- Phase 6 implementation: Step 6 complete.
- Current alignment: `RuntimeStore` owns durable serialization, config, package metadata, preset IO, and persistence request execution; `CommittedLiveState` owns the current committed/session layer state; and `RuntimeCoordinator` publishes typed persistence requests for persisted mutations. Runtime-state persistence is now requested through the coordinator/event path and executed by the background writer.

Current persistence footholds:

- `RuntimeStore` owns persistent runtime-state serialization, stack preset serialization, and durable file IO.
- `CommittedLiveState` owns current committed/session layer and parameter state.
- `RuntimeCoordinatorResult::persistenceRequested` exists as an explicit mutation outcome.
- `RuntimeEventType::RuntimePersistenceRequested` now carries a `PersistenceRequest`.
- `PersistenceRequest` and `PersistenceSnapshot` name the request/snapshot contract that later steps will hand to the writer.
- Phase 5 clarified which live-state mutations are durable, committed-live, transient automation, or render-local. Settled OSC commits are session-only by default and do not request persistence.

## Why Phase 6 Exists

Synchronous persistence is a poor fit for live software. A mutation that changes state should not also have to block on filesystem timing, antivirus scans, slow disks, or transient IO failures.
The app needs persistence to be reliable and observable, but not timing-sensitive. The resilience review calls this out because synchronous save-after-mutate behavior can create unnecessary stalls and makes recovery harder to reason about.

Phase 6 should turn persistence into:

- request
- snapshot
- background write
- completion/failure observation

## Goals

Phase 6 should establish:

- a queued persistence request path
- debounced/coalesced durable-state snapshot writes
- atomic file replacement for runtime-state saves where practical
- structured completion/failure reporting
- clear separation between state mutation and disk flush
- deterministic shutdown flushing policy
- tests for coalescing, snapshot selection, write failure, and shutdown behavior without rendering or DeckLink

## Non-Goals

Phase 6 should not require:

- changing live-state layering rules
- changing DeckLink/backend lifecycle
- replacing stack preset semantics wholesale
- adding cloud sync or external storage
- building an unlimited historical state archive
- making every write async immediately if a narrow compatibility path still needs a synchronous result

## Target Model

Phase 6 should make persistence a small pipeline:

```text
RuntimeCoordinator accepts mutation
  -> publishes/returns persistence request
  -> PersistenceWriter captures a durable snapshot from RuntimeStore serialization
  -> background worker debounces/coalesces writes
  -> atomic write commits file
  -> HealthTelemetry/runtime event records success or failure
```

The key rule is:

- `RuntimeStore` owns durable state and serialization
- `CommittedLiveState` owns current session state; only coordinator-approved durable snapshots should be persisted
- `PersistenceWriter` owns when and how snapshots are written
- `RuntimeCoordinator` owns whether a mutation requests persistence

## Proposed Collaborators

### `PersistenceWriter`

Owns the worker thread, queue, debounce timer, and write execution.
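Before listing responsibilities, a minimal sketch of the coalescing core such a writer needs: pending work keyed by debounce key, where a newer snapshot simply replaces an older one. All names and fields here are illustrative assumptions, not the project's actual API.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Illustrative snapshot shape; the real PersistenceSnapshot fields may differ.
struct Snapshot {
    std::string targetPath;   // destination file
    std::string content;      // serialized runtime state
    unsigned long generation; // a newer generation supersedes an older one
};

// Pending work keyed by debounce key: enqueueing under a key that already
// has pending work replaces it, so only the newest snapshot per key is
// ever written.
class PendingSnapshots {
public:
    void Enqueue(const std::string& debounceKey, Snapshot snapshot) {
        auto it = pending_.find(debounceKey);
        if (it == pending_.end() || it->second.generation < snapshot.generation) {
            pending_[debounceKey] = std::move(snapshot);
        }
    }

    std::size_t PendingCount() const { return pending_.size(); }

    // Returns the pending snapshot for a key, or nullptr if none is queued.
    const Snapshot* Peek(const std::string& debounceKey) const {
        auto it = pending_.find(debounceKey);
        return it == pending_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, Snapshot> pending_;
};
```

A real writer would drain this map from a worker thread after the debounce window; the sketch only shows the newest-wins coalescing rule.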
Responsibilities:

- accept persistence requests
- coalesce repeated runtime-state writes
- request/build a durable snapshot from `RuntimeStore`
- write to a temporary file and atomically replace the target
- report success/failure observations
- flush on shutdown according to policy

Non-responsibilities:

- deciding mutation validity
- owning durable in-memory state
- composing render snapshots
- blocking render/backend timing paths

### `PersistenceSnapshot`

Immutable write input captured from durable state.

Responsibilities:

- contain serialized runtime-state text or structured data ready to serialize
- identify target path and snapshot generation
- preserve enough metadata for completion/failure diagnostics

Non-responsibilities:

- mutation policy
- file IO

### `PersistenceRequest`

Small request object or event payload.

Expected fields:

- reason/action name
- target kind: runtime state, preset, config if later needed
- optional debounce key
- force/flush flag for explicit save operations
- generation or sequence

## Write Policy

### Runtime State

Default policy:

- coalesce repeated requests
- debounce short bursts
- write newest snapshot
- report failures without blocking render/control paths

### Stack Presets

Preset save is more operator-explicit than routine runtime-state persistence.

Initial policy options:

- keep preset save synchronous while runtime-state persistence becomes async
- or route preset writes through the same worker with a completion result for the caller

Conservative Phase 6 default:

- background runtime-state persistence first
- leave preset save/load synchronous unless the implementation has a clean completion path

### Shutdown

Shutdown should explicitly decide:

- flush latest pending snapshot before exit
- skip flush if no pending durable change exists
- report the failure if the flush fails
- avoid indefinite hang on shutdown

## Atomicity And Failure Handling

Runtime-state writes should prefer:

1. serialize snapshot content in memory
2. write to `target.tmp`
3. flush/close file
4. replace target atomically where platform support allows
5. retain or report backup/failure context if replacement fails

Failures should not silently disappear. They should publish:

- persistence target
- reason/action
- error message
- whether a newer request is pending
- whether the app is still running with unsaved changes

## Migration Plan

### Step 1. Name Persistence Requests

Make request types and event payloads explicit enough that callers stop thinking in terms of direct disk writes.

Initial target:

- [x] keep existing coordinator persistence decisions
- [x] introduce a `PersistenceRequest`/`PersistenceSnapshot` shape
- [x] document which requests are debounceable

Current implementation:

- `runtime/persistence/PersistenceRequest.h` defines `PersistenceTargetKind`, `PersistenceRequest`, and `PersistenceSnapshot`.
- `RuntimePersistenceRequestedEvent` carries a typed `PersistenceRequest`.
- `RuntimeCoordinator` emits runtime-state persistence requests with reason, debounce key, and debounce policy.
- Existing synchronous save behavior is intentionally unchanged until Step 2/3.

### Step 2. Extract Snapshot Writing From `RuntimeStore`

Move file-write mechanics behind a helper while keeping serialization ownership in `RuntimeStore`.

Initial target:

- [x] `RuntimeStore` can build serialized runtime-state snapshots
- [x] `PersistenceWriter` writes the snapshot
- [x] existing synchronous save path can call through the writer/helper during transition

Current implementation:

- `RuntimeStore::BuildRuntimeStatePersistenceSnapshot(...)` captures serialized runtime-state content and target path.
- `PersistenceWriter::WriteSnapshot(...)` owns the temp-file and replace write mechanics.
- Runtime-state persistence now flows through `RuntimeStore::RequestPersistence(...)` and the background writer.
- Stack preset saves still use `PersistenceWriter` synchronously; preset async policy remains a later decision.

### Step 3. Add Debounced Background Worker

Introduce a worker thread or queued task owner.

Initial target:

- [x] repeated runtime-state requests coalesce
- [x] worker writes only latest pending snapshot
- [x] tests cover coalescing without filesystem where possible

Current implementation:

- `PersistenceWriter::EnqueueSnapshot(...)` starts a worker lazily and debounces snapshots by `debounceKey`.
- Runtime-state saves enqueue debounced snapshots, so routine mutation paths no longer write the runtime-state file directly.
- Synchronous `PersistenceWriter::WriteSnapshot(...)` remains for stack preset saves.
- `PersistenceWriterTests` use an injected in-memory sink to verify coalescing and non-coalesced immediate requests without touching the filesystem.

### Step 4. Add Atomic Write And Failure Reporting

Make disk writes safer and observable.

Initial target:

- [x] temp-file then replace
- [x] failure returned/published with structured reason
- [x] `HealthTelemetry` receives persistence warning state

Current implementation:

- `PersistenceWriter::WriteSnapshot(...)` and worker writes use temp-file then `MoveFileExA(..., MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH)`.
- `PersistenceWriteResult` reports target kind, target path, reason, success/failure, error message, and whether newer work was pending.
- `RuntimeStore` wires persistence write results into `HealthTelemetry`.
- `HealthTelemetry` records persistence success/failure counts, last target/reason/error, pending-newer-request state, and unsaved-change state.

### Step 5. Wire Coordinator/Event Requests To Writer

Route `RuntimePersistenceRequested` or coordinator persistence outcomes into the writer.
Initial target:

- [x] accepted durable mutations request persistence
- [x] transient-only mutations do not
- [x] runtime reload/preset policies remain explicit

Current implementation:

- Store mutation methods update committed durable/session state and mark render state dirty, but no longer enqueue runtime-state writes directly.
- `RuntimeCoordinator` remains the owner of the persistence decision and publishes `RuntimePersistenceRequested` only for accepted durable mutations.
- `RuntimeUpdateController` handles `RuntimePersistenceRequested` and calls `RuntimeStore::RequestPersistence(...)`.
- `RuntimeStore::RequestPersistence(...)` validates the request target, builds the runtime-state snapshot, enqueues it on `PersistenceWriter`, and records enqueue failures in `HealthTelemetry`.
- Stack preset save remains a synchronous preset-file write; preset load updates state and relies on the coordinator persistence request for runtime-state persistence.

### Step 6. Define Shutdown Flush

Make app shutdown persistence behavior deterministic.

Initial target:

- [x] stop accepting new requests
- [x] flush latest pending snapshot with bounded wait
- [x] report failure if flush fails

Current implementation:

- `PersistenceWriter::StopAndFlush(timeout, error)` stops accepting new snapshots, forces debounced snapshots ready, drains pending work, and reports timeout/failure to the caller.
- `RuntimeStore::FlushPersistenceForShutdown(...)` provides the runtime-level shutdown API and records flush failures in `HealthTelemetry`.
- `OpenGLComposite::Stop()` and the destructor explicitly flush persistence after control services/backend/render-thread shutdown.
- `PersistenceWriterTests` cover shutdown draining, request rejection after shutdown, and timeout/retry behavior without rendering or DeckLink.
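The bounded-wait shutdown contract above can be sketched with a worker draining a queue into an injected sink under a deadline. This is a simplified stand-in, not the project's `PersistenceWriter`: the class name, sink type, and internals are assumptions that only illustrate "reject new work, drain with a timeout, never hang".

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Simplified writer: a worker thread drains queued payloads into a sink.
// StopAndFlush() rejects new work, wakes the worker, and waits a bounded
// time for the queue to drain; a timeout is reported, never an infinite hang.
class MiniWriter {
public:
    explicit MiniWriter(std::function<void(const std::string&)> sink)
        : sink_(std::move(sink)), worker_([this] { Run(); }) {}

    ~MiniWriter() {
        std::string ignored;
        StopAndFlush(std::chrono::seconds(1), ignored);
        if (worker_.joinable()) worker_.join();
    }

    // Returns false once shutdown has begun: new requests are rejected.
    bool Enqueue(std::string payload) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (stopping_) return false;
        queue_.push_back(std::move(payload));
        wake_.notify_all();
        return true;
    }

    // Returns false (with an error message) if draining exceeds the timeout.
    bool StopAndFlush(std::chrono::milliseconds timeout, std::string& error) {
        std::unique_lock<std::mutex> lock(mutex_);
        stopping_ = true;
        wake_.notify_all();
        if (!drained_.wait_for(lock, timeout,
                               [this] { return queue_.empty() && !writing_; })) {
            error = "persistence flush timed out";
            return false;
        }
        return true;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) {        // only reachable when stopping_
                drained_.notify_all();
                return;
            }
            std::string payload = std::move(queue_.front());
            queue_.pop_front();
            writing_ = true;
            lock.unlock();
            sink_(payload);              // the write itself runs unlocked
            lock.lock();
            writing_ = false;
            if (queue_.empty()) drained_.notify_all();
        }
    }

    std::function<void(const std::string&)> sink_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable drained_;
    std::deque<std::string> queue_;
    bool stopping_ = false;
    bool writing_ = false;
    std::thread worker_;  // declared last so all state exists before it starts
};
```

The `writing_` flag matters: a flush must wait not only for an empty queue but also for an in-flight write to finish, otherwise shutdown could race a half-written file.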
## Testing Strategy

Recommended tests:

- repeated persistence requests coalesce into one write
- newest snapshot wins after multiple mutations
- transient-only mutation does not request persistence
- write failure records an error and keeps unsaved state visible
- shutdown flush writes pending snapshot
- shutdown with no pending request does not write
- preset save path remains explicit
- temp-file replacement success/failure is handled

Useful homes:

- `RuntimeSubsystemTests` for coordinator persistence outcomes
- a new `PersistenceWriterTests` target for worker/coalescing/write policy
- filesystem tests using a temporary directory for atomic write behavior

## Risks

### Data Loss Risk

Debouncing introduces a window where in-memory state is newer than disk. Shutdown flush and unsaved-state telemetry are the guardrails.

### Complexity Risk

A persistence worker can become a hidden second store if it owns mutable truth. It should own snapshots and write policy only.

### Blocking Shutdown Risk

Flushing forever on shutdown is not acceptable. Use bounded waits and visible failure reporting.

### Preset Semantics Risk

Operator-triggered preset save often feels like it should complete before reporting success. Keep preset behavior explicit rather than silently changing it.
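The unsaved-state guardrail named under Data Loss Risk is cheap to keep honest. A hypothetical shape for the persistence slice of health telemetry (the real `HealthTelemetry` fields may differ; the point is that failures and the "disk is older than memory" window stay visible):

```cpp
#include <cstdint>
#include <string>

// Hypothetical persistence-telemetry slice. Tracks what the doc requires:
// success/failure counts, last target/reason/error, whether a newer request
// is pending, and whether in-memory state is newer than disk.
struct PersistenceTelemetry {
    std::uint64_t successCount = 0;
    std::uint64_t failureCount = 0;
    std::string lastTarget;
    std::string lastReason;
    std::string lastError;
    bool newerRequestPending = false;  // a fresher snapshot is still queued
    bool unsavedChanges = false;       // in-memory state is newer than disk

    // Every durable request opens the unsaved-changes window.
    void RecordRequest() { unsavedChanges = true; }

    void RecordResult(bool ok, const std::string& target,
                      const std::string& reason, const std::string& error,
                      bool newerPending) {
        if (ok) ++successCount; else ++failureCount;
        lastTarget = target;
        lastReason = reason;
        lastError = ok ? std::string() : error;
        newerRequestPending = newerPending;
        // Only a successful write with nothing newer pending closes the
        // unsaved-changes window.
        unsavedChanges = !(ok && !newerPending);
    }
};
```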
## Phase 6 Exit Criteria

Phase 6 can be considered complete once the project can say:

- [x] durable mutations enqueue persistence instead of directly writing from mutation paths
- [x] runtime-state writes are debounced/coalesced
- [x] writes use temp-file/replace or equivalent atomic policy
- [x] persistence failures are reported through structured health/events
- [x] transient/live-only mutations do not request persistence
- [x] shutdown flush behavior is explicit and tested
- [x] `RuntimeStore` remains durable-state/serialization owner, not worker policy owner
- [x] persistence behavior has focused non-render tests

## Open Questions

- Should preset save remain synchronous, or move behind a completion-based async request?
- What debounce interval is appropriate for routine runtime-state writes?
- Should failed persistence retry automatically, or wait for the next mutation/request?
- Should the app expose "unsaved changes" in the UI/health snapshot?
- Should runtime config writes share this worker, or stay separate?

## Short Version

Phase 6 should make persistence boring, safe, and off the hot path. Mutations update in-memory durable state. Persistence requests are queued and coalesced. A background writer saves atomic snapshots and reports failures. Render, backend callbacks, and control ingress should not pay filesystem costs.
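The "atomic snapshots" in the summary come down to temp-file-then-replace. A portable sketch using `std::filesystem` follows; note the project's actual Windows path uses `MoveFileExA(..., MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH)`, so `rename` here is an illustrative stand-in, and `WriteAtomically` is a hypothetical helper name.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Write `content` to `target` via a sibling temp file so readers never see
// a partially written target file. Errors are reported, never swallowed,
// matching the "failures should not silently disappear" rule.
bool WriteAtomically(const std::filesystem::path& target,
                     const std::string& content, std::string& error) {
    const std::filesystem::path tmp = target.string() + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) { error = "cannot open temp file"; return false; }
        out << content;
        out.flush();
        if (!out) { error = "temp file write failed"; return false; }
    }  // stream closed before the replace step
    std::error_code ec;
    std::filesystem::rename(tmp, target, ec);  // atomic replace on POSIX
    if (ec) {
        error = "replace failed: " + ec.message();
        std::filesystem::remove(tmp, ec);  // best-effort temp cleanup
        return false;
    }
    return true;
}
```

A repeated call simply replaces the previous target, which is exactly the "newest snapshot wins" behavior the write policy asks for.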