
# Phase 6 Design: Background Persistence

This document expands Phase 6 of [ARCHITECTURE_RESILIENCE_REVIEW.md](/c:/Users/Aiden/Documents/GitHub/video-shader-toys/docs/ARCHITECTURE_RESILIENCE_REVIEW.md) into a concrete design target.

Phases 1-5 separate durable state, coordination policy, render-facing snapshots, render-thread ownership, and live-state layering. Phase 6 should make disk persistence a background snapshot-writing concern instead of a synchronous side effect of mutations.

## Status

- Phase 6 design package: proposed.
- Phase 6 implementation: Step 4 complete.
- Current alignment: `RuntimeStore` owns durable serialization, config, package metadata, preset IO, and persistence requests; `CommittedLiveState` owns the current committed/session layer state; and `RuntimeCoordinator` publishes typed persistence requests for persisted mutations. With Steps 1-4 in place, runtime-state saves are enqueued, debounced, and written through temp-file replacement, and write results reach `HealthTelemetry`. The remaining work is routing coordinator/event persistence requests into the writer (Step 5) and defining a deterministic shutdown flush (Step 6).

Current persistence footholds:

- `RuntimeStore` owns persistent runtime-state serialization, stack preset serialization, and durable file IO.
- `CommittedLiveState` owns current committed/session layer and parameter state.
- `RuntimeCoordinatorResult::persistenceRequested` exists as an explicit mutation outcome.
- `RuntimeEventType::RuntimePersistenceRequested` now carries a `PersistenceRequest`.
- `PersistenceRequest` and `PersistenceSnapshot` name the request/snapshot contract that later steps will hand to the writer.
- Phase 5 clarified which live-state mutations are durable, committed-live, transient automation, or render-local. Settled OSC commits are session-only by default and do not request persistence.

## Why Phase 6 Exists

Synchronous persistence is a poor fit for live software. A mutation that changes state should not also have to block on filesystem timing, antivirus scans, slow disks, or transient IO failures. The app needs persistence to be reliable and observable, but not timing-sensitive.

The resilience review calls this out because `SavePersistentState()`-style behavior can create unnecessary stalls and make recovery harder to reason about.

Phase 6 should turn persistence into:

- request
- snapshot
- background write
- completion/failure observation

## Goals

Phase 6 should establish:

- a queued persistence request path
- debounced/coalesced durable-state snapshot writes
- atomic file replacement for runtime-state saves where practical
- structured completion/failure reporting
- clear separation between state mutation and disk flush
- deterministic shutdown flushing policy
- tests for coalescing, snapshot selection, write failure, and shutdown behavior without rendering or DeckLink

## Non-Goals

Phase 6 should not require:

- changing live-state layering rules
- changing DeckLink/backend lifecycle
- replacing stack preset semantics wholesale
- adding cloud sync or external storage
- building an unlimited historical state archive
- making every write async immediately if a narrow compatibility path still needs a synchronous result

## Target Model

Phase 6 should make persistence a small pipeline:

```text
RuntimeCoordinator accepts mutation
  -> publishes/returns persistence request
  -> PersistenceWriter captures a durable snapshot from RuntimeStore serialization
  -> background worker debounces/coalesces writes
  -> atomic write commits file
  -> HealthTelemetry/runtime event records success or failure
```

The key ownership rules are:

- `RuntimeStore` owns durable state and serialization
- `CommittedLiveState` owns current session state; only coordinator-approved durable snapshots should be persisted
- `PersistenceWriter` owns when and how snapshots are written
- `RuntimeCoordinator` owns whether a mutation requests persistence

## Proposed Collaborators

### `PersistenceWriter`

Owns the worker thread, queue, debounce timer, and write execution.

Responsibilities:

- accept persistence requests
- coalesce repeated runtime-state writes
- request/build a durable snapshot from `RuntimeStore`
- write to a temporary file and atomically replace the target
- report success/failure observations
- flush on shutdown according to policy

Non-responsibilities:

- deciding mutation validity
- owning durable in-memory state
- composing render snapshots
- blocking render/backend timing paths

### `PersistenceSnapshot`

Immutable write input captured from durable state.

Responsibilities:

- contain serialized runtime-state text or structured data ready to serialize
- identify target path and snapshot generation
- preserve enough metadata for completion/failure diagnostics

Non-responsibilities:

- mutation policy
- file IO
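
As a rough illustration of the snapshot contract above, the sketch below models an immutable write input. All names here (`SnapshotSketch`, `SnapshotTargetKind`, `IsNewer`) are hypothetical; the real `PersistenceSnapshot` in `runtime/persistence/PersistenceRequest.h` may be shaped differently.

```cpp
#include <cstdint>
#include <string>

// Hypothetical target kinds; the project's PersistenceTargetKind may differ.
enum class SnapshotTargetKind { RuntimeState, StackPreset };

// Illustrative immutable write input: every field is const, so the writer
// can read the snapshot but cannot mutate it after capture.
struct SnapshotSketch {
    const SnapshotTargetKind targetKind;
    const std::string targetPath;    // where the writer should commit the file
    const std::string contents;      // serialized text captured at request time
    const std::uint64_t generation;  // increases with every capture
    const std::string reason;        // diagnostic context for result reporting
};

// Returns true when `candidate` should replace `current` as the pending
// write: same target file and a strictly higher generation.
inline bool IsNewer(const SnapshotSketch& candidate, const SnapshotSketch& current) {
    return candidate.targetPath == current.targetPath &&
           candidate.generation > current.generation;
}
```

Keeping the fields const means the snapshot carries no mutation policy and no IO; it is only data handed to the writer.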

### `PersistenceRequest`

Small request object or event payload.

Expected fields:

- reason/action name
- target kind: runtime state, preset, config if later needed
- optional debounce key
- force/flush flag for explicit save operations
- generation or sequence
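
The expected fields could be laid out roughly as follows. This is a sketch under assumptions, not the shipped type: `RequestSketch`, `TargetKindSketch`, and `CanCoalesce` are hypothetical names, and an empty `debounceKey` is assumed to mean "never coalesce".

```cpp
#include <cstdint>
#include <string>

// Hypothetical mirror of the target kinds listed above.
enum class TargetKindSketch { RuntimeState, StackPreset, Config };

struct RequestSketch {
    std::string reason;                  // reason/action name
    TargetKindSketch targetKind = TargetKindSketch::RuntimeState;
    std::string debounceKey;             // empty is assumed to mean "do not coalesce"
    bool flushImmediately = false;       // force/flush flag for explicit saves
    std::uint64_t sequence = 0;          // generation or sequence
};

// Two requests may coalesce only when neither forces a flush, both carry a
// debounce key, and they agree on target kind and key.
inline bool CanCoalesce(const RequestSketch& a, const RequestSketch& b) {
    if (a.flushImmediately || b.flushImmediately) return false;
    if (a.debounceKey.empty() || b.debounceKey.empty()) return false;
    return a.targetKind == b.targetKind && a.debounceKey == b.debounceKey;
}
```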

## Write Policy

### Runtime State

Default policy:

- coalesce repeated requests
- debounce short bursts
- write newest snapshot
- report failures without blocking render/control paths
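
One way to picture "coalesce and write newest" is a pending map keyed by debounce key, where a later request overwrites the earlier one. The class below is a single-threaded sketch with hypothetical names (`CoalescingQueueSketch`, `PendingWrite`); it leaves out the timer and the worker thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct PendingWrite {
    std::string contents;
    std::uint64_t generation = 0;
};

class CoalescingQueueSketch {
public:
    // Replaces any pending write for the same key so only the newest survives.
    void Enqueue(const std::string& debounceKey, const std::string& contents) {
        PendingWrite& slot = pending_[debounceKey];
        slot.contents = contents;
        slot.generation = ++nextGeneration_;
    }

    // Hands every pending write to the caller and clears the queue.
    std::vector<PendingWrite> Drain() {
        std::vector<PendingWrite> out;
        for (auto& entry : pending_) out.push_back(std::move(entry.second));
        pending_.clear();
        return out;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    std::map<std::string, PendingWrite> pending_;
    std::uint64_t nextGeneration_ = 0;
};
```

Three rapid `Enqueue` calls with the same key leave one pending entry holding the last contents, which is the behavior the policy list asks for.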

### Stack Presets

Preset save is more operator-explicit than routine runtime-state persistence.

Initial policy options:

- keep preset save synchronous while runtime-state persistence becomes async
- or route preset writes through the same worker with a completion result for the caller

Conservative Phase 6 default:

- background runtime-state persistence first
- leave preset save/load synchronous unless the implementation has a clean completion path

### Shutdown

Shutdown should explicitly decide:

- flush latest pending snapshot before exit
- skip flush if no pending durable change exists
- report the write failure if the flush fails
- avoid indefinite hang on shutdown
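
The four shutdown decisions above map naturally onto a bounded wait. The sketch below is one possible shape, assuming a hypothetical `writePending` callback that performs the final write and returns true on success; `FlushOnShutdown` and `ShutdownFlushOutcome` are illustrative names, not existing project API.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>

enum class ShutdownFlushOutcome { NothingPending, Flushed, WriteFailed, TimedOut };

// Runs `writePending` on a helper thread and waits at most `timeout` for it.
inline ShutdownFlushOutcome FlushOnShutdown(bool hasPending,
                                            std::function<bool()> writePending,
                                            std::chrono::milliseconds timeout) {
    if (!hasPending) return ShutdownFlushOutcome::NothingPending;

    struct State {
        std::mutex mutex;
        std::condition_variable done;
        bool finished = false;
        bool ok = false;
    };
    auto state = std::make_shared<State>();

    // Detached so a stuck write cannot hold the process open; the shared
    // state stays alive even if the caller gives up first.
    std::thread([state, writePending]() {
        const bool ok = writePending();
        {
            std::lock_guard<std::mutex> lock(state->mutex);
            state->ok = ok;
            state->finished = true;
        }
        state->done.notify_all();
    }).detach();

    std::unique_lock<std::mutex> lock(state->mutex);
    if (!state->done.wait_for(lock, timeout, [&state] { return state->finished; }))
        return ShutdownFlushOutcome::TimedOut;
    return state->ok ? ShutdownFlushOutcome::Flushed : ShutdownFlushOutcome::WriteFailed;
}
```

A real writer would more likely reuse its existing worker thread than spawn a new one; the point here is only that every exit path yields a reportable outcome and none waits forever.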

## Atomicity And Failure Handling

Runtime-state writes should prefer:

1. serialize snapshot content in memory
2. write to `target.tmp`
3. flush/close file
4. replace target atomically where platform support allows
5. retain or report backup/failure context if replacement fails
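
The steps above can be sketched portably with `std::filesystem` (C++17). This swaps in `std::filesystem::rename` purely for illustration and is not the project's Windows-specific write path; `WriteFileViaTemp` is a hypothetical helper name. Note that `std::ofstream::flush` hands data to the OS but does not by itself guarantee it has reached the disk.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Writes `contents` to `<target>.tmp`, then renames it over `target`.
// Returns an empty string on success, or an error message on failure.
inline std::string WriteFileViaTemp(const std::filesystem::path& target,
                                    const std::string& contents) {
    std::filesystem::path temp = target;
    temp += ".tmp";

    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        if (!out) return "could not open temp file: " + temp.string();
        out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
        out.flush();
        if (!out) return "could not write temp file: " + temp.string();
    }  // the stream is closed here, before the rename

    std::error_code ec;
    std::filesystem::rename(temp, target, ec);
    if (ec) {
        std::error_code ignored;
        std::filesystem::remove(temp, ignored);  // best-effort cleanup
        return "could not replace target: " + ec.message();
    }
    return std::string();
}
```

Because the target is only touched by the final rename, a failure while writing the temp file leaves the previous target contents in place.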

Failures should not silently disappear. They should publish:

- persistence target
- reason/action
- error message
- whether a newer request is pending
- whether the app is still running with unsaved changes
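
A structured failure record might carry those five items as plain fields. The struct and formatter below are a hypothetical sketch (`WriteResultSketch`, `DescribeResult`), not the project's `PersistenceWriteResult`.

```cpp
#include <string>

struct WriteResultSketch {
    std::string targetPath;            // persistence target
    std::string reason;                // reason/action
    bool succeeded = false;
    std::string errorMessage;          // empty on success
    bool newerRequestPending = false;  // a later request is already queued
    bool unsavedChanges = false;       // memory is still newer than disk
};

// Renders a result as one log/event line so failures stay visible.
inline std::string DescribeResult(const WriteResultSketch& r) {
    if (r.succeeded) return "persistence ok: " + r.targetPath + " (" + r.reason + ")";
    std::string text = "persistence FAILED: " + r.targetPath + " (" + r.reason + "): " + r.errorMessage;
    if (r.newerRequestPending) text += " [newer request pending]";
    if (r.unsavedChanges) text += " [unsaved changes]";
    return text;
}
```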

## Migration Plan

### Step 1. Name Persistence Requests

Make request types and event payloads explicit enough that callers stop thinking in terms of direct disk writes.

Initial target:

- [x] keep existing coordinator persistence decisions
- [x] introduce a `PersistenceRequest`/`PersistenceSnapshot` shape
- [x] document which requests are debounceable

Current implementation:

- `runtime/persistence/PersistenceRequest.h` defines `PersistenceTargetKind`, `PersistenceRequest`, and `PersistenceSnapshot`.
- `RuntimePersistenceRequestedEvent` carries a typed `PersistenceRequest`.
- `RuntimeCoordinator` emits runtime-state persistence requests with reason, debounce key, and debounce policy.
- Existing synchronous save behavior is intentionally unchanged until Step 2/3.

### Step 2. Extract Snapshot Writing From `RuntimeStore`

Move file-write mechanics behind a helper while keeping serialization ownership in `RuntimeStore`.

Initial target:

- [x] `RuntimeStore` can build serialized runtime-state snapshots
- [x] `PersistenceWriter` writes the snapshot
- [x] existing synchronous save path can call through the writer/helper during transition

Current implementation:

- `RuntimeStore::BuildRuntimeStatePersistenceSnapshot(...)` captures serialized runtime-state content and target path.
- `PersistenceWriter::WriteSnapshot(...)` owns the temp-file and replace write mechanics.
- `RuntimeStore::SavePersistentState(...)` still behaves synchronously, but now writes through `PersistenceWriter`.
- Stack preset saves also use `PersistenceWriter` synchronously; preset async policy remains a later decision.

### Step 3. Add Debounced Background Worker

Introduce a worker thread or queued task owner.

Initial target:

- [x] repeated runtime-state requests coalesce
- [x] worker writes only latest pending snapshot
- [x] tests cover coalescing without filesystem where possible

Current implementation:

- `PersistenceWriter::EnqueueSnapshot(...)` starts a worker lazily and debounces snapshots by `debounceKey`.
- Runtime-state saves enqueue debounced snapshots, so routine mutation paths no longer write the runtime-state file directly.
- Synchronous `PersistenceWriter::WriteSnapshot(...)` remains for stack preset saves and transitional direct writes.
- `PersistenceWriterTests` use an injected in-memory sink to verify coalescing and non-coalesced immediate requests without touching the filesystem.
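
To make the injected-sink idea concrete, here is a minimal debounced worker that can be tested without touching the filesystem. It is a sketch with hypothetical names (`DebouncedWriterSketch`, `Sink`, `StopAndFlush`) and a single pending slot rather than per-key debouncing, so it should not be read as the actual `PersistenceWriter`.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

class DebouncedWriterSketch {
public:
    using Sink = std::function<void(const std::string&)>;

    DebouncedWriterSketch(Sink sink, std::chrono::milliseconds debounce)
        : sink_(std::move(sink)), debounce_(debounce), worker_([this] { Run(); }) {}

    ~DebouncedWriterSketch() { StopAndFlush(); }

    // Replaces the pending contents and restarts the quiet period.
    void Enqueue(std::string contents) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_ = std::move(contents);
            hasPending_ = true;
            deadline_ = std::chrono::steady_clock::now() + debounce_;
        }
        wake_.notify_all();
    }

    // Writes whatever is still pending, then joins the worker.
    void StopAndFlush() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        if (worker_.joinable()) worker_.join();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return hasPending_ || stopping_; });
            if (!hasPending_) return;  // stopping with nothing left to write
            // Wait out the quiet period; each Enqueue pushes deadline_ later.
            while (!stopping_ && std::chrono::steady_clock::now() < deadline_) {
                const auto until = deadline_;
                wake_.wait_until(lock, until);
            }
            std::string contents = std::move(pending_);
            hasPending_ = false;
            lock.unlock();
            sink_(contents);  // the sink runs without holding the lock
            lock.lock();
        }
    }

    Sink sink_;
    std::chrono::milliseconds debounce_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::string pending_;
    bool hasPending_ = false;
    bool stopping_ = false;
    std::chrono::steady_clock::time_point deadline_;
    std::thread worker_;  // declared last so it starts after the other members
};
```

A test can pass a lambda that records writes in memory, enqueue several values with a long debounce interval, call `StopAndFlush()`, and then check that the sink saw exactly one write containing the newest value.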

### Step 4. Add Atomic Write And Failure Reporting

Make disk writes safer and observable.

Initial target:

- [x] temp-file then replace
- [x] failure returned/published with structured reason
- [x] `HealthTelemetry` receives persistence warning state

Current implementation:

- `PersistenceWriter::WriteSnapshot(...)` and worker writes use temp-file then `MoveFileExA(..., MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH)`.
- `PersistenceWriteResult` reports target kind, target path, reason, success/failure, error message, and whether newer work was pending.
- `RuntimeStore` wires persistence write results into `HealthTelemetry`.
- `HealthTelemetry` records persistence success/failure counts, last target/reason/error, pending-newer-request state, and unsaved-change state.
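
The telemetry side can be pictured as a small accumulator fed by write results. The struct below is a hypothetical sketch (`PersistenceHealthSketch`), and its rule for the unsaved-change flag (set on failure, cleared by a success with nothing newer queued) is an assumption rather than documented `HealthTelemetry` behavior.

```cpp
#include <cstdint>
#include <string>

struct PersistenceHealthSketch {
    std::uint64_t successCount = 0;
    std::uint64_t failureCount = 0;
    std::string lastError;
    bool newerRequestPending = false;
    bool unsavedChanges = false;

    // Folds one write result into the running health state.
    void Record(bool succeeded, const std::string& error, bool newerPending) {
        newerRequestPending = newerPending;
        if (succeeded) {
            ++successCount;
            lastError.clear();
            // Disk matches memory only if nothing newer is still queued.
            unsavedChanges = newerPending;
        } else {
            ++failureCount;
            lastError = error;
            unsavedChanges = true;
        }
    }
};
```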

### Step 5. Wire Coordinator/Event Requests To Writer

Route `RuntimePersistenceRequested` or coordinator persistence outcomes into the writer.

Initial target:

- accepted durable mutations request persistence
- transient-only mutations do not
- runtime reload/preset policies remain explicit
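
The routing rule for this step can be sketched as a pure decision plus a forwarding call. The mutation classes follow the Phase 5 categories named earlier in this document; the type and function names (`MutationClassSketch`, `DecideOutcome`, `RouteToWriter`) and the `enqueue` callback are hypothetical.

```cpp
#include <functional>

// Mirrors the Phase 5 categories: only Durable mutations reach disk.
enum class MutationClassSketch { Durable, CommittedLive, TransientAutomation, RenderLocal };

struct CoordinatorOutcomeSketch {
    bool accepted = false;
    bool persistenceRequested = false;
};

// A mutation requests persistence only if it is both accepted and durable.
inline CoordinatorOutcomeSketch DecideOutcome(MutationClassSketch kind, bool valid) {
    CoordinatorOutcomeSketch outcome;
    outcome.accepted = valid;
    outcome.persistenceRequested = valid && kind == MutationClassSketch::Durable;
    return outcome;
}

// Forwards to the writer's enqueue path only when persistence was requested.
// Returns true if the request was forwarded.
inline bool RouteToWriter(const CoordinatorOutcomeSketch& outcome,
                          const std::function<void()>& enqueue) {
    if (!outcome.persistenceRequested) return false;
    enqueue();
    return true;
}
```

Keeping the decision separate from the forwarding call lets coordinator tests assert on outcomes without constructing a writer.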

### Step 6. Define Shutdown Flush

Make app shutdown persistence behavior deterministic.

Initial target:

- stop accepting new requests
- flush latest pending snapshot with bounded wait
- report failure if flush fails

## Testing Strategy

Recommended tests:

- repeated persistence requests coalesce into one write
- newest snapshot wins after multiple mutations
- transient-only mutation does not request persistence
- write failure records an error and keeps unsaved state visible
- shutdown flush writes pending snapshot
- shutdown with no pending request does not write
- preset save path remains explicit
- temp-file replacement success/failure is handled

Useful homes:

- `RuntimeSubsystemTests` for coordinator persistence outcomes
- the `PersistenceWriterTests` target (introduced in Step 3) for worker/coalescing/write policy
- filesystem tests using a temporary directory for atomic write behavior

## Risks

### Data Loss Risk

Debouncing introduces a window where in-memory state is newer than disk. Shutdown flush and unsaved-state telemetry are the guardrails.

### Complexity Risk

A persistence worker can become a hidden second store if it owns mutable truth. It should own snapshots and write policy only.

### Blocking Shutdown Risk

Flushing forever on shutdown is not acceptable. Use bounded waits and visible failure reporting.

### Preset Semantics Risk

Operator-triggered preset save often feels like it should complete before reporting success. Keep preset behavior explicit rather than silently changing it.

## Phase 6 Exit Criteria

Phase 6 can be considered complete once the project can say:

- [ ] durable mutations enqueue persistence instead of directly writing from mutation paths
- [ ] runtime-state writes are debounced/coalesced
- [ ] writes use temp-file/replace or equivalent atomic policy
- [ ] persistence failures are reported through structured health/events
- [ ] transient/live-only mutations do not request persistence
- [ ] shutdown flush behavior is explicit and tested
- [ ] `RuntimeStore` remains durable-state/serialization owner, not worker policy owner
- [ ] persistence behavior has focused non-render tests

## Open Questions

- Should preset save remain synchronous, or move behind a completion-based async request?
- What debounce interval is appropriate for routine runtime-state writes?
- Should failed persistence retry automatically, or wait for the next mutation/request?
- Should the app expose "unsaved changes" in the UI/health snapshot?
- Should runtime config writes share this worker, or stay separate?

## Short Version

Phase 6 should make persistence boring, safe, and off the hot path.

Mutations update in-memory durable state. Persistence requests are queued and coalesced. A background writer saves atomic snapshots and reports failures. Render, backend callbacks, and control ingress should not pay filesystem costs.