feat: 24/7 Reliability — stall watchdog + crash-loop guard + cache budget (Epic 2, #7)#13

Merged
tcconnally merged 1 commit into main from feat/epic2-reliability
Jul 2, 2026
@tcconnally

Owner

Implements Epic 2: 24/7 Reliability & Self-Healing (closes #7) — the highest operational-value chunk of v10.

Stacked on #12 (Epic 1). Base branch is feat/epic1-identity-unification so this diff shows only Epic 2. Merge #12 first, then this retargets to main cleanly.

Why

The retry/escalation chain only fired on EOF or explicit error. A silent mid-stream freeze (frozen frame, wedged decoder, network hang that survives reconnect=1) emits neither — the cell would sit dead forever. For an always-on wall that's the #1 reliability gap.

What changed

New hyperwall/reliability.py — pure, dependency-free decision logic (is_stalled, count_recent, should_park, scale_demuxer_mb) so the self-healing behavior is unit-testable without PyQt/mpv/Emby.

Stall watchdog (cell.py) — the time-pos observer records forward-progress timestamps; a QTimer (WATCHDOG_INTERVAL_MS) flags a cell whose position hasn't advanced for STALL_TIMEOUT_S while actively playing (paused/seeking exempt), then runs the existing retry/escalation chain — no new failure semantics.
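
The stall check described above can be sketched as a pure function plus a small progress tracker. This is a minimal illustration, not the code in reliability.py: the ProgressTracker class, the parameter names, and the choice to treat "exactly at the timeout" as stalled are all assumptions.

```python
from typing import Optional


class ProgressTracker:
    """Hypothetical helper: remembers when the playback position last advanced."""

    def __init__(self) -> None:
        self.last_pos: Optional[float] = None
        self.last_progress_ts: Optional[float] = None

    def on_time_pos(self, pos: Optional[float], now: float) -> None:
        # The position may be unavailable (None) while nothing is loaded.
        if pos is None:
            return
        # Only a strictly larger position counts as forward progress.
        if self.last_pos is None or pos > self.last_pos:
            self.last_progress_ts = now
        self.last_pos = pos


def is_stalled(
    last_progress_ts: Optional[float],
    now: float,
    timeout_s: float,
    *,
    paused: bool = False,
    seeking: bool = False,
) -> bool:
    """True when an actively playing cell has not advanced for timeout_s seconds.

    Assumed signature; the real is_stalled may take different arguments.
    """
    if paused or seeking:
        return False  # exempt states: no progress is expected here
    if last_progress_ts is None:
        return False  # playback has not produced a position yet
    return (now - last_progress_ts) >= timeout_s
```

In this sketch the timer callback would call is_stalled(tracker.last_progress_ts, time.monotonic(), STALL_TIMEOUT_S, ...) and hand a True result to the existing retry/escalation chain. Passing `now` in explicitly keeps the function testable without a real clock.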

Crash-loop guard (cell.py) — failures are timestamped; CRASH_LOOP_THRESHOLD within CRASH_LOOP_WINDOW_S parks the cell on a "media unavailable" card instead of hammering Emby, auto-resuming after CRASH_LOOP_COOLDOWN_S.
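
A rolling-window guard of this kind might look like the sketch below. The function names count_recent and should_park come from this PR, but their signatures here are guesses, and can_resume is a hypothetical helper added only to illustrate the cooldown.

```python
from typing import Iterable


def count_recent(timestamps: Iterable[float], now: float, window_s: float) -> int:
    """Count failures that happened within the last window_s seconds."""
    return sum(1 for t in timestamps if 0 <= now - t <= window_s)


def should_park(
    timestamps: Iterable[float], now: float, threshold: int, window_s: float
) -> bool:
    """True when at least `threshold` failures fall inside the rolling window."""
    return count_recent(timestamps, now, window_s) >= threshold


def can_resume(parked_at: float, now: float, cooldown_s: float) -> bool:
    """Hypothetical helper: True once the cooldown has elapsed since parking."""
    return now - parked_at >= cooldown_s
```

Because old failures age out of the window, a cell that fails occasionally over hours never parks; only a burst of failures does.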

_switching leak fix (cell.py) — the guard set in play() was only cleared on the first eof; the error branch returned without clearing it, so a later genuine EOF could be swallowed. Now cleared on both paths.
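
The shape of the leak can be illustrated with a toy model. Everything here is an assumption made for illustration (the class name, the string reasons, the log list); only the _switching flag, play(), and the idea of clearing the flag on both paths come from this PR.

```python
class CellSketch:
    """Toy stand-in for a cell; not the real class in cell.py."""

    def __init__(self) -> None:
        self._switching = False
        self.log: list[str] = []

    def play(self, url: str) -> None:
        # Guard: the next end-of-file belongs to the stream being replaced.
        self._switching = True
        self.log.append(f"play {url}")

    def _handle_eof(self, reason: str) -> None:
        if reason == "error":
            # The fix: clear the guard here too. Without this line the flag
            # stays set and the next genuine EOF is swallowed below.
            self._switching = False
            self.log.append("retry")
            return
        if self._switching:
            self._switching = False
            self.log.append("ignored switch eof")
            return
        self.log.append("advance")
```

With the flag cleared on the error branch, play() followed by an error and then a normal EOF yields "retry" and then "advance" rather than an ignored EOF.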

Memory-aware cache budget (constants.py + wall.py): apply_cache_budget() scales per-cell demuxer_max_bytes so the grid total stays under CACHE_BUDGET_MB. A 6×6 grid (36 cells) at 512 MiB/cell would otherwise reach 18 GiB; under the 3072 MiB default, each cell now gets ~85 MiB. The controller assigns budgeted opts to every cell once the grid size is known.
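
The arithmetic behind the budget can be sketched as follows. The name scale_demuxer_mb is from this PR, but the signature, the integer division, and the 16 MiB floor are assumptions chosen for illustration.

```python
def scale_demuxer_mb(
    requested_mb: int, cells: int, budget_mb: int, floor_mb: int = 16
) -> int:
    """Per-cell demuxer cache in MiB so that cells * result fits the budget.

    floor_mb is a hypothetical lower bound; the real floor may differ.
    """
    if cells <= 0:
        return requested_mb  # no grid yet: leave the requested size untouched
    share = budget_mb // cells  # equal split of the total budget
    # Never exceed what was requested, and never drop below the floor.
    return max(floor_mb, min(requested_mb, share))


print(scale_demuxer_mb(512, 36, 3072))  # 6x6 grid, default budget -> 85
print(scale_demuxer_mb(512, 16, 1024))  # 16 cells, 1024 MiB budget -> 64
```

Note that a floor trades the budget guarantee for usable playback: on a very large grid the floor can push the total above the budget.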

All tunables are env-overridable (clamped) and documented in the README.
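
One way to read "env-overridable (clamped)" is a helper like the one below. It is a sketch under assumptions: env_int is a hypothetical name, and the bounds shown are invented for the example; only the HYPERWALL_CACHE_BUDGET_MB variable and the 3072 MiB default appear in this PR.

```python
import os
from typing import Mapping, Optional


def env_int(
    name: str,
    default: int,
    lo: int,
    hi: int,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Read an integer tunable from the environment, clamped to [lo, hi]."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default  # unparsable override: fall back to the default
    return max(lo, min(hi, value))


# Hypothetical bounds; the real clamp range lives in constants.py.
CACHE_BUDGET_MB = env_int("HYPERWALL_CACHE_BUDGET_MB", 3072, 256, 65536)
```

Clamping means a typo such as an extra zero degrades to the nearest sane value instead of starving or overcommitting the wall.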

Validation (Linux)

  • 16/16 new unit tests pass (tests/test_reliability.py): stall threshold boundaries, paused/seeking exemptions, rolling-window crash-loop aging, budget scaling + floor + zero-cell safety.
  • Repo guards still 8/8 (drift guard clean on all changed modules).
  • py_compile clean; env overrides verified at runtime (e.g. HYPERWALL_CACHE_BUDGET_MB=1024, 16 cells → 64 MiB/cell).
  • CI workflow updated to run the reliability suite.

⚠️ Still needs a Windows/live pass on skyhawk

Per the project guardrail (don't trust a fix from inspection alone), the runtime behaviors need a real check:

  1. Induce a stall (pause the Emby stream / pull the network) and confirm the watchdog fires after ~20s and the cell recovers.
  2. Point at a dead item repeatedly and confirm the cell parks + shows the card + resumes after cooldown.
  3. Sanity-check RAM on a large grid (e.g. 4×4+) reflects the budget.
  4. static=true DIRECT path is untouched — but smoke-test playback anyway.

Milestone: v10 · roadmap #5.

…et (Epic 2, #7)

Adds self-healing for an always-on wall, plus the _switching error-path leak
fix. Pure decision logic lives in a new dependency-free reliability.py so it's
unit-testable without PyQt/mpv/Emby.

Stall watchdog (cell.py):
- time-pos observer records forward-progress timestamps; a QTimer
  (WATCHDOG_INTERVAL_MS) flags a cell whose position hasn't advanced for
  STALL_TIMEOUT_S while actively playing, then runs the existing
  retry/escalation chain. Silent freezes (frozen frame / wedged decoder /
  network hang surviving reconnect) previously emitted no end-file or error
  and would hang forever.

Crash-loop guard (cell.py):
- failures are timestamped; CRASH_LOOP_THRESHOLD within CRASH_LOOP_WINDOW_S
  parks the cell on a "media unavailable" card and stops hammering Emby,
  auto-resuming after CRASH_LOOP_COOLDOWN_S.

_switching leak fix (cell.py):
- the guard set in play() is now cleared on the error branch of _handle_eof
  too (was only cleared on the first eof), so a later genuine EOF can't be
  swallowed.

Memory-aware cache budget (constants.py + wall.py):
- apply_cache_budget() scales per-cell demuxer_max_bytes so the grid total
  stays under CACHE_BUDGET_MB (a 6x6 grid at 512 MiB/cell would reach ~18 GB);
  controller assigns budgeted opts to every cell once grid size is known.

Tunables (all env-overridable, clamped) in constants.py; documented in README.

Tests: tests/test_reliability.py — 16 pure unit tests (stall boundaries,
paused/seeking exemptions, rolling-window crash-loop, budget scaling + floor +
zero-cell safety). Wired into the CI workflow. Repo guards still 8/8.

Refs #7
@tcconnally tcconnally added this to the v10 milestone Jul 2, 2026
@tcconnally tcconnally deleted the branch main July 2, 2026 14:21
@tcconnally tcconnally closed this Jul 2, 2026
@tcconnally tcconnally reopened this Jul 2, 2026
@tcconnally tcconnally changed the base branch from feat/epic1-identity-unification to main July 2, 2026 14:22
@tcconnally tcconnally merged commit 8233a12 into main Jul 2, 2026
2 checks passed
@tcconnally tcconnally deleted the feat/epic2-reliability branch July 2, 2026 14:24
Development

Successfully merging this pull request may close these issues.

Epic 2: 24/7 Reliability & Self-Healing (stall watchdog + fixes)

2 participants