feat: 24/7 Reliability — stall watchdog + crash-loop guard + cache budget (Epic 2, #7) by tcconnally · Pull Request #13 · tcconnally/hyperwall

tcconnally · 2026-07-02T13:36:13Z

Implements Epic 2: 24/7 Reliability & Self-Healing (closes #7) — the highest operational-value chunk of v10.

Stacked on #12 (Epic 1). Base branch is feat/epic1-identity-unification so this diff shows only Epic 2. Merge #12 first, then this retargets to main cleanly.

Why

The retry/escalation chain only fired on EOF or explicit error. A silent mid-stream freeze (frozen frame, wedged decoder, network hang that survives reconnect=1) emits neither — the cell would sit dead forever. For an always-on wall that's the #1 reliability gap.

What changed

New hyperwall/reliability.py — pure, dependency-free decision logic (is_stalled, count_recent, should_park, scale_demuxer_mb) so the self-healing behavior is unit-testable without PyQt/mpv/Emby.

Stall watchdog (cell.py) — the time-pos observer records forward-progress timestamps; a QTimer (WATCHDOG_INTERVAL_MS) flags a cell whose position hasn't advanced for STALL_TIMEOUT_S while actively playing (paused/seeking exempt), then runs the existing retry/escalation chain — no new failure semantics.
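A minimal sketch of the stall decision, for reviewers who want the rule in one place. The name `is_stalled` comes from this PR, but the signature, the 20-second default, the inclusive boundary, and the "no progress seen yet" case are assumptions, not the merged code.

```python
from typing import Optional

STALL_TIMEOUT_S = 20.0  # assumed default; the PR text only mentions "~20s"


def is_stalled(
    now: float,
    last_progress: Optional[float],
    *,
    paused: bool = False,
    seeking: bool = False,
    timeout_s: float = STALL_TIMEOUT_S,
) -> bool:
    """Return True when playback should have advanced but has not.

    `now` and `last_progress` are timestamps in seconds from the same clock
    (for example time.monotonic()). Hypothetical signature.
    """
    if paused or seeking:
        # Position is expected to stand still in these states.
        return False
    if last_progress is None:
        # Assumption: no progress recorded yet means "still starting up".
        return False
    return (now - last_progress) >= timeout_s


print(is_stalled(100.0, 80.0))               # 20 s without progress
print(is_stalled(100.0, 81.0))               # 19 s without progress
print(is_stalled(100.0, 10.0, paused=True))  # paused cells are exempt
```

In the widget, the QTimer callback would presumably call this with the timestamp recorded by the time-pos observer and hand a positive result to the existing retry/escalation chain.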

Crash-loop guard (cell.py) — failures are timestamped; CRASH_LOOP_THRESHOLD within CRASH_LOOP_WINDOW_S parks the cell on a "media unavailable" card instead of hammering Emby, auto-resuming after CRASH_LOOP_COOLDOWN_S.
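The rolling-window check can be sketched as below. `count_recent` and `should_park` are named in this PR; their signatures and the three constant values here are illustrative assumptions only.

```python
from typing import Iterable

# Hypothetical values; the real defaults live in constants.py.
CRASH_LOOP_THRESHOLD = 5
CRASH_LOOP_WINDOW_S = 120.0


def count_recent(timestamps: Iterable[float], now: float, window_s: float) -> int:
    """Count failure timestamps that fall inside the last `window_s` seconds."""
    return sum(1 for t in timestamps if 0.0 <= now - t <= window_s)


def should_park(
    timestamps: Iterable[float],
    now: float,
    threshold: int = CRASH_LOOP_THRESHOLD,
    window_s: float = CRASH_LOOP_WINDOW_S,
) -> bool:
    """Park the cell once enough failures cluster inside the window."""
    return count_recent(timestamps, now, window_s) >= threshold


failures = [0.0, 10.0, 20.0, 30.0, 40.0]
print(should_park(failures, now=50.0))   # all five failures are recent
print(should_park(failures, now=125.0))  # the oldest failure has aged out
```

Because old failures age out of the window, an occasional failure spread over hours never accumulates into a park.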

_switching leak fix (cell.py) — the guard set in play() was only cleared on the first eof; the error branch returned without clearing it, so a later genuine EOF could be swallowed. Now cleared on both paths.
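To illustrate the leak, here is a toy stand-in for the cell. `FakeCell`, its `events` list, and the string reasons are hypothetical; only `play()`, `_handle_eof`, and `_switching` are names taken from this PR, and the real handler does more than this.

```python
class FakeCell:
    """Toy model of the guard logic; not the real cell.py class."""

    def __init__(self) -> None:
        self._switching = False
        self.events: list[str] = []

    def play(self, item: str) -> None:
        # Switching streams makes the old stream emit one end-file
        # that must be ignored, so arm the guard.
        self._switching = True
        self.events.append(f"play:{item}")

    def _handle_eof(self, reason: str) -> None:
        if reason == "error":
            # The fix: clear the guard on the error path too. Before the
            # fix this branch returned with the guard still armed.
            self._switching = False
            self.events.append("retry")
            return
        if self._switching:
            # Expected end-file from the stream being replaced.
            self._switching = False
            return
        self.events.append("advance")


cell = FakeCell()
cell.play("a")
cell._handle_eof("error")  # load failed
cell._handle_eof("eof")    # later genuine EOF must not be swallowed
print(cell.events)
```

Without the line that clears the guard in the error branch, the final `eof` would hit the armed guard and be dropped.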

Memory-aware cache budget (constants.py + wall.py): apply_cache_budget() scales per-cell demuxer_max_bytes so the grid total stays under CACHE_BUDGET_MB. A 6×6 grid (36 cells) at 512 MiB/cell would otherwise reach ~18 GiB; now it's ~85 MiB/cell under the 3072 MiB default. The controller assigns budgeted opts to every cell once the grid size is known.
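The arithmetic behind those numbers, as a hedged sketch. `scale_demuxer_mb` is named in this PR, but the signature, integer division, and the 16 MiB floor are assumptions chosen for illustration.

```python
def scale_demuxer_mb(
    per_cell_mb: int,
    n_cells: int,
    budget_mb: int,
    floor_mb: int = 16,  # hypothetical floor; the PR only says one exists
) -> int:
    """Shrink the per-cell demuxer cache so n_cells fit in budget_mb."""
    if n_cells <= 0:
        # Zero-cell safety: nothing to divide by, keep the requested size.
        return per_cell_mb
    fair_share = budget_mb // n_cells
    # Never raise above the requested size, never drop below the floor.
    return max(floor_mb, min(per_cell_mb, fair_share))


print(36 * 512)                          # unbudgeted 6x6 total in MiB
print(scale_demuxer_mb(512, 36, 3072))   # 6x6 under the 3072 MiB default
print(scale_demuxer_mb(512, 16, 1024))   # 16 cells under a 1024 MiB budget
```

With a floor in place, a very large grid can still exceed the budget in total (cells × floor), which seems an acceptable trade against starving each stream.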

All tunables are env-overridable (clamped) and documented in the README.
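One way a clamped environment override could look. `env_int` and the clamp bounds are hypothetical; only the variable name `HYPERWALL_CACHE_BUDGET_MB` and the 3072 MiB default appear in this PR.

```python
import os
from typing import Mapping, Optional


def env_int(
    name: str,
    default: int,
    lo: int,
    hi: int,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Read an integer tunable from the environment, clamped to [lo, hi]."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Unparseable input falls back to the default rather than crashing.
        return default
    return max(lo, min(hi, value))


# Bounds below are illustrative, not the project's real limits.
fake_env = {"HYPERWALL_CACHE_BUDGET_MB": "1024"}
print(env_int("HYPERWALL_CACHE_BUDGET_MB", 3072, 256, 65536, fake_env))
```

Passing the mapping explicitly keeps the helper testable without touching the real process environment.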

Validation (Linux)

16/16 new unit tests pass (tests/test_reliability.py): stall threshold boundaries, paused/seeking exemptions, rolling-window crash-loop aging, budget scaling + floor + zero-cell safety.
Repo guards still 8/8 (drift guard clean on all changed modules).
py_compile clean; env overrides verified at runtime (e.g. HYPERWALL_CACHE_BUDGET_MB=1024, 16 cells → 64 MiB/cell).
CI workflow updated to run the reliability suite.

⚠️ Still needs a Windows/live pass on skyhawk

Per the project guardrail (don't trust a fix from inspection alone), the runtime behaviors need a real check:

Induce a stall (pause the Emby stream / pull the network) and confirm the watchdog fires after ~20s and the cell recovers.
Point at a dead item repeatedly and confirm the cell parks + shows the card + resumes after cooldown.
Sanity-check RAM on a large grid (e.g. 4×4+) reflects the budget.
static=true DIRECT path is untouched — but smoke-test playback anyway.

Milestone: v10 · roadmap #5.

…et (Epic 2, #7)

Adds self-healing for an always-on wall, plus the _switching error-path leak fix. Pure decision logic lives in a new dependency-free reliability.py so it's unit-testable without PyQt/mpv/Emby.

Stall watchdog (cell.py):
- time-pos observer records forward-progress timestamps; a QTimer (WATCHDOG_INTERVAL_MS) flags a cell whose position hasn't advanced for STALL_TIMEOUT_S while actively playing, then runs the existing retry/escalation chain. Silent freezes (frozen frame / wedged decoder / network hang surviving reconnect) previously emitted no end-file or error and would hang forever.

Crash-loop guard (cell.py):
- failures are timestamped; CRASH_LOOP_THRESHOLD within CRASH_LOOP_WINDOW_S parks the cell on a "media unavailable" card and stops hammering Emby, auto-resuming after CRASH_LOOP_COOLDOWN_S.

_switching leak fix (cell.py):
- the guard set in play() is now cleared on the error branch of _handle_eof too (was only cleared on the first eof), so a later genuine EOF can't be swallowed.

Memory-aware cache budget (constants.py + wall.py):
- apply_cache_budget() scales per-cell demuxer_max_bytes so the grid total stays under CACHE_BUDGET_MB (a 6x6 grid at 512 MiB/cell would reach ~18 GB); controller assigns budgeted opts to every cell once grid size is known.

Tunables (all env-overridable, clamped) in constants.py; documented in README.

Tests: tests/test_reliability.py, 16 pure unit tests (stall boundaries, paused/seeking exemptions, rolling-window crash-loop, budget scaling + floor + zero-cell safety). Wired into the CI workflow. Repo guards still 8/8.

Refs #7

tcconnally added this to the v10 milestone Jul 2, 2026

tcconnally deleted the branch main July 2, 2026 14:21

tcconnally closed this Jul 2, 2026

tcconnally reopened this Jul 2, 2026

tcconnally changed the base branch from feat/epic1-identity-unification to main July 2, 2026 14:22

tcconnally merged commit 8233a12 into main Jul 2, 2026
2 checks passed

tcconnally deleted the feat/epic2-reliability branch July 2, 2026 14:24

tcconnally merged 1 commit into main from feat/epic2-reliability
