feat: 24/7 Reliability - stall watchdog + crash-loop guard + cache budget (Epic 2, #7) #13
Merged
…et (Epic 2, #7)

Adds self-healing for an always-on wall, plus the _switching error-path leak fix. Pure decision logic lives in a new dependency-free reliability.py so it's unit-testable without PyQt/mpv/Emby.

Stall watchdog (cell.py):
- time-pos observer records forward-progress timestamps; a QTimer (WATCHDOG_INTERVAL_MS) flags a cell whose position hasn't advanced for STALL_TIMEOUT_S while actively playing, then runs the existing retry/escalation chain. Silent freezes (frozen frame / wedged decoder / network hang surviving reconnect) previously emitted no end-file or error and would hang forever.

Crash-loop guard (cell.py):
- failures are timestamped; CRASH_LOOP_THRESHOLD within CRASH_LOOP_WINDOW_S parks the cell on a "media unavailable" card and stops hammering Emby, auto-resuming after CRASH_LOOP_COOLDOWN_S.

_switching leak fix (cell.py):
- the guard set in play() is now cleared on the error branch of _handle_eof too (was only cleared on the first eof), so a later genuine EOF can't be swallowed.

Memory-aware cache budget (constants.py + wall.py):
- apply_cache_budget() scales per-cell demuxer_max_bytes so the grid total stays under CACHE_BUDGET_MB (a 6x6 grid at 512 MiB/cell would reach ~18 GB); controller assigns budgeted opts to every cell once grid size is known.

Tunables (all env-overridable, clamped) in constants.py; documented in README.

Tests: tests/test_reliability.py: 16 pure unit tests (stall boundaries, paused/seeking exemptions, rolling-window crash-loop, budget scaling + floor + zero-cell safety). Wired into the CI workflow. Repo guards still 8/8.

Refs #7
Implements Epic 2: 24/7 Reliability & Self-Healing (closes #7) — the highest operational-value chunk of v10.
Why
The retry/escalation chain only fired on EOF or explicit error. A silent mid-stream freeze (frozen frame, wedged decoder, network hang that survives reconnect=1) emits neither, so the cell would sit dead forever. For an always-on wall that's the #1 reliability gap.
What changed
- New hyperwall/reliability.py: pure, dependency-free decision logic (is_stalled, count_recent, should_park, scale_demuxer_mb) so the self-healing behavior is unit-testable without PyQt/mpv/Emby.
- Stall watchdog (cell.py): the time-pos observer records forward-progress timestamps; a QTimer (WATCHDOG_INTERVAL_MS) flags a cell whose position hasn't advanced for STALL_TIMEOUT_S while actively playing (paused/seeking exempt), then runs the existing retry/escalation chain, with no new failure semantics.
- Crash-loop guard (cell.py): failures are timestamped; CRASH_LOOP_THRESHOLD within CRASH_LOOP_WINDOW_S parks the cell on a "media unavailable" card instead of hammering Emby, auto-resuming after CRASH_LOOP_COOLDOWN_S.
- _switching leak fix (cell.py): the guard set in play() was only cleared on the first eof; the error branch returned without clearing it, so a later genuine EOF could be swallowed. Now cleared on both paths.
- Memory-aware cache budget (constants.py + wall.py): apply_cache_budget() scales per-cell demuxer_max_bytes so the grid total stays under CACHE_BUDGET_MB. A 6×6 grid at 512 MiB/cell would otherwise reach ~18 GB; now it's ~85 MiB/cell under the 3072 MiB default. The controller assigns budgeted opts to every cell once the grid size is known.
All tunables are env-overridable (clamped) and documented in the README.
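For reviewers who want a feel for the stall and crash-loop decisions without opening the diff, here is a minimal sketch of how such pure helpers could look. The function names come from this PR, but the signatures, the inclusive `>=` boundary, and the handling of a missing progress timestamp are assumptions for illustration, not the merged code in hyperwall/reliability.py.

```python
from typing import Iterable, Optional


def is_stalled(last_progress: Optional[float], now: float, timeout_s: float,
               *, playing: bool, paused: bool = False,
               seeking: bool = False) -> bool:
    """Return True when an actively playing cell has made no forward
    progress for at least timeout_s seconds (hypothetical signature)."""
    # Paused or seeking cells are exempt, as is anything not playing.
    if not playing or paused or seeking:
        return False
    # Assumption: with no progress recorded yet, do not report a stall.
    if last_progress is None:
        return False
    return (now - last_progress) >= timeout_s


def count_recent(timestamps: Iterable[float], now: float,
                 window_s: float) -> int:
    """Count failure timestamps that fall inside the rolling window."""
    return sum(1 for t in timestamps if now - t <= window_s)


def should_park(timestamps: Iterable[float], now: float,
                window_s: float, threshold: int) -> bool:
    """Park the cell once enough failures land inside the window."""
    return count_recent(timestamps, now, window_s) >= threshold


if __name__ == "__main__":
    # Illustrative values only; the real tunables live in constants.py.
    print(is_stalled(100.0, 110.0, 10.0, playing=True))
    print(should_park([0.0, 50.0, 55.0, 58.0], 60.0, 30.0, 3))
```

Keeping the clock (`now`) as a parameter rather than calling `time.monotonic()` inside the helpers is what makes boundary cases such as "exactly at the timeout" straightforward to unit-test.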
Validation (Linux)
- 16 pure unit tests (tests/test_reliability.py): stall threshold boundaries, paused/seeking exemptions, rolling-window crash-loop aging, budget scaling + floor + zero-cell safety.
- py_compile clean; env overrides verified at runtime (e.g. HYPERWALL_CACHE_BUDGET_MB=1024, 16 cells → 64 MiB/cell).
Per the project guardrail (don't trust a fix from inspection alone), the runtime behaviors need a real check:
- static=true DIRECT path is untouched, but smoke-test playback anyway.
Milestone: v10 · roadmap #5.
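The budget figures quoted in this description (3072 MiB over a 6×6 grid giving ~85 MiB/cell, and 1024 MiB over 16 cells giving 64 MiB/cell) can be reproduced with a small sketch of the scaling rule. The name `scale_demuxer_mb` is from this PR; the signature, the integer division, and the 16 MiB floor value are assumptions for illustration and may differ from the merged code.

```python
def scale_demuxer_mb(per_cell_mb: int, cell_count: int, budget_mb: int,
                     floor_mb: int = 16) -> int:
    """Scale the per-cell demuxer cache so the grid total fits the budget.

    floor_mb is a hypothetical lower bound; the real floor is whatever
    constants.py defines.
    """
    # Zero-cell safety: nothing to divide by, keep the configured value.
    if cell_count <= 0:
        return per_cell_mb
    fair_share = budget_mb // cell_count
    # Never exceed the configured per-cell size, never drop below the floor.
    return max(floor_mb, min(per_cell_mb, fair_share))


if __name__ == "__main__":
    print(scale_demuxer_mb(512, 36, 3072))  # 6x6 grid, default budget
    print(scale_demuxer_mb(512, 16, 1024))  # env override example
```

Note that under this sketch the floor takes priority over the budget, so a very large grid could still exceed the total; whether the real implementation makes the same trade-off is not stated in the PR.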