miner: recover veblop producer stall when a build wedges with no pending task#2287
miner: recover veblop producer stall when a build wedges with no pending task#2287lucca30 wants to merge 1 commit into
Conversation
The veblop stall-fallback in newWorkLoop skipped recovery whenever pendingWorkBlock == currentBlock+1, trusting that value as proof the next block was being built. But commitWork only clears pendingWorkBlock via a deferred Store(0) that runs on return, so a build that blocks inside commitWork (e.g. parked on the txpool lock in fillTransactions) leaves pendingWorkBlock pinned with no sealing task. In that state the fallback short-circuited forever and the producer stalled until a process restart -- surfaced while stress-testing the veblop producer path (producer frozen at a fixed block, "veblop fallback skipping while chain is stale" logged every 30s with a static currentBlock and pendingTasksCount=0). Fix: - Only treat pendingWorkBlock as "work in flight" when a sealing task actually backs it. When pendingWorkBlock==next but no task exists, distinguish a wedged build from a slow-but-live one by how long the build has been outstanding (time since submit), not by block-timestamp age (meaningless at genesis). Recover only past a 3x-blocktime threshold so a live build is never interrupted. - Make the fallback re-commit a non-blocking send so newWorkLoop cannot itself wedge on the unbuffered newWorkCh when mainLoop is the blocked goroutine; it keeps emitting the stall warning and retrying every tick. The recovery logic is extracted into a pure decideVeblopFallback function with table-driven unit tests covering the wedged-build case, the genesis in-progress-build regression, and the no-work-claimed paths. Note: this restores producer-loop liveness and auto-recovers once mainLoop is unblocked. It does not by itself resolve a hard mainLoop deadlock (the underlying build-blocking concern is tracked separately).
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (86.36%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #2287 +/- ##
===========================================
+ Coverage 53.44% 53.47% +0.02%
===========================================
Files 896 896
Lines 159770 159840 +70
===========================================
+ Hits 85397 85479 +82
+ Misses 69054 69036 -18
- Partials 5319 5325 +6
... and 24 files with indirect coverage changes
🚀 New features to boost your workflow:
|



Problem
While stress-testing the veblop block-production path, a producer can enter a permanent stall: it stops producing/importing at a fixed height and recovers only after a process restart. Symptoms are
Possible producer stall: veblop fallback skipping while chain is stalelogged repeatedly with a staticcurrentBlockandpendingTasksCount=0— the producer's own stall detector fires, but the fallback never actually recovers.Root cause
newWorkLoop's veblop stall-fallback skipped recovery wheneverpendingWorkBlock == currentBlock+1, treating that value as proof the next block was being built. ButcommitWorkonly clearspendingWorkBlockvia adefer ...Store(0)that runs on return. If a build blocks insidecommitWorkand never returns,pendingWorkBlockstays pinned atcurrentBlock+1with no sealing task (pendingTasksempty). In that state the fallback short-circuited forever — the producer never resubmitted work, and the only escape was a restart.(A build blocking inside
commitWorkis a separate concern being investigated independently. This PR addresses the recovery path so that such a hang is no longer permanent.)Fix
pendingWorkBlock == nextas "work in flight" only when a sealing task actually backs it. WhenpendingWorkBlock == nextbut no task exists, it distinguishes a wedged build from a slow-but-live one by how long the build has been outstanding (time since submit), not by block-timestamp age (chainAgeis meaningless when the head carries an old timestamp, e.g. genesis). Recovery fires only past a 3×-blocktime threshold, so a live build is never interrupted.newWorkLoopcannot itself wedge on the unbufferednewWorkChwhenmainLoopis the blocked goroutine. It keeps emitting the stall warning and retrying every tick, and auto-recovers the momentmainLoopis unblocked.decideVeblopFallbackfunction so it can be unit-tested without driving the whole loop.Testing
TestDecideVeblopFallbackcovering: the wedged-build case (the bug), the genesis in-progress-build case (regression guard against interrupting a live build), task-in-flight, the no-work-claimed paths, and the wedge threshold boundaries.miner/package suite passes (this caught an over-aggressive first attempt that usedchainAgeand interrupted live builds at genesis).Scope / follow-ups
This restores producer-loop liveness and auto-recovery once the underlying build unblocks. It does not by itself resolve a hard
mainLoopdeadlock; that underlying build-blocking concern is tracked separately.