Skip to content

miner: recover veblop producer stall when a build wedges with no pending task#2287

Open
lucca30 wants to merge 1 commit into
developfrom
lmartins/veblop-stall-fallback-fix
Open

miner: recover veblop producer stall when a build wedges with no pending task#2287
lucca30 wants to merge 1 commit into
developfrom
lmartins/veblop-stall-fallback-fix

Conversation

@lucca30

@lucca30 lucca30 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Problem

While stress-testing the veblop block-production path, a producer can enter a permanent stall: it stops producing/importing at a fixed height and recovers only after a process restart. Symptoms are Possible producer stall: veblop fallback skipping while chain is stale logged repeatedly with a static currentBlock and pendingTasksCount=0 — the producer's own stall detector fires, but the fallback never actually recovers.

Root cause

newWorkLoop's veblop stall-fallback skipped recovery whenever pendingWorkBlock == currentBlock+1, treating that value as proof the next block was being built. But commitWork only clears pendingWorkBlock via a defer ...Store(0) that runs on return. If a build blocks inside commitWork and never returns, pendingWorkBlock stays pinned at currentBlock+1 with no sealing task (pendingTasks empty). In that state the fallback short-circuited forever — the producer never resubmitted work, and the only escape was a restart.

(A build blocking inside commitWork is a separate concern being investigated independently. This PR addresses the recovery path so that such a hang is no longer permanent.)

Fix

  • The fallback now treats pendingWorkBlock == next as "work in flight" only when a sealing task actually backs it. When pendingWorkBlock == next but no task exists, it distinguishes a wedged build from a slow-but-live one by how long the build has been outstanding (time since submit), not by block-timestamp age (chainAge is meaningless when the head carries an old timestamp, e.g. genesis). Recovery fires only past a 3×-blocktime threshold, so a live build is never interrupted.
  • The fallback re-commit uses a non-blocking send, so newWorkLoop cannot itself wedge on the unbuffered newWorkCh when mainLoop is the blocked goroutine. It keeps emitting the stall warning and retrying every tick, and auto-recovers the moment mainLoop is unblocked.
  • The decision is extracted into a pure decideVeblopFallback function so it can be unit-tested without driving the whole loop.

Testing

  • New table-driven unit test TestDecideVeblopFallback covering: the wedged-build case (the bug), the genesis in-progress-build case (regression guard against interrupting a live build), task-in-flight, the no-work-claimed paths, and the wedge threshold boundaries.
  • Verified as a true regression test: the suite goes red against the pre-fix logic (the wedged-build cases return "skip" instead of "recommit") and green against this fix.
  • Full miner/ package suite passes (this caught an over-aggressive first attempt that used chainAge and interrupted live builds at genesis).

Scope / follow-ups

This restores producer-loop liveness and auto-recovery once the underlying build unblocks. It does not by itself resolve a hard mainLoop deadlock; that underlying build-blocking concern is tracked separately.

The veblop stall-fallback in newWorkLoop skipped recovery whenever
pendingWorkBlock == currentBlock+1, trusting that value as proof the next
block was being built. But commitWork only clears pendingWorkBlock via a
deferred Store(0) that runs on return, so a build that blocks inside
commitWork (e.g. parked on the txpool lock in fillTransactions) leaves
pendingWorkBlock pinned with no sealing task. In that state the fallback
short-circuited forever and the producer stalled until a process restart --
surfaced while stress-testing the veblop producer path (producer frozen at a
fixed block, "veblop fallback skipping while chain is stale" logged every 30s
with a static currentBlock and pendingTasksCount=0).

Fix:
- Only treat pendingWorkBlock as "work in flight" when a sealing task
  actually backs it. When pendingWorkBlock==next but no task exists,
  distinguish a wedged build from a slow-but-live one by how long the build
  has been outstanding (time since submit), not by block-timestamp age
  (meaningless at genesis). Recover only past a 3x-blocktime threshold so a
  live build is never interrupted.
- Make the fallback re-commit a non-blocking send so newWorkLoop cannot
  itself wedge on the unbuffered newWorkCh when mainLoop is the blocked
  goroutine; it keeps emitting the stall warning and retrying every tick.

The recovery logic is extracted into a pure decideVeblopFallback function
with table-driven unit tests covering the wedged-build case, the genesis
in-progress-build regression, and the no-work-claimed paths.

Note: this restores producer-loop liveness and auto-recovers once mainLoop
is unblocked. It does not by itself resolve a hard mainLoop deadlock (the
underlying build-blocking concern is tracked separately).

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@sonarqubecloud

Copy link
Copy Markdown

@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.36364% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.47%. Comparing base (5df879f) to head (1236875).

Files with missing lines Patch % Lines
miner/worker.go 86.36% 10 Missing and 2 partials ⚠️

❌ Your patch check has failed because the patch coverage (86.36%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2287      +/-   ##
===========================================
+ Coverage    53.44%   53.47%   +0.02%     
===========================================
  Files          896      896              
  Lines       159770   159840      +70     
===========================================
+ Hits         85397    85479      +82     
+ Misses       69054    69036      -18     
- Partials      5319     5325       +6     
Files with missing lines Coverage Δ
miner/worker.go 77.93% <86.36%> (-0.08%) ⬇️

... and 24 files with indirect coverage changes

Files with missing lines Coverage Δ
miner/worker.go 77.93% <86.36%> (-0.08%) ⬇️

... and 24 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant