fix: catch uncaught exception in onApply to prevent apply pipeline stall#1246
fix: catch uncaught exception in onApply to prevent apply pipeline stall#1246Qian-Cheng-nju wants to merge 1 commit intosofastack:masterfrom
Conversation
📝 WalkthroughWalkthroughThe changes add error handling for uncaught exceptions in the state machine's Changes
Sequence DiagramsequenceDiagram
participant FSM as FSMCaller
participant Task as doApplyTasks
participant Handler as User StateMachine
participant Iter as Iterator
rect rgba(255, 0, 0, 0.5)
Note over FSM,Iter: Before: Exception Causes Infinite Stall
FSM->>Task: execute(iterImpl)
Task->>Handler: onApply(entry)
Handler--XTask: throws Throwable
Note over Task: Exception not caught<br/>lastAppliedIndex not advanced
Task-->>FSM: exception bubbles up
Note over FSM: Next doCommitted() retries<br/>same entry → infinite loop
end
rect rgba(0, 255, 0, 0.5)
Note over FSM,Iter: After: Exception Handled Gracefully
FSM->>Task: try { execute(iterImpl) }
Task->>Handler: onApply(entry)
Handler--XTask: throws Throwable
Task-->>FSM: catch(Throwable)
FSM->>FSM: log error
FSM->>Iter: setErrorAndRollback(1, ESTATEMACHINE)
FSM->>FSM: break (stop processing)
Note over FSM: Node transitions to ERROR state<br/>No infinite retry, callbacks notified
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip Migrating from UI to YAML configuration.Use the |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java (1)
150-151: Replace fixed sleep with bounded condition wait to reduce test flakiness.
Thread.sleep(3000)can be timing-sensitive and slow CI; prefer polling with timeout until the expected condition is met.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java` around lines 150 - 151, In OnApplyExceptionTest replace the fixed Thread.sleep(3000) with a bounded condition wait: poll the expected condition (that the exception fired and the error state has been set) in a short-interval loop or use a synchronization primitive (e.g., CountDownLatch.await with timeout or Awaitility) until it becomes true or a reasonable timeout elapses; update the check around the Thread.sleep location in OnApplyExceptionTest to break out early when the condition is met and fail the test if the timeout is reached to avoid flakiness and long CI waits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java`:
- Around line 143-164: Add assertions that verify the poison task's closure was
executed and that it received an error status: change the Task callback passed
to node.apply to record the callback firing and capture the provided status (use
the existing poisonCallbackFired and poisonLatch and add an AtomicReference for
the status, e.g., poisonStatusRef), then wait on poisonLatch (with a timeout)
and assert poisonCallbackFired is true and poisonStatusRef.get() is non-null and
indicates an error state; place these checks after the Thread.sleep() /
exception count assertions to validate remaining closures are notified with an
error.
---
Nitpick comments:
In
`@jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java`:
- Around line 150-151: In OnApplyExceptionTest replace the fixed
Thread.sleep(3000) with a bounded condition wait: poll the expected condition
(that the exception fired and the error state has been set) in a short-interval
loop or use a synchronization primitive (e.g., CountDownLatch.await with timeout
or Awaitility) until it becomes true or a reasonable timeout elapses; update the
check around the Thread.sleep location in OnApplyExceptionTest to break out
early when the condition is met and fail the test if the timeout is reached to
avoid flakiness and long CI waits.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 86437b88-1364-4362-9147-2bf12409511a
📒 Files selected for processing (2)
jraft-core/src/main/java/com/alipay/sofa/jraft/core/FSMCallerImpl.javajraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java
| final CountDownLatch poisonLatch = new CountDownLatch(1); | ||
| final AtomicBoolean poisonCallbackFired = new AtomicBoolean(false); | ||
| node.apply(new Task(ByteBuffer.wrap(poisonPayload.getBytes()), status -> { | ||
| poisonCallbackFired.set(true); | ||
| poisonLatch.countDown(); | ||
| })); | ||
|
|
||
| // Wait for the exception to fire and the error to be set | ||
| Thread.sleep(3000); | ||
|
|
||
| // With the fix: the exception is caught once, the node enters error | ||
| // state, and the closures are properly notified. The exception should | ||
| // fire exactly once (not repeatedly as in the stalled case). | ||
| final long finalExceptionCount = exceptionCount.get(); | ||
| System.out.println("Exception count: " + finalExceptionCount); | ||
|
|
||
| // The exception should have fired at most a small number of times | ||
| // (ideally once). Before the fix, it would fire continuously. | ||
| assertTrue("Exception should have fired", finalExceptionCount >= 1); | ||
| assertTrue("Exception should not fire in an infinite loop (was " + finalExceptionCount + ")", | ||
| finalExceptionCount <= 3); | ||
|
|
There was a problem hiding this comment.
Test misses assertion for poison task callback/error status.
The test sets poisonCallbackFired and poisonLatch (Line 143–Line 148) but never asserts them, so it does not actually validate that remaining closures are notified with error as described.
✅ Suggested test assertion patch
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.atomic.AtomicReference;
@@
final CountDownLatch poisonLatch = new CountDownLatch(1);
final AtomicBoolean poisonCallbackFired = new AtomicBoolean(false);
+ final AtomicReference<Status> poisonStatus = new AtomicReference<>();
node.apply(new Task(ByteBuffer.wrap(poisonPayload.getBytes()), status -> {
+ poisonStatus.set(status);
poisonCallbackFired.set(true);
poisonLatch.countDown();
}));
@@
assertTrue("Exception should have fired", finalExceptionCount >= 1);
assertTrue("Exception should not fire in an infinite loop (was " + finalExceptionCount + ")",
finalExceptionCount <= 3);
+ assertTrue("Poison callback should be invoked", poisonLatch.await(5, TimeUnit.SECONDS));
+ assertTrue("Poison callback flag should be true", poisonCallbackFired.get());
+ assertNotNull("Poison callback status should be captured", poisonStatus.get());
+ assertFalse("Poison callback should receive error status", poisonStatus.get().isOk());🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java`
around lines 143 - 164, Add assertions that verify the poison task's closure was
executed and that it received an error status: change the Task callback passed
to node.apply to record the callback firing and capture the provided status (use
the existing poisonCallbackFired and poisonLatch and add an AtomicReference for
the status, e.g., poisonStatusRef), then wait on poisonLatch (with a timeout)
and assert poisonCallbackFired is true and poisonStatusRef.get() is non-null and
indicates an error state; place these checks after the Thread.sleep() /
exception count assertions to validate remaining closures are notified with an
error.
Motivation
If
StateMachine.onApply()throws (NPE, deserialization error, etc.), the exception propagates out ofdoApplyTasks()(line 562), skippingsetLastApplied()(line 572).lastAppliedIndexis never advanced. NextdoCommitted()retries the same entries, hits the same exception, loops forever. The node is alive but the FSM is permanently stuck.Modification
Catch
ThrowablearounddoApplyTasks(), callsetErrorAndRollback()to enter the existing error handling path. After the catch,setError()+runTheRestClosureWithError()run normally, andsetLastApplied()advances the index.Test:
OnApplyExceptionTest#testOnApplyExceptionSetsErrorInsteadOfStalling— single-node cluster with aStateMachinethat throws on a specific entry. Exception fires exactly once (before the fix it grows unbounded).Result
Fixes #1244 (item 3).
Summary by CodeRabbit