Skip to content

fix: catch uncaught exception in onApply to prevent apply pipeline stall#1246

Open
Qian-Cheng-nju wants to merge 1 commit intosofastack:masterfrom
specula-org:fix/onapply-exception-stalls-pipeline
Open

fix: catch uncaught exception in onApply to prevent apply pipeline stall#1246
Qian-Cheng-nju wants to merge 1 commit intosofastack:masterfrom
specula-org:fix/onapply-exception-stalls-pipeline

Conversation

@Qian-Cheng-nju
Copy link
Contributor

@Qian-Cheng-nju Qian-Cheng-nju commented Mar 16, 2026

Motivation

If StateMachine.onApply() throws (NPE, deserialization error, etc.), the exception propagates out of doApplyTasks() (line 562), skipping setLastApplied() (line 572). lastAppliedIndex is never advanced. Next doCommitted() retries the same entries, hits the same exception, loops forever. The node is alive but the FSM is permanently stuck.

Modification

Catch Throwable around doApplyTasks(), call setErrorAndRollback() to enter the existing error handling path. After the catch, setError() + runTheRestClosureWithError() run normally, and setLastApplied() advances the index.

Test: OnApplyExceptionTest#testOnApplyExceptionSetsErrorInsteadOfStalling — single-node cluster with a StateMachine that throws on a specific entry. Exception fires exactly once (before the fix it grows unbounded).

Result

Fixes #1244 (item 3).

Summary by CodeRabbit

  • Bug Fixes
    • Improved exception handling during log application. When a state machine encounters an error, the node now properly transitions to an error state instead of stalling, ensuring completion callbacks receive appropriate error notifications.

@sofastack-cla sofastack-cla bot added bug Something isn't working cla:yes size/L labels Mar 16, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

Walkthrough

The changes add error handling for uncaught exceptions in the state machine's onApply() method by wrapping task execution in a try-catch block, and include a comprehensive test that validates the exception is caught and prevents infinite retry loops.

Changes

Cohort / File(s) Summary
FSM Exception Handling
jraft-core/src/main/java/com/alipay/sofa/jraft/core/FSMCallerImpl.java
Wrapped doApplyTasks() execution in try-catch to catch Throwable, log the error, invoke setErrorAndRollback() with RaftError.ESTATEMACHINE status, and break the loop to prevent stalling when user state machine throws an exception.
Exception Test Coverage
jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java
New test class validating that uncaught exceptions in StateMachine.onApply() during log application are caught, set the node to error state, don't cause infinite retry loops, and deliver error status to completion callbacks.

Sequence Diagram

sequenceDiagram
    participant FSM as FSMCaller
    participant Task as doApplyTasks
    participant Handler as User StateMachine
    participant Iter as Iterator

    rect rgba(255, 0, 0, 0.5)
    Note over FSM,Iter: Before: Exception Causes Infinite Stall
    FSM->>Task: execute(iterImpl)
    Task->>Handler: onApply(entry)
    Handler--XTask: throws Throwable
    Note over Task: Exception not caught<br/>lastAppliedIndex not advanced
    Task-->>FSM: exception bubbles up
    Note over FSM: Next doCommitted() retries<br/>same entry → infinite loop
    end

    rect rgba(0, 255, 0, 0.5)
    Note over FSM,Iter: After: Exception Handled Gracefully
    FSM->>Task: try { execute(iterImpl) }
    Task->>Handler: onApply(entry)
    Handler--XTask: throws Throwable
    Task-->>FSM: catch(Throwable)
    FSM->>FSM: log error
    FSM->>Iter: setErrorAndRollback(1, ESTATEMACHINE)
    FSM->>FSM: break (stop processing)
    Note over FSM: Node transitions to ERROR state<br/>No infinite retry, callbacks notified
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug, size/S

Suggested reviewers

  • fengjiachun

Poem

🐰 When errors leap in FSM's domain,
We catch them swift to break the chain!
No stalling loops or endless dread—
Just graceful handling instead.
Exception tamed, the pipeline flows,
And every test its virtue shows! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary change: catching uncaught exceptions in onApply to prevent apply pipeline stall.
Linked Issues check ✅ Passed The PR directly addresses objective #3 from issue #1244 by wrapping doApplyTasks() with a catch(Throwable) that calls setErrorAndRollback() to prevent FSM apply pipeline stall when StateMachine.onApply() throws.
Out of Scope Changes check ✅ Passed All code changes are scoped to fixing the onApply exception handling issue; no unrelated modifications to doCommitted(), test infrastructure, or other objectives are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

Migrating from UI to YAML configuration.

Use the @coderabbitai configuration command in a PR comment to get a dump of all your UI settings in YAML format. You can then edit this YAML file and upload it to the root of your repository to configure CodeRabbit programmatically.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java (1)

150-151: Replace fixed sleep with bounded condition wait to reduce test flakiness.

Thread.sleep(3000) can be timing-sensitive and slow CI; prefer polling with timeout until the expected condition is met.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java`
around lines 150 - 151, In OnApplyExceptionTest replace the fixed
Thread.sleep(3000) with a bounded condition wait: poll the expected condition
(that the exception fired and the error state has been set) in a short-interval
loop or use a synchronization primitive (e.g., CountDownLatch.await with timeout
or Awaitility) until it becomes true or a reasonable timeout elapses; update the
check around the Thread.sleep location in OnApplyExceptionTest to break out
early when the condition is met and fail the test if the timeout is reached to
avoid flakiness and long CI waits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java`:
- Around line 143-164: Add assertions that verify the poison task's closure was
executed and that it received an error status: change the Task callback passed
to node.apply to record the callback firing and capture the provided status (use
the existing poisonCallbackFired and poisonLatch and add an AtomicReference for
the status, e.g., poisonStatusRef), then wait on poisonLatch (with a timeout)
and assert poisonCallbackFired is true and poisonStatusRef.get() is non-null and
indicates an error state; place these checks after the Thread.sleep() /
exception count assertions to validate remaining closures are notified with an
error.

---

Nitpick comments:
In
`@jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java`:
- Around line 150-151: In OnApplyExceptionTest replace the fixed
Thread.sleep(3000) with a bounded condition wait: poll the expected condition
(that the exception fired and the error state has been set) in a short-interval
loop or use a synchronization primitive (e.g., CountDownLatch.await with timeout
or Awaitility) until it becomes true or a reasonable timeout elapses; update the
check around the Thread.sleep location in OnApplyExceptionTest to break out
early when the condition is met and fail the test if the timeout is reached to
avoid flakiness and long CI waits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 86437b88-1364-4362-9147-2bf12409511a

📥 Commits

Reviewing files that changed from the base of the PR and between 9f6f5dc and 747910e.

📒 Files selected for processing (2)
  • jraft-core/src/main/java/com/alipay/sofa/jraft/core/FSMCallerImpl.java
  • jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java

Comment on lines +143 to +164
final CountDownLatch poisonLatch = new CountDownLatch(1);
final AtomicBoolean poisonCallbackFired = new AtomicBoolean(false);
node.apply(new Task(ByteBuffer.wrap(poisonPayload.getBytes()), status -> {
poisonCallbackFired.set(true);
poisonLatch.countDown();
}));

// Wait for the exception to fire and the error to be set
Thread.sleep(3000);

// With the fix: the exception is caught once, the node enters error
// state, and the closures are properly notified. The exception should
// fire exactly once (not repeatedly as in the stalled case).
final long finalExceptionCount = exceptionCount.get();
System.out.println("Exception count: " + finalExceptionCount);

// The exception should have fired at most a small number of times
// (ideally once). Before the fix, it would fire continuously.
assertTrue("Exception should have fired", finalExceptionCount >= 1);
assertTrue("Exception should not fire in an infinite loop (was " + finalExceptionCount + ")",
finalExceptionCount <= 3);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Test misses assertion for poison task callback/error status.

The test sets poisonCallbackFired and poisonLatch (Line 143–Line 148) but never asserts them, so it does not actually validate that remaining closures are notified with error as described.

✅ Suggested test assertion patch
 import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.atomic.AtomicReference;
@@
         final CountDownLatch poisonLatch = new CountDownLatch(1);
         final AtomicBoolean poisonCallbackFired = new AtomicBoolean(false);
+        final AtomicReference<Status> poisonStatus = new AtomicReference<>();
         node.apply(new Task(ByteBuffer.wrap(poisonPayload.getBytes()), status -> {
+            poisonStatus.set(status);
             poisonCallbackFired.set(true);
             poisonLatch.countDown();
         }));
@@
         assertTrue("Exception should have fired", finalExceptionCount >= 1);
         assertTrue("Exception should not fire in an infinite loop (was " + finalExceptionCount + ")",
             finalExceptionCount <= 3);
+        assertTrue("Poison callback should be invoked", poisonLatch.await(5, TimeUnit.SECONDS));
+        assertTrue("Poison callback flag should be true", poisonCallbackFired.get());
+        assertNotNull("Poison callback status should be captured", poisonStatus.get());
+        assertFalse("Poison callback should receive error status", poisonStatus.get().isOk());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@jraft-core/src/test/java/com/alipay/sofa/jraft/core/OnApplyExceptionTest.java`
around lines 143 - 164, Add assertions that verify the poison task's closure was
executed and that it received an error status: change the Task callback passed
to node.apply to record the callback firing and capture the provided status (use
the existing poisonCallbackFired and poisonLatch and add an AtomicReference for
the status, e.g., poisonStatusRef), then wait on poisonLatch (with a timeout)
and assert poisonCallbackFired is true and poisonStatusRef.get() is non-null and
indicates an error state; place these checks after the Thread.sleep() /
exception count assertions to validate remaining closures are notified with an
error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cla:yes size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Heads-up: additional bugs found

1 participant