feature: implement early rollback of global transactions when TM disconnects #7674
base: 2.x
…onnects
- TMDisconnectHandler interface for handling TM disconnect events
- DefaultTMDisconnectHandler implementation with VGroup-based matching
- Configuration option: server.enableRollbackWhenDisconnect (default: false)
- Integration with AbstractNettyRemotingServer for TM disconnect detection
- Comprehensive test coverage with unit and integration tests
Pull Request Overview
This PR implements early rollback of global transactions when a Transaction Manager (TM) disconnects, improving system performance by reducing resource lock duration from 60 seconds to under 1 second.
Key Changes:
- Added TM disconnect detection and handling infrastructure
- Implemented VGroup-based transaction matching for early rollback
- Added configuration toggle for the feature (disabled by default)
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| TMDisconnectHandler.java | Interface for handling TM disconnect events |
| DefaultTMDisconnectHandler.java | Implementation with VGroup/ApplicationId matching logic |
| AbstractNettyRemotingServer.java | Integration of TM disconnect detection in server |
| DefaultCoordinator.java | Initialization of TM disconnect handler |
| ConfigurationKeys.java & DefaultValues.java | Configuration constants for feature toggle |
| Test files | Comprehensive unit and integration tests |
} catch (TransactionException e) {
    LOGGER.error(
        "Failed to rollback transaction [{}] {} {}",
Copilot
AI
Oct 9, 2025
The error message format is unclear with three consecutive placeholders without descriptive text. Consider a more descriptive format like 'Failed to rollback transaction [{}]: code={}, message={}'
Suggested change:
- "Failed to rollback transaction [{}] {} {}",
+ "Failed to rollback transaction [{}]: code={}, message={}",
 *
 * @return the session manager
 */
public SessionManager getSessionManager() {
Copilot
AI
Oct 9, 2025
This method is exposed solely for testing but exists in production code. Consider using package-private visibility instead of public, or use dependency injection to make the code more testable.
Suggested change:
- public SessionManager getSessionManager() {
+ SessionManager getSessionManager() {
Where did you handle the |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 2.x #7674 +/- ##
============================================
+ Coverage 61.05% 61.76% +0.70%
- Complexity 670 684 +14
============================================
Files 1316 1325 +9
Lines 49804 50104 +300
Branches 5855 5917 +62
============================================
+ Hits 30407 30945 +538
+ Misses 16689 16373 -316
- Partials 2708 2786 +78
@YongGoose Flow:
|
Pull Request Overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
/**
 * Get the session manager instance. Made public for testing purposes.
 *
 * @return the session manager
 */
public SessionManager getSessionManager() {
    return SessionHolder.getRootSessionManager();
}
Copilot
AI
Nov 16, 2025
The getSessionManager() method appears to be unused. The comment states it's "Made public for testing purposes," but it's not used in any test files (DefaultTMDisconnectHandlerTest.java or TMDisconnectIntegrationTest.java). The method simply delegates to SessionHolder.getRootSessionManager(), which tests can call directly. Consider removing this method to reduce unnecessary public API surface.
Suggested change:
- /**
-  * Get the session manager instance. Made public for testing purposes.
-  *
-  * @return the session manager
-  */
- public SessionManager getSessionManager() {
-     return SessionHolder.getRootSessionManager();
- }
YongGoose
left a comment
LGTM
changes/en-us/2.x.md and changes/zh-cn/2.x.md.
Summary
When a Transaction Manager (TM) client disconnects unexpectedly in a microservice environment, global transactions remain orphaned in "Begin" status until timeout (typically 30+ seconds). This causes unnecessary resource locks, blocking other transactions and degrading system performance during failure scenarios.
This PR introduces fail-fast transaction cleanup by detecting TM disconnections and immediately rolling back orphaned transactions, reducing resource blocking from 30+ seconds to under 1 second.
Problem
In production microservice deployments, TM clients can disconnect due to:
When this occurs, the current behavior is problematic:
Current State: Global transactions started by the disconnected TM remain in "Begin" status, holding:
User Impact: A 5-second network partition results in 30+ seconds of resource blocking, multiplied across all in-flight transactions. In high-throughput systems processing hundreds of transactions per second, this amplifies temporary failures into prolonged availability issues. Downstream services experience cascading performance degradation as they wait for locks to be released.
Root Cause: Seata's transaction lifecycle is designed to react to explicit TM commands (commit/rollback). No proactive cleanup mechanism responds to TM connection state, so the only cleanup path is to wait passively for the timeout.
Solution
Introduced connection-aware transaction lifecycle management that treats TM disconnection as a transaction lifecycle event, not just a network event.
The approach enables fail-fast cleanup through:
1. Event-Driven Detection
Hook into existing Netty channel lifecycle to detect TM disconnections in real-time, rather than discovering orphaned transactions only during timeout sweeps.
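The event-driven idea can be sketched in plain Java, without Netty, so the example stays self-contained. `DisconnectListener`, `ConnectionRegistry`, and the role strings are hypothetical names used only for illustration; in this PR the hook lives in AbstractNettyRemotingServer and the callback is TMDisconnectHandler, whose actual signatures are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical callback, standing in for the PR's TMDisconnectHandler. */
interface DisconnectListener {
    void onTmDisconnect(String channelId);
}

/** Hypothetical registry that remembers which channels belong to TM clients. */
class ConnectionRegistry {
    private final Map<String, String> roleByChannel = new ConcurrentHashMap<>();
    private final List<DisconnectListener> listeners = new ArrayList<>();

    void addListener(DisconnectListener listener) {
        listeners.add(listener);
    }

    void register(String channelId, String role) {
        roleByChannel.put(channelId, role);
    }

    /** Assumed to be called by the transport layer when a channel closes. */
    void channelClosed(String channelId) {
        // Removing the entry makes a repeated close event a no-op.
        String role = roleByChannel.remove(channelId);
        if ("TM".equals(role)) {
            for (DisconnectListener listener : listeners) {
                listener.onTmDisconnect(channelId);
            }
        }
    }
}
```

In this sketch only channels registered with the TM role trigger the callback, so a resource manager disconnect is ignored.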
2. Conservative Matching Strategy
Two-level identification prevents false positives:
Only transactions in "Begin" status are candidates. Transactions already in terminal or transitional states (Committing, Rollbacking, etc.) are never touched.
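A minimal sketch of this filter, assuming the two levels are the transaction service group (VGroup) and the application ID mentioned in the file summary, and that both must match. `TxInfo`, `TxStatus`, and `DisconnectMatcher` are hypothetical types invented for this example, not Seata classes.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical subset of global transaction states. */
enum TxStatus { BEGIN, COMMITTING, ROLLBACKING, TIMEOUT_ROLLBACKING }

/** Hypothetical view of a global session; not Seata's GlobalSession. */
final class TxInfo {
    final String xid;
    final String vgroup;
    final String applicationId;
    final TxStatus status;

    TxInfo(String xid, String vgroup, String applicationId, TxStatus status) {
        this.xid = xid;
        this.vgroup = vgroup;
        this.applicationId = applicationId;
        this.status = status;
    }
}

final class DisconnectMatcher {
    /** Returns the XIDs that may be rolled back early for the disconnected TM. */
    static List<String> candidates(List<TxInfo> sessions, String vgroup, String applicationId) {
        return sessions.stream()
                // Level 0: never touch anything that has left the Begin state.
                .filter(s -> s.status == TxStatus.BEGIN)
                // Level 1: same transaction service group (assumed).
                .filter(s -> s.vgroup.equals(vgroup))
                // Level 2: same application ID (assumed).
                .filter(s -> s.applicationId.equals(applicationId))
                .map(s -> s.xid)
                .collect(Collectors.toList());
    }
}
```

Requiring every condition to hold keeps the filter conservative: a session that fails any check is left for the normal timeout path.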
3. Opt-In Safety Model
Disabled by default (server.enableRollbackWhenDisconnect=false) to ensure zero impact on existing deployments. Users must explicitly enable after validating their transaction patterns.
4. Standard Rollback Semantics
Reuses the existing timeout rollback path, transitioning transactions to "TimeoutRollbacking" status before invoking standard cleanup logic. This ensures consistency with normal failure handling and preserves audit trails.
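The transition described here can be sketched as follows. `EarlyRollback`, its nested `Session` and `Status` types, and the `RollbackPath` callback are hypothetical stand-ins; the sketch only assumes that the status is changed to TimeoutRollbacking before the shared rollback logic runs, as the paragraph above states.

```java
final class EarlyRollback {
    /** Hypothetical subset of global transaction states. */
    enum Status { BEGIN, COMMITTING, TIMEOUT_ROLLBACKING }

    /** Hypothetical mutable session; not Seata's GlobalSession. */
    static final class Session {
        final String xid;
        Status status;

        Session(String xid, Status status) {
            this.xid = xid;
            this.status = status;
        }
    }

    /** Stand-in for the existing timeout rollback logic that is reused. */
    interface RollbackPath {
        void rollback(Session session);
    }

    /** Returns true if the session was handed to the rollback path. */
    static boolean rollbackEarly(Session session, RollbackPath path) {
        if (session.status != Status.BEGIN) {
            return false; // another component already owns this transaction
        }
        // Mark the session first, so the rollback path sees the same state
        // it would see after an ordinary timeout.
        session.status = Status.TIMEOUT_ROLLBACKING;
        path.rollback(session);
        return true;
    }
}
```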
Impact
What Changes for Users
No Breaking Changes
New Capability When Enabled
Users gain proactive failure recovery:
Configuration
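The toggle named in this PR is `server.enableRollbackWhenDisconnect`, which defaults to `false`. A minimal sketch of reading such a flag with an opt-in default is shown below; it uses `java.util.Properties` as a stand-in, and `DisconnectRollbackConfig` is a hypothetical class. Seata's own configuration API and the constants added in ConfigurationKeys.java and DefaultValues.java are not reproduced here.

```java
import java.util.Properties;

final class DisconnectRollbackConfig {
    /** Key taken from the PR description. */
    static final String KEY = "server.enableRollbackWhenDisconnect";

    /** The feature stays off unless the key is explicitly set to true. */
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "false"));
    }
}
```

Because a missing key resolves to `false`, a deployment that never sets the key keeps the existing timeout-only behavior.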
Closes #4422