Skip to content

Conversation

@kssumin
Copy link

@kssumin kssumin commented Sep 29, 2025

  • I have read the CONTRIBUTING.md guidelines.
  • I have registered the PR changes in changes/en-us/2.x.md and changes/zh-cn/2.x.md.

Summary

When a Transaction Manager (TM) client disconnects unexpectedly in a microservice environment, global transactions remain orphaned in "Begin" status until timeout (typically 30+ seconds). This causes unnecessary resource locks, blocking other transactions and degrading system performance during failure scenarios.

This PR introduces fail-fast transaction cleanup by detecting TM disconnections and immediately rolling back orphaned transactions, reducing resource blocking from 30+ seconds to under 1 second.

Problem

In production microservice deployments, TM clients can disconnect due to:

  • Network partitions or instability
  • Container restarts in Kubernetes environments
  • Process crashes or OOM kills
  • Service scaling operations

When this occurs, the current behavior is problematic:

Current State: Global transactions started by the disconnected TM remain in "Begin" status, holding:

  • Database row locks and table locks
  • Connection pool resources
  • Memory on TC server for session state

User Impact: A 5-second network partition results in 30+ seconds of resource blocking, multiplied across all in-flight transactions. In high-throughput systems processing hundreds of transactions per second, this amplifies temporary failures into prolonged availability issues. Downstream services experience cascading performance degradation as they wait for locks to be released.

Root Cause: Seata's transaction lifecycle is designed to react to explicit TM commands (commit/rollback). There is no proactive cleanup mechanism that responds to TM connection state—the only cleanup path is passive timeout waiting.

Solution

Introduced connection-aware transaction lifecycle management that treats TM disconnection as a transaction lifecycle event, not just a network event.

The approach enables fail-fast cleanup through:

1. Event-Driven Detection
Hook into existing Netty channel lifecycle to detect TM disconnections in real-time, rather than discovering orphaned transactions only during timeout sweeps.

2. Conservative Matching Strategy
Two-level identification prevents false positives:

  • Primary: Transaction Service Group (VGroup) matching—Seata's standard logical grouping that maps services to TC clusters
  • Secondary: Application ID validation when available provides additional confidence

Only transactions in "Begin" status are candidates—transactions already in terminal or transitional states (Committing, Rollbacking, etc.) are never touched.

3. Opt-In Safety Model
Disabled by default (server.enableRollbackWhenDisconnect=false) to ensure zero impact on existing deployments. Users must explicitly enable after validating their transaction patterns.

4. Standard Rollback Semantics
Reuses the existing timeout rollback path, transitioning transactions to "TimeoutRollbacking" status before invoking standard cleanup logic. This ensures consistency with normal failure handling and preserves audit trails.

Impact

What Changes for Users

No Breaking Changes

  • Existing behavior is preserved when disabled (default state)
  • No client-side changes required
  • No protocol or wire format changes
  • Can be safely deployed without configuration changes

New Capability When Enabled
Users gain proactive failure recovery:

  • Resource Reclamation: Locks released in seconds instead of waiting for 30+ second timeouts
  • Fault Isolation: TM failures don't cascade through lock contention to other services
  • Operational Visibility: Explicit logging of early rollbacks enables monitoring and alerting on TM health

Configuration

# Server-side only, no client changes needed
server.enableRollbackWhenDisconnect=false  # Default: disabled

Closes #4422

…onnects

- TMDisconnectHandler interface for handling TM disconnect events
- DefaultTMDisconnectHandler implementation with VGroup-based matching
- Configuration option: server.enableRollbackWhenDisconnect (default: false)
- Integration with AbstractNettyRemotingServer for TM disconnect detection
- Comprehensive test coverage with unit and integration tests
@kssumin kssumin force-pushed the feature/early-rollback-tm-disconnect-4422 branch from c599475 to f772c72 Compare September 29, 2025 08:51
@YongGoose YongGoose requested review from YongGoose and Copilot October 9, 2025 06:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements early rollback of global transactions when Transaction Manager (TM) disconnects to improve system performance by reducing resource lock duration from 60 seconds to <1 second.

Key Changes:

  • Added TM disconnect detection and handling infrastructure
  • Implemented VGroup-based transaction matching for early rollback
  • Added configuration toggle for the feature (disabled by default)

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
TMDisconnectHandler.java Interface for handling TM disconnect events
DefaultTMDisconnectHandler.java Implementation with VGroup/ApplicationId matching logic
AbstractNettyRemotingServer.java Integration of TM disconnect detection in server
DefaultCoordinator.java Initialization of TM disconnect handler
ConfigurationKeys.java & DefaultValues.java Configuration constants for feature toggle
Test files Comprehensive unit and integration tests

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.


} catch (TransactionException e) {
LOGGER.error(
"Failed to rollback transaction [{}] {} {}",
Copy link

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message format is unclear with three consecutive placeholders without descriptive text. Consider a more descriptive format like 'Failed to rollback transaction [{}]: code={}, message={}'

Suggested change
"Failed to rollback transaction [{}] {} {}",
"Failed to rollback transaction [{}]: code={}, message={}",

Copilot uses AI. Check for mistakes.
*
* @return the session manager
*/
public SessionManager getSessionManager() {
Copy link

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is exposed solely for testing but exists in production code. Consider using package-private visibility instead of public, or use dependency injection to make the code more testable.

Suggested change
public SessionManager getSessionManager() {
SessionManager getSessionManager() {

Copilot uses AI. Check for mistakes.
@YongGoose YongGoose requested a review from funky-eyes October 9, 2025 06:10
@YongGoose YongGoose linked an issue Oct 9, 2025 that may be closed by this pull request
@YongGoose
Copy link
Member

@kssumin

Where did you handle the channelInactive event?

@YvCeung YvCeung added TM Relate to seata tm module/server server module module/core core module labels Oct 16, 2025
@kssumin kssumin closed this Oct 18, 2025
@kssumin kssumin reopened this Oct 18, 2025
@codecov
Copy link

codecov bot commented Oct 18, 2025

Codecov Report

❌ Patch coverage is 84.48276% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.76%. Comparing base (d1ae495) to head (f772c72).
⚠️ Report is 23 commits behind head on 2.x.

Files with missing lines Patch % Lines
...ta/core/rpc/netty/AbstractNettyRemotingServer.java 28.57% 5 Missing ⚠️
...server/coordinator/DefaultTMDisconnectHandler.java 91.48% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                2.x    #7674      +/-   ##
============================================
+ Coverage     61.05%   61.76%   +0.70%     
- Complexity      670      684      +14     
============================================
  Files          1316     1325       +9     
  Lines         49804    50104     +300     
  Branches       5855     5917      +62     
============================================
+ Hits          30407    30945     +538     
+ Misses        16689    16373     -316     
- Partials       2708     2786      +78     
Files with missing lines Coverage Δ
...ava/org/apache/seata/common/ConfigurationKeys.java 0.00% <ø> (ø)
...in/java/org/apache/seata/common/DefaultValues.java 100.00% <ø> (ø)
...e/seata/server/coordinator/DefaultCoordinator.java 49.64% <100.00%> (+0.48%) ⬆️
...server/coordinator/DefaultTMDisconnectHandler.java 91.48% <91.48%> (ø)
...ta/core/rpc/netty/AbstractNettyRemotingServer.java 14.63% <28.57%> (+0.84%) ⬆️

... and 85 files with indirect coverage changes

Impacted file tree graph

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@kssumin
Copy link
Author

kssumin commented Nov 14, 2025

@kssumin

Where did you handle the channelInactive event?

@YongGoose
The channelInactive event is handled in the existing handleDisconnect() method. Location

Flow:

  • Netty fires channelInactive → calls existing handleDisconnect() → we added TM check → calls our handler

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +125 to +132
/**
* Get the session manager instance. Made public for testing purposes.
*
* @return the session manager
*/
public SessionManager getSessionManager() {
return SessionHolder.getRootSessionManager();
}
Copy link

Copilot AI Nov 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getSessionManager() method appears to be unused. The comment states it's "Made public for testing purposes," but it's not used in any test files (DefaultTMDisconnectHandlerTest.java or TMDisconnectIntegrationTest.java). The method simply delegates to SessionHolder.getRootSessionManager(), which tests can call directly. Consider removing this method to reduce unnecessary public API surface.

Suggested change
/**
* Get the session manager instance. Made public for testing purposes.
*
* @return the session manager
*/
public SessionManager getSessionManager() {
return SessionHolder.getRootSessionManager();
}

Copilot uses AI. Check for mistakes.
Copy link
Member

@YongGoose YongGoose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module/core core module module/server server module TM Relate to seata tm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rollback of global transactions ahead of time

4 participants