Skip to content

Conversation

@zhijun42
Copy link
Contributor

Follow-up to the previous reproduction PR #21050. Refer there for full context.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhijun42
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link

Hi @zhijun42. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Copy link
Member

/ok-to-test

@serathius
Copy link
Member

Have you read the original motivation for returning NoLeader? #7630 #7275

Could we add a test for WithRequireLeader?

Maybe we could ask @heyitsanthony @xiang90

@codecov
Copy link

codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.39%. Comparing base (c492848) to head (d899fa7).

Files with missing lines Patch % Lines
server/etcdserver/v3_server.go 81.81% 2 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
server/etcdserver/api/v3rpc/lease.go 76.62% <ø> (-5.66%) ⬇️
server/etcdserver/api/v3rpc/util.go 67.74% <ø> (ø)
server/lease/leasehttp/http.go 62.16% <100.00%> (ø)
server/etcdserver/v3_server.go 75.66% <81.81%> (+0.12%) ⬆️

... and 21 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21122      +/-   ##
==========================================
- Coverage   68.40%   68.39%   -0.02%     
==========================================
  Files         429      429              
  Lines       35242    35248       +6     
==========================================
  Hits        24109    24109              
- Misses       9737     9746       +9     
+ Partials     1396     1393       -3     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c492848...d899fa7. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zhijun42
Copy link
Contributor Author

@serathius I think this is the story: In the original PR, when the server stream is canceled due to no leader, the author converts the error code from Canceled to NoLeader because he incorrectly assumed that the error code from stream is Canceled, which is in fact not true - it's already NoLeader.

We can prove this by commenting out these lines on main branch and will see that relevant tests still pass.

func (ls *LeaseServer) LeaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) (err error) {
	errc := make(chan error, 1)
	go func() {
	    errc <- ls.leaseKeepAlive(stream)
	}()
	select {
	case err = <-errc:
	case <-stream.Context().Done():
        err = stream.Context().Err()
        //if errors.Is(err, context.Canceled) {
        //	err = rpctypes.ErrGRPCNoLeader
        //}
	}
	return err
}

As for WithRequireLeader, it's already covered by tests TestLeaseWithRequireLeader and TestV3LeaseRequireLeader in that original PR.

But yeah you make a good point ,so I do add a new test case almost identical to TestV3LeaseRequireLeader (except that I don't use WithRequireLeader) to showcase that we will always receive NoLeader error regardless of WithRequireLeader.

@zhijun42
Copy link
Contributor Author

On a separate yet related note, I find an opportunity for improvement. If a lease client sets this withRequireLeader and closes its receive channel due to no leader, it should also stop sending more keepAlive requests to the server (it wouldn't have a channel to receive response anyway). Opened a new PR #21126 for this.

}, 3*time.Second, 100*time.Millisecond)
})

// This is almost identical to TestV3LeaseRequireLeader, except that it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a comment saying those tests are identical, could you just migrate TestV3LeaseRequireLeader to be a test case here? Having it as separate PR will help showing the current behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Created a separate PR #21127

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased. Now that my test function has 5 test cases and only 2 of them require gofail. I refactor a bit to move the make gofail-enable check from function level into test case level.

@zhijun42
Copy link
Contributor Author

Note about grpcproxy behavior: During testing, I found that the grpcproxy doesn't propagate NoLeader errors when WithRequireLeader is not used. The keepAliveLoop in server/proxy/grpcproxy/lease.go (lines 263-265) discards errors and only calls cancel(), resulting in context.Canceled instead of the actual server error.

Not sure if this is intentional design or an actual bug, but since this code has been stable since 2017, I decided not to touch it. Instead, I modified my test cases to handle such difference.

if len(gofail.List()) == 0 {
t.Skip("please run 'make gofail-enable' before running the test")
}
integration.GofailSkipCheck(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move such refactor to separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to PR #21136

Comment on lines +100 to +104
// We end up here due to:
// 1. Client cancellation
// 2. Server cancellation: the client ctx is wrapped with WithRequireLeader,
// monitorLeader() detects no leader and thus cancels this stream with ErrGRPCNoLeader.
// 3. Server cancellation: the server is shutting down.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a useful comment. Most other are not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will keep raising my standard for comments going forward!

Comment on lines 391 to 392
// This is best-effort until the context times out. For now all
// error codes are considered retryable except ErrLeaseNotFound.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it and other bad comments.

@zhijun42 zhijun42 force-pushed the fix-lease-keep-alive-unavailable branch from c9f280e to d899fa7 Compare January 16, 2026 00:33
@zhijun42
Copy link
Contributor Author

Rebased. Great idea moving the SkipGoFail and additional test case into separate PRs!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants