-
Notifications
You must be signed in to change notification settings - Fork 10.3k
lease: Fix incorrect gRPC Unavailable on client cancel during LeaseKeepAlive forwarding #21122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: zhijun42 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @zhijun42. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
430f664 to
c1b8ad3
Compare
|
/ok-to-test |
|
Have you read the original motivation for returning NoLeader? #7630 #7275 Could we add a test for Maybe we could ask @heyitsanthony @xiang90 |
Codecov Report❌ Patch coverage is
Additional details and impacted files
... and 21 files with indirect coverage changes @@ Coverage Diff @@
## main #21122 +/- ##
==========================================
- Coverage 68.40% 68.39% -0.02%
==========================================
Files 429 429
Lines 35242 35248 +6
==========================================
Hits 24109 24109
- Misses 9737 9746 +9
+ Partials 1396 1393 -3 Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
|
@serathius I think this is the story: In the original PR, when the server stream is canceled due to no leader, the author converts the error code from Canceled to NoLeader because he incorrectly assumed that the error code from stream is Canceled, which is in fact not true - it's already NoLeader. We can prove this by commenting out these lines on main branch and will see that relevant tests still pass. func (ls *LeaseServer) LeaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) (err error) {
errc := make(chan error, 1)
go func() {
errc <- ls.leaseKeepAlive(stream)
}()
select {
case err = <-errc:
case <-stream.Context().Done():
err = stream.Context().Err()
//if errors.Is(err, context.Canceled) {
// err = rpctypes.ErrGRPCNoLeader
//}
}
return err
}As for But yeah you make a good point ,so I do add a new test case almost identical to |
|
On a separate yet related note, I find an opportunity for improvement. If a lease client sets this |
tests/integration/v3_lease_test.go
Outdated
| }, 3*time.Second, 100*time.Millisecond) | ||
| }) | ||
|
|
||
| // This is almost identical to TestV3LeaseRequireLeader, except that it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having a comment saying those tests are identical, could you just migrate TestV3LeaseRequireLeader to be a test case here? Having it as separate PR will help showing the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! Created a separate PR #21127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebased. Now that my test function has 5 test cases and only 2 of them require gofail. I refactor a bit to move the make gofail-enable check from function level into test case level.
b74031b to
7ba114f
Compare
|
Note about grpcproxy behavior: During testing, I found that the grpcproxy doesn't propagate NoLeader errors when Not sure if this is intentional design or an actual bug, but since this code has been stable since 2017, I decided not to touch it. Instead, I modified my test cases to handle such difference. |
| if len(gofail.List()) == 0 { | ||
| t.Skip("please run 'make gofail-enable' before running the test") | ||
| } | ||
| integration.GofailSkipCheck(t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move such refactor to separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to PR #21136
| // We end up here due to: | ||
| // 1. Client cancellation | ||
| // 2. Server cancellation: the client ctx is wrapped with WithRequireLeader, | ||
| // monitorLeader() detects no leader and thus cancels this stream with ErrGRPCNoLeader. | ||
| // 3. Server cancellation: the server is shutting down. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a useful comment. Most other are not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will keep raising my standard for comments going forward!
server/etcdserver/v3_server.go
Outdated
| // This is best-effort until the context times out. For now all | ||
| // error codes are considered retryable except ErrLeaseNotFound. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it and other bad comments.
21dc2a9 to
c9f280e
Compare
Signed-off-by: Zhijun <[email protected]>
c9f280e to
d899fa7
Compare
|
Rebased. Great idea moving the SkipGoFail and additional test case into separate PRs! |
Follow-up to the previous reproduction PR #21050. Refer there for full context.