gep: add GEP-3388 HTTP Retry Budget #3488
Merged: k8s-ci-robot merged 13 commits into kubernetes-sigs:main from ericdbishop:gep-3388-retry-budget on Jan 28, 2025
Commits:
2241fe5 gep: add GEP-3388 HTTPRoute Retry Budget (ericdbishop)
e27468c Update metadata for gep-1731 & gep-3388 (ericdbishop)
b9df3f4 Correcting information and readability (ericdbishop)
538bb61 Improve background information and context; remove unused sub-sections (ericdbishop)
a5eaa93 Mark some sections as TODO (ericdbishop)
979c8c9 add mkdocs (ericdbishop)
d7353c8 formatting (ericdbishop)
c18895c Minor improvements (ericdbishop)
80474d9 Minor improvements (ericdbishop)
04fefd8 Detail comparison between Policy Attachment & HTTPRoute configuration… (ericdbishop)
39b4b0f Better compare retry budget options between Linkerd and Envoy impleme… (ericdbishop)
1877e2f Apply suggestions from code review (ericdbishop)
9e01592 Remove consideration to express budget as a Fraction (ericdbishop)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
# GEP-3388: HTTPRoute Retry Budget | ||
|
||
|
||
* Issue: [#3388](https://github.com/kubernetes-sigs/gateway-api/issues/3388) | ||
* Status: Provisional | ||
|
||
(See status definitions [here](/geps/overview/#gep-states).) | ||
|
||
## TLDR | ||
|
||
To allow configuration of a "retry budget" in HTTPRoute that determines when additional client-side retries should be prevented, by limiting the percentage of the active request load that may consist of retries across all endpoints of a destination service. | ||
|
||
## Goals | ||
|
||
* To allow specification of a retry ["budget"](https://finagle.github.io/blog/2016/02/08/retry-budgets/) that determines whether a request should be retried, and to define how it shares configuration or interacts with the static retry limit within HTTPRoute. | ||
* To allow specification of a percentage of active requests, or recently active requests, that should be able to be retried concurrently. | ||
* To allow specification of a *minimum* number of retries that should be allowed per second or concurrently, such that the budget for retries never goes below this minimum value. | ||
* To define a standard for retry budgets that reconciles the known differences in current retry budget functionality between Gateway API data plane implementations. | ||
|
||
## Non-Goals | ||
|
||
* To allow specifying a default retry budget policy across a namespace or attached to a specific gateway. | ||
* To allow configuration of a back-off strategy or timeout window within the retry budget spec. | ||
* To allow specifying inclusion of specific HTTP status codes and responses within the retry budget spec. | ||
* To allow specification of more than one retry budget for a given service, for specific subsets of its traffic. | ||
|
||
|
||
## Introduction | ||
|
||
Multiple data plane proxies offer optional configuration for budgeted retries, which creates a dynamic limit on the share of a service's active request load that is being retried across its clients. In the case of Linkerd, retry budgets are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/). | ||
|
||
|
||
Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. A dynamic "retry budget" lets an application developer (Ana) reduce the risk of a high number of retries across clients, so that a service performs as expected under both high and low request load, and during periods of both intermittent and consistent failure. | ||
|
||
|
||
While HTTPRoute retry budget configuration has been a frequently discussed feature within the community, differences in semantics between data plane proxies make it challenging to reach consensus on the correct location for the configuration. | ||
|
||
Envoy, for example, offers retry budgets as a configurable circuit breaker threshold for concurrent retries to an upstream cluster, as an alternative to configuring a static active retry threshold. In Istio, Envoy circuit breaker thresholds are typically configured [within the DestinationRule CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), which applies rules to clients of a service after routing has already occurred. The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. This proposal aims to determine where retry budgets should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. | ||
|
||
### Background on implementations | ||
|
||
#### Envoy | ||
|
||
Envoy supports configuring a [RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) CircuitBreaker threshold across a group of upstream endpoints, with the following parameters: | ||
|
||
|
||
* `budget_percent` Specifies the limit on concurrent retries as a percentage of the sum of active requests and active pending requests. For example, if there are 100 active requests and the budget_percent is set to 25, there may be 25 active retries. This parameter is optional. Defaults to 20%. | ||
* `min_retry_concurrency` Specifies the minimum retry concurrency allowed for the retry budget. The limit on the number of active retries may never go below this number. This parameter is optional. Defaults to 3. | ||
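
The interaction of the two parameters above can be sketched as a small calculation. This is only an illustrative reading of the parameter descriptions, not Envoy's code: the function name `envoy_retry_limit` is hypothetical, and rounding the percentage down to a whole number of retries is an assumption.

```python
import math


def envoy_retry_limit(active_requests: int,
                      pending_requests: int,
                      budget_percent: float = 20.0,
                      min_retry_concurrency: int = 3) -> int:
    """Hypothetical helper: how many retries may be active at once.

    The budget is a percentage of active plus pending requests, and it
    never drops below min_retry_concurrency. Flooring the percentage is
    an assumption made for this sketch.
    """
    in_flight = active_requests + pending_requests
    budgeted = math.floor(in_flight * budget_percent / 100.0)
    return max(budgeted, min_retry_concurrency)


# 100 active requests with budget_percent=25 allows 25 active retries.
print(envoy_retry_limit(100, 0, budget_percent=25))
# Under very low load the minimum concurrency takes over.
print(envoy_retry_limit(5, 0))
```

Under low load the floor set by `min_retry_concurrency` dominates, which is why a budget expressed purely as a percentage would otherwise allow almost no retries for a quiet service.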
|
||
#### linkerd2-proxy | ||
|
||
Linkerd supports [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/), the default way to specify retries to a service, and - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf of an application workload. | ||
|
||
Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. | ||
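
As a rough illustration of the budget described above, the allowance over one window can be modelled as a proportional share of recent requests plus a fixed per-second reserve. This is a simplified model and not how `linkerd2-proxy` tracks its budget internally: the function name is hypothetical, the parameter names are chosen for this sketch, and the window length is left as a required argument rather than assumed.

```python
import math


def linkerd_retry_allowance(requests_in_window: int,
                            ttl_seconds: int,
                            retry_ratio: float = 0.2,
                            min_retries_per_second: int = 10) -> int:
    """Hypothetical helper: retries permitted over one accounting window.

    Combines the proportional budget (20% of recent requests by default)
    with the "extra" retries per second (10 by default), accumulated over
    the window. Flooring the result is an assumption of this sketch.
    """
    proportional = requests_in_window * retry_ratio
    reserve = min_retries_per_second * ttl_seconds
    return math.floor(proportional + reserve)


# With a 10-second window and 1000 recent requests:
# 20% of 1000 plus 10 extra retries for each of the 10 seconds.
print(linkerd_retry_allowance(1000, ttl_seconds=10))
```

Unlike the Envoy threshold, which bounds concurrent retries, this model bounds retries relative to recent request volume over time, which is one of the semantic differences the GEP needs to reconcile.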
|
||
## API | ||
|
||
### Go | ||
|
||
TODO | ||
|
||
### YAML | ||
|
||
TODO | ||
|
||
## Conformance Details | ||
|
||
TODO | ||
|
||
## Alternatives | ||
|
||
### Policy Attachment | ||
|
||
TODO | ||
|
||
## Other considerations | ||
|
||
TODO | ||
|
||
|
||
### What accommodations are needed for retry budget support? | ||
|
||
Changing the retry stanza to a Kubernetes "tagged union" pattern with something like `mode: "budget"` to support mutually-exclusive distinct sibling fields is possible as a non-breaking change if omitting the `mode` field defaults to the currently proposed behavior (which could retroactively become something like `mode: count`). | ||
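
The defaulting behavior described above can be sketched as a validation step. Everything here is hypothetical, since the API section of this GEP is still TODO: the `budget` field name, the function name, and the exact validation rules are assumptions, and `attempts` is used only as an example of a count-mode field.

```python
def resolve_retry_mode(retry: dict) -> str:
    """Hypothetical helper: pick the active member of a tagged-union retry stanza.

    Omitting `mode` falls back to "count", so stanzas written before the
    field existed keep their current meaning (the non-breaking property).
    """
    mode = retry.get("mode", "count")
    if mode not in ("count", "budget"):
        raise ValueError(f"unknown retry mode: {mode!r}")
    # Sibling fields are mutually exclusive: only the field that matches
    # the selected mode may be set.
    if mode == "count" and "budget" in retry:
        raise ValueError("`budget` requires mode: budget")
    if mode == "budget" and "attempts" in retry:
        raise ValueError("`attempts` requires mode: count")
    return mode


# An existing stanza with no `mode` field still resolves to count mode.
print(resolve_retry_mode({"attempts": 3}))
# A stanza that opts in to the budget member explicitly.
print(resolve_retry_mode({"mode": "budget", "budget": {"percent": 20}}))
```

In a real API this check would more likely live in CRD validation rules than in application code; the sketch only shows why the default keeps the change backward compatible.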
|
||
## References | ||
|
||
* <https://gateway-api.sigs.k8s.io/geps/gep-1731/> | ||
* <https://linkerd.io/2019/02/22/how-we-designed-retries-in-linkerd-2-2/> | ||
* <https://linkerd.io/2.11/tasks/configuring-retries/> | ||
* <https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#config-cluster-v3-circuitbreakers-thresholds-retrybudget> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
apiVersion: internal.gateway.networking.k8s.io/v1alpha1 | ||
kind: GEPDetails | ||
number: 3388 | ||
name: HTTPRoute Retry Budget | ||
status: Provisional | ||
# Any authors who contribute to the GEP in any way should be listed here using | ||
# their Github handle. | ||
authors: | ||
- ericdbishop | ||
- mikemorris | ||
relationships: | ||
# obsoletes indicates that a GEP makes the linked GEP obsolete, and completely | ||
# replaces that GEP. The obsoleted GEP MUST have its obsoletedBy field | ||
# set back to this GEP, and MUST be moved to Declined. | ||
obsoletes: {} | ||
obsoletedBy: {} | ||
# extends indicates that a GEP extends the linked GEP, adding more detail | ||
# or additional implementation. The extended GEP MUST have its extendedBy | ||
# field set back to this GEP. | ||
extends: | ||
- number: 1731 | ||
name: HTTPRoute Retries | ||
extendedBy: {} | ||
# seeAlso indicates other GEPs that are relevant in some way without being | ||
# covered by an existing relationship. | ||
seeAlso: {} | ||
# references is a list of hyperlinks to relevant external references. | ||
# It's intended to be used for storing Github discussions, Google docs, etc. | ||
references: {} | ||
# featureNames is a list of the feature names introduced by the GEP, if there | ||
# are any. This will allow us to track which feature was introduced by which GEP. | ||
featureNames: {} | ||
# changelog is a list of hyperlinks to PRs that make changes to the GEP, in | ||
# ascending date order. | ||
changelog: {} |