You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -23,10 +23,9 @@ To allow configuration of a "retry budget" across all endpoints of a destination
23
23
* To allow specifying inclusion of specific HTTP status codes and responses within the retry budget spec.
24
24
* To allow specification of more than one retry budget for a given service, or for specific subsets of its traffic.
25
25
26
-
27
26
## Introduction
28
27
29
-
Multiple data plane proxies offer optional configuration for budgeted retries, in order to create a dynamic limit on the amount of a service's active request that is being retried across its clients. In the case of Linkerd, retry budgets are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/).
28
+
Multiple data plane proxies offer optional configuration for budgeted retries, in order to create a dynamic limit on the amount of a service's active request load that is comprised of retries from across its clients. In the case of Linkerd, retry budgets are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/).
30
29
31
30
Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget" reduces the risk of a high number of retries across clients. It allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures.
32
31
@@ -76,7 +75,7 @@ Configuring a retry budget through a Policy Attachment may produce some confusio
76
75
77
76
Discrepancies in the semantics of retry budget behavior and configuration options between Envoy and Linkerd may require a change in either implementation to accommodate the Gateway API specification. While Envoy's `min_retry_concurrency` setting may behave similarly in practice to Linkerd's `minRetriesPerSecond`, they are not directly equivalent.
78
77
79
-
A version of Linkerd's `ttl` parameter may also need to be implemented within Envoy.
78
+
The implementation of a version of Linkerd's `ttl` parameter within Envoy might be a path towards reconciling the behavior of these implementations, as it could allow Envoy to express a `budget_percent` and minimum number of permissible retries over a period of time rather than by tracking active and pending connections. It is not currently clear which of these models is preferable, but being able to specify a budget as requests over a window of time seems like it might offer more predictable behavior.
80
79
81
80
## API
82
81
@@ -102,11 +101,14 @@ TODO
102
101
103
102
## Other considerations
104
103
105
-
TODO
104
+
* Is it worth allowing the budget to be expressed as a `Fraction` similar to `HTTPRequestMirrorFilter` as described in [GEP-3171](https://gateway-api.sigs.k8s.io/geps/gep-3171/), or is a percentage sufficient for this use case? (Expressing a sub-1% budget for retries seems less necessary than for mirroring or redirecting traffic at significant scale.)
105
+
* As there isn't anything inherently specific to HTTP requests in either known implementation, a retry budget policy on a target Service could likely be applicable to GRPCRoute as well as HTTPRoute requests.
106
+
* While retry budgets are commonly associated with service mesh uses cases to handle many distributed clients, a retry budget policy may also be desirable for north/south implementations of Gateway API to prioritize new inbound requests and minimize tail latency during periods of service instability.
0 commit comments