Commit b86c8ea

ericdbishop, mikemorris, LiorLieberman, and kflynn authored
GEP 3388 Retry Budget API Design (#3573)
* GEP-3388 HTTP Retry Budget API Design
* add first draft of YAML for retry budgets
* add Go structs for retry budget API design proposal
* Apply suggestions from code review

  Co-authored-by: Mike Morris <[email protected]>
  Co-authored-by: Lior Lieberman <[email protected]>
* remove HTTPRouteRetry from CommonRetryPolicy
* remove From scoping, deferred to future memorandum GEP as a potential common pattern for policy attachment
* remove standalone RetryPolicy examples in favor of BackendTrafficPolicy
* Update geps/gep-3388/index.md

  Co-authored-by: Flynn <[email protected]>
* Update geps/gep-3388/index.md

  Co-authored-by: Flynn <[email protected]>

---------

Co-authored-by: Mike Morris <[email protected]>
Co-authored-by: Mike Morris <[email protected]>
Co-authored-by: Lior Lieberman <[email protected]>
Co-authored-by: Flynn <[email protected]>
1 parent d15acf9 commit b86c8ea

2 files changed

+141
-6
lines changed

geps/gep-3388/index.md

Lines changed: 140 additions & 5 deletions
@@ -1,7 +1,7 @@
 # GEP-3388: Retry Budgets
 
 * Issue: [#3388](https://github.com/kubernetes-sigs/gateway-api/issues/3388)
-* Status: Provisional
+* Status: Implementable
 
 (See status definitions [here](/geps/overview/#gep-states).)
 
@@ -29,7 +29,7 @@ Multiple data plane proxies offer optional configuration for budgeted retries, i
 
 Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget" reduces the risk of a high number of retries across clients. It allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures.
 
-While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations creates a challenge for a consensus on the correct location for the configuration. This proposal aims to determine where retry budget's should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification.
+While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations create a challenge for a consensus on the correct location for the configuration. This proposal aims to determine where retry budgets should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification.
 
 ### Background on implementations
 
@@ -49,7 +49,9 @@ By default, Envoy uses a static threshold for retries. But when configured, Envo
 
 The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. In practice, this functions similarly to Envoy's retry budget implementation, as it is configured in a single location and measures the ratio of retry requests to original requests across all traffic destined for the service.
 
-Linkerd uses [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/) as the default configuration to specify retries to a service, but - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - supports counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload.
+(Note that budgeted retries have become less commonly used since Linkerd added support for counted retries in [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5): ServiceProfile operates at the level of a backend workload, meaning that it cannot configure anything at the level of a route, but counted retries can be configured using annotations on Service, HTTPRoute, and GRPCRoute.)
+
+For both counted retries and budgeted retries, the actual retry logic is implemented by the `linkerd2-proxy` making the request on behalf of an application workload. The receiving proxy is not aware of the retry configuration at all.
 
 Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. The `retryBudget` field of the ServiceProfile spec can be configured with the following optional parameters:
 
@@ -81,11 +83,144 @@ The implementation of a version of Linkerd's `ttl` parameter within Envoy might
 
 ### Go
 
-TODO
+```golang
+type BackendTrafficPolicy struct {
+    // BackendTrafficPolicy defines the configuration for how traffic to a target backend should be handled.
+    //
+    // Support: Extended
+    //
+    // +optional
+    // <gateway:experimental>
+    //
+    // Note: there is no Override or Default policy configuration.
+
+    metav1.TypeMeta   `json:",inline"`
+    metav1.ObjectMeta `json:"metadata,omitempty"`
+
+    // Spec defines the desired state of BackendTrafficPolicy.
+    Spec BackendTrafficPolicySpec `json:"spec"`
+
+    // Status defines the current state of BackendTrafficPolicy.
+    Status PolicyStatus `json:"status,omitempty"`
+}
+
+type BackendTrafficPolicySpec struct {
+    // TargetRefs identifies the API objects to apply policy to.
+    // Currently, Backends (i.e. Service, ServiceImport, or any
+    // implementation-specific backendRef) are the only valid API
+    // target references.
+    // +listType=map
+    // +listMapKey=group
+    // +listMapKey=kind
+    // +listMapKey=name
+    // +kubebuilder:validation:MinItems=1
+    // +kubebuilder:validation:MaxItems=16
+    TargetRefs []LocalPolicyTargetReference `json:"targetRefs"`
+
+    // Retry defines the configuration for when to retry a request to a target backend.
+    //
+    // Implementations SHOULD retry on connection errors (disconnect, reset, timeout,
+    // TCP failure) if a retry stanza is configured.
+    //
+    // Support: Extended
+    //
+    // +optional
+    // <gateway:experimental>
+    Retry *CommonRetryPolicy `json:"retry,omitempty"`
+
+    // SessionPersistence defines and configures session persistence
+    // for the backend.
+    //
+    // Support: Extended
+    //
+    // +optional
+    SessionPersistence *SessionPersistence `json:"sessionPersistence,omitempty"`
+}
+
+// CommonRetryPolicy defines the configuration for when to retry a request.
+//
+type CommonRetryPolicy struct {
+    // Support: Extended
+    //
+    // +optional
+    BudgetPercent *int `json:"budgetPercent,omitempty"`
+
+    // Support: Extended
+    //
+    // +optional
+    BudgetInterval *Duration `json:"budgetInterval,omitempty"`
+
+    // Support: Extended
+    //
+    // +optional
+    MinRetryRate *RequestRate `json:"minRetryRate,omitempty"`
+}
+
+// RequestRate expresses a rate of requests over a given period of time.
+//
+type RequestRate struct {
+    // Support: Extended
+    //
+    // +optional
+    Count *int `json:"count,omitempty"`
+
+    // Support: Extended
+    //
+    // +optional
+    Interval *Duration `json:"interval,omitempty"`
+}
+
+// Duration is a string value representing a duration in time. The format is
+// as specified in GEP-2257, a strict subset of the syntax parsed by Golang
+// time.ParseDuration.
+//
+// +kubebuilder:validation:Pattern=`^([0-9]{1,5}(h|m|s|ms)){1,4}$`
+type Duration string
 
 ### YAML
 
-TODO
+```yaml
+apiVersion: gateway.networking.x-k8s.io/v1alpha1
+kind: BackendTrafficPolicy
+metadata:
+  name: traffic-policy-example
+spec:
+  targetRefs:
+    - group: ""
+      kind: Service
+      name: foo
+  retry:
+    budgetPercent: 20
+    budgetInterval: 10s
+    minRetryRate:
+      count: 3
+      interval: 1s
+  sessionPersistence:
+    ...
+status:
+  ancestors:
+    - ancestorRef:
+        kind: Mesh
+        namespace: istio-system
+        name: istio
+      controllerName: "istio.io/mesh-controller"
+      conditions:
+        - type: "Accepted"
+          status: "False"
+          reason: "Invalid"
+          message: "BackendTrafficPolicy field sessionPersistence is not supported for Istio mesh traffic."
+    - ancestorRef:
+        kind: Gateway
+        namespace: foo-ns
+        name: foo-ingress
+      controllerName: "istio.io/mesh-controller"
+      conditions:
+        - type: "Accepted"
+          status: "False"
+          reason: "Invalid"
+          message: "BackendTrafficPolicy fields retry.budgetPercent, retry.budgetInterval and retry.minRetryRate are not supported for Istio ingress gateways."
+  ...
+```
 
 ## Conformance Details
 
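The validation pattern attached to the `Duration` type in the Go hunk above can be exercised directly with Go's standard `regexp` package. The sketch below is illustrative only: `isValidDuration` is a hypothetical helper introduced for this example and is not part of the Gateway API, and the sample inputs are chosen simply to show which strings the pattern accepts or rejects.

```go
package main

import (
	"fmt"
	"regexp"
)

// durationPattern is copied from the kubebuilder validation marker on the
// proposed Duration type.
var durationPattern = regexp.MustCompile(`^([0-9]{1,5}(h|m|s|ms)){1,4}$`)

// isValidDuration is a hypothetical helper that reports whether s satisfies
// the Duration pattern. It checks syntax only, not the magnitude of the value.
func isValidDuration(s string) bool {
	return durationPattern.MatchString(s)
}

func main() {
	for _, s := range []string{"10s", "1h30m", "500ms", "1.5s", "10", ""} {
		fmt.Printf("%q valid=%t\n", s, isValidDuration(s))
	}
}
```

Values such as `10s` and `1s` from the YAML example pass this check, while fractional values like `1.5s` and bare numbers without a unit do not.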
geps/gep-3388/metadata.yaml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ apiVersion: internal.gateway.networking.k8s.io/v1alpha1
 kind: GEPDetails
 number: 3388
 name: Retry Budgets
-status: Provisional
+status: Implementable
 # Any authors who contribute to the GEP in any way should be listed here using
 # their Github handle.
 authors:
