-
Notifications
You must be signed in to change notification settings - Fork 553
GEP 3388 Retry Budget API Design #3573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
78667d6
c737a82
db8bded
c38b48d
797c731
5a76d18
c32298c
04c4b7b
dad86c3
7d9ef6e
1e2c618
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
# GEP-3388: Retry Budgets | ||
|
||
* Issue: [#3388](https://github.com/kubernetes-sigs/gateway-api/issues/3388) | ||
* Status: Provisional | ||
* Status: Implementable | ||
|
||
(See status definitions [here](/geps/overview/#gep-states).) | ||
|
||
|
@@ -29,7 +29,7 @@ Multiple data plane proxies offer optional configuration for budgeted retries, i | |
|
||
Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget" reduces the risk of a high number of retries across clients. It allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures. | ||
|
||
While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations creates a challenge for a consensus on the correct location for the configuration. This proposal aims to determine where retry budget's should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. | ||
While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations creates a challenge for a consensus on the correct location for the configuration. This proposal aims to determine where retry budgets should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. | ||
|
||
### Background on implementations | ||
|
||
|
@@ -49,7 +49,9 @@ By default, Envoy uses a static threshold for retries. But when configured, Envo | |
|
||
The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. In practice, this functions similarly to Envoy's retry budget implementation, as it is configured in a single location and measures the ratio of retry requests to original requests across all traffic destined for the service. | ||
|
||
Linkerd uses [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/) as the default configuration to specify retries to a service, but - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - supports counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload. | ||
(Note that budgeted retries have become less commonly used since Linkerd added support for counted retries in [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5): ServiceProfile operates at the level of a backend workload, meaning that it cannot configure anything at the level of a route, but counted retries can be configured using annotations on Service, HTTPRoute, and GRPCRoute.) | ||
|
||
For both counted retries and budgeted retries, the actual retry logic is implemented by the `linkerd2-proxy` making the request on behalf on an application workload. The receiving proxy is not aware of the retry configuration at all. | ||
|
||
Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. The `retryBudget` field of the ServiceProfile spec can be configured with the following optional parameters: | ||
|
||
|
@@ -81,11 +83,144 @@ The implementation of a version of Linkerd's `ttl` parameter within Envoy might | |
|
||
### Go | ||
|
||
TODO | ||
```golang | ||
type BackendTrafficPolicy struct { | ||
// BackendTrafficPolicy defines the configuration for how traffic to a target backend should be handled. | ||
// | ||
// Support: Extended | ||
// | ||
// +optional | ||
// <gateway:experimental> | ||
// | ||
// Note: there is no Override or Default policy configuration. | ||
|
||
metav1.TypeMeta `json:",inline"` | ||
metav1.ObjectMeta `json:"metadata,omitempty"` | ||
|
||
// Spec defines the desired state of BackendTrafficPolicy. | ||
Spec BackendTrafficPolicySpec `json:"spec"` | ||
|
||
// Status defines the current state of BackendTrafficPolicy. | ||
Status PolicyStatus `json:"status,omitempty"` | ||
} | ||
|
||
type BackendTrafficPolicySpec struct { | ||
// TargetRef identifies an API object to apply policy to. | ||
// Currently, Backends (i.e. Service, ServiceImport, or any | ||
// implementation-specific backendRef) are the only valid API | ||
// target references. | ||
// +listType=map | ||
// +listMapKey=group | ||
// +listMapKey=kind | ||
// +listMapKey=name | ||
// +kubebuilder:validation:MinItems=1 | ||
// +kubebuilder:validation:MaxItems=16 | ||
TargetRefs []LocalPolicyTargetReference `json:"targetRefs"` | ||
|
||
// Retry defines the configuration for when to retry a request to a target backend. | ||
// | ||
// Implementations SHOULD retry on connection errors (disconnect, reset, timeout, | ||
// TCP failure) if a retry stanza is configured. | ||
Comment on lines
+122
to
+123
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We'll likely need to add some guidance in the spec for how this overlaps with the other retry config already in the API. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh this might've just been an inadvertent copy-paste, was talking with @kflynn earlier about how we should add guidance on interaction with HTTPRoute retry clause - my initial thought is a retry budget is a constraint, but wouldn't on its own imply requests should be retried - thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed, as a retry budget implies an overall constraint, and additionally application developers will decide which routes should require retries based on idempotency of requests, etc. |
||
// | ||
// Support: Extended | ||
// | ||
// +optional | ||
// <gateway:experimental> | ||
Retry *CommonRetryPolicy `json:"retry,omitempty"` | ||
|
||
// SessionPersistence defines and configures session persistence | ||
// for the backend. | ||
// | ||
// Support: Extended | ||
// | ||
// +optional | ||
SessionPersistence *SessionPersistence `json:"sessionPersistence,omitempty"` | ||
} | ||
|
||
// CommonRetryPolicy defines the configuration for when to retry a request. | ||
// | ||
type CommonRetryPolicy struct { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the minimum viable set of fields here for an implementation to say that they support retry budgets? |
||
// Support: Extended | ||
// | ||
// +optional | ||
BudgetPercent *Int `json:"budgetPercent,omitempty"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can save this for the follow up, but we'll need some more comprehensive validation here, at least a min and max |
||
|
||
// Support: Extended | ||
// | ||
// +optional | ||
BudgetInterval *Duration `json:"budgetInterval,omitempty"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since this would be the equivalent to Linkerd's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, my initial interpretation was that cc @kflynn who may be able to help clarify There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, ( |
||
|
||
// Support: Extended | ||
// | ||
// +optional | ||
MinRetryRate *RequestRate `json:"minRetryRate,omitempty"` | ||
} | ||
|
||
// RequestRate expresses a rate of requests over a given period of time. | ||
// | ||
type RequestRate struct { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are both of the fields in here meant to be optional? Surely at least one of them needs to be set? Can they both be set at the same time? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, both would likely be set at the same time. I'm not sure if there's an obvious enough default for either field on its own, so maybe both should be required, and what should actually be optional and could have a reasonable default is just the |
||
// Support: Extended | ||
// | ||
// +optional | ||
Count *Int `json:"count,omitempty"` | ||
|
||
// Support: Extended | ||
// | ||
// +optional | ||
Interval *Duration `json:"interval,omitempty"` | ||
} | ||
|
||
// Duration is a string value representing a duration in time. The format is | ||
// as specified in GEP-2257, a strict subset of the syntax parsed by Golang | ||
// time.ParseDuration. | ||
// | ||
// +kubebuilder:validation:Pattern=`^([0-9]{1,5}(h|m|s|ms)){1,4}$` | ||
type Duration string | ||
|
||
### YAML | ||
|
||
TODO | ||
```yaml | ||
apiVersion: gateway.networking.x-k8s.io/v1alpha1 | ||
kind: BackendTrafficPolicy | ||
metadata: | ||
name: traffic-policy-example | ||
spec: | ||
targetRefs: | ||
- group: "" | ||
kind: Service | ||
name: foo | ||
retry: | ||
budgetPercent: 20 | ||
budgetInterval: 10s | ||
minRetryRate: | ||
count: 3 | ||
interval: 1s | ||
sessionPersistence: | ||
... | ||
status: | ||
ancestors: | ||
- ancestorRef: | ||
kind: Mesh | ||
namespace: istio-system | ||
name: istio | ||
controllerName: "istio.io/mesh-controller" | ||
conditions: | ||
- type: "Accepted" | ||
status: "False" | ||
reason: "Invalid" | ||
message: "BackendTrafficPolicy field sessionPersistence is not supported for Istio mesh traffic." | ||
- ancestorRef: | ||
kind: Gateway | ||
namespace: foo-ns | ||
name: foo-ingress | ||
controllerName: "istio.io/mesh-controller" | ||
conditions: | ||
- type: "Accepted" | ||
status: "False" | ||
reason: "Invalid" | ||
message: "BackendTrafficPolicy fields retry.budgetPercentage, retry.budgetInterval and retry.minRetryRate are not supported for Istio ingress gateways." | ||
... | ||
``` | ||
|
||
## Conformance Details | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strongly prefer this option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+100