Skip to content

GEP 3388 Retry Budget API Design #3573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
145 changes: 140 additions & 5 deletions geps/gep-3388/index.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# GEP-3388: Retry Budgets

* Issue: [#3388](https://github.com/kubernetes-sigs/gateway-api/issues/3388)
* Status: Provisional
* Status: Implementable

(See status definitions [here](/geps/overview/#gep-states).)

Expand Down Expand Up @@ -29,7 +29,7 @@ Multiple data plane proxies offer optional configuration for budgeted retries, i

Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget" reduces the risk of a high number of retries across clients. It allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures.

While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations creates a challenge for a consensus on the correct location for the configuration. This proposal aims to determine where retry budget's should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification.
While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations creates a challenge for a consensus on the correct location for the configuration. This proposal aims to determine where retry budgets should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification.

### Background on implementations

Expand All @@ -49,7 +49,9 @@ By default, Envoy uses a static threshold for retries. But when configured, Envo

The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. In practice, this functions similarly to Envoy's retry budget implementation, as it is configured in a single location and measures the ratio of retry requests to original requests across all traffic destined for the service.

Linkerd uses [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/) as the default configuration to specify retries to a service, but - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - supports counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload.
(Note that budgeted retries have become less commonly used since Linkerd added support for counted retries in [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5): ServiceProfile operates at the level of a backend workload, meaning that it cannot configure anything at the level of a route, but counted retries can be configured using annotations on Service, HTTPRoute, and GRPCRoute.)

For both counted retries and budgeted retries, the actual retry logic is implemented by the `linkerd2-proxy` making the request on behalf on an application workload. The receiving proxy is not aware of the retry configuration at all.

Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. The `retryBudget` field of the ServiceProfile spec can be configured with the following optional parameters:

Expand Down Expand Up @@ -81,11 +83,144 @@ The implementation of a version of Linkerd's `ttl` parameter within Envoy might

### Go

TODO
```golang
type BackendTrafficPolicy struct {
// BackendTrafficPolicy defines the configuration for how traffic to a target backend should be handled.
//
// Support: Extended
//
// +optional
// <gateway:experimental>
//
// Note: there is no Override or Default policy configuration.

metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

// Spec defines the desired state of BackendTrafficPolicy.
Spec BackendTrafficPolicySpec `json:"spec"`

// Status defines the current state of BackendTrafficPolicy.
Status PolicyStatus `json:"status,omitempty"`
}

type BackendTrafficPolicySpec struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strongly prefer this option

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+100

// TargetRef identifies an API object to apply policy to.
// Currently, Backends (i.e. Service, ServiceImport, or any
// implementation-specific backendRef) are the only valid API
// target references.
// +listType=map
// +listMapKey=group
// +listMapKey=kind
// +listMapKey=name
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=16
TargetRefs []LocalPolicyTargetReference `json:"targetRefs"`

// Retry defines the configuration for when to retry a request to a target backend.
//
// Implementations SHOULD retry on connection errors (disconnect, reset, timeout,
// TCP failure) if a retry stanza is configured.
Comment on lines +122 to +123
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll likely need to add some guidance in the spec for how this overlaps with the other retry config already in the API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this might've just been an inadvertent copy-paste, was talking with @kflynn earlier about how we should add guidance on interaction with HTTPRoute retry clause - my initial thought is a retry budget is a constraint, but wouldn't on its own imply requests should be retried - thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, as a retry budget implies an overall constraint, and additionally application developers will decide which routes should require retries based on idempotency of requests, etc.

//
// Support: Extended
//
// +optional
// <gateway:experimental>
Retry *CommonRetryPolicy `json:"retry,omitempty"`

// SessionPersistence defines and configures session persistence
// for the backend.
//
// Support: Extended
//
// +optional
SessionPersistence *SessionPersistence `json:"sessionPersistence,omitempty"`
}

// CommonRetryPolicy defines the configuration for when to retry a request.
//
type CommonRetryPolicy struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the minimum viable set of fields here for an implementation to say that they support retry budgets?

// Support: Extended
//
// +optional
BudgetPercent *Int `json:"budgetPercent,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can save this for the follow up, but we'll need some more comprehensive validation here, at least a min and max


// Support: Extended
//
// +optional
BudgetInterval *Duration `json:"budgetInterval,omitempty"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this would be the equivalent to Linkerd's ttl parameter, I'm just wondering if calling it BudgetInterval would imply something slightly different. It might sound like we are sampling the number of total active requests on a set "interval", as opposed to that active requests have a specific "ttl" before they are no longer considered in the retry budget calculation. Maybe this distinction will be less important with the comments fleshed out. Thoughts @mikemorris?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, my initial interpretation was that ttl was actually a bit of a misnomer - "time to live" is typically used for expiration times in caches, certificates, etc, whereas how I (possibly mistakenly) understood this to be used is as the divisor in a "temporal rate" to specify an interval or period of time over which a count is measured.

cc @kflynn who may be able to help clarify

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, ttl is actually specifying the window over which Linkerd calculates the retry rate. I'm not sure why it was ttl, I'd probably call it retryWindow or something.

(timeForDoingMath is tempting, but probably less ideal. 😂)


// Support: Extended
//
// +optional
MinRetryRate *RequestRate `json:"minRetryRate,omitempty"`
}

// RequestRate expresses a rate of requests over a given period of time.
//
type RequestRate struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are both of the fields in here meant to be optional? Surely at least one of them needs to be set? Can they both be set at the same time?

Copy link
Contributor

@mikemorris mikemorris Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both would likely be set at the same time. I'm not sure if there's an obvious enough default for either field on its own, so maybe both should be required, and what should actually be optional and could have a reasonable default is just the MinRetryRate field of this type in CommonRetryPolicy?

// Support: Extended
//
// +optional
Count *Int `json:"count,omitempty"`

// Support: Extended
//
// +optional
Interval *Duration `json:"interval,omitempty"`
}

// Duration is a string value representing a duration in time. The format is
// as specified in GEP-2257, a strict subset of the syntax parsed by Golang
// time.ParseDuration.
//
// +kubebuilder:validation:Pattern=`^([0-9]{1,5}(h|m|s|ms)){1,4}$`
type Duration string

### YAML

TODO
```yaml
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: traffic-policy-example
spec:
targetRefs:
- group: ""
kind: Service
name: foo
retry:
budgetPercent: 20
budgetInterval: 10s
minRetryRate:
count: 3
interval: 1s
sessionPersistence:
...
status:
ancestors:
- ancestorRef:
kind: Mesh
namespace: istio-system
name: istio
controllerName: "istio.io/mesh-controller"
conditions:
- type: "Accepted"
status: "False"
reason: "Invalid"
message: "BackendTrafficPolicy field sessionPersistence is not supported for Istio mesh traffic."
- ancestorRef:
kind: Gateway
namespace: foo-ns
name: foo-ingress
controllerName: "istio.io/mesh-controller"
conditions:
- type: "Accepted"
status: "False"
reason: "Invalid"
message: "BackendTrafficPolicy fields retry.budgetPercentage, retry.budgetInterval and retry.minRetryRate are not supported for Istio ingress gateways."
...
```

## Conformance Details

Expand Down
2 changes: 1 addition & 1 deletion geps/gep-3388/metadata.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ apiVersion: internal.gateway.networking.k8s.io/v1alpha1
kind: GEPDetails
number: 3388
name: Retry Budgets
status: Provisional
status: Implementable
# Any authors who contribute to the GEP in any way should be listed here using
# their Github handle.
authors:
Expand Down