GEP 3388 Retry Budget API Design (#3573)

ericdbishop · mikemorris · LiorLieberman · web-flow · commit b86c8eae8ef6 · 2025-02-07T17:09:55.000-08:00
* GEP-3388 HTTP Retry Budget API Design

* add first draft of YAML for retry budgets

* add Go structs for retry budget API design proposal

* Apply suggestions from code review

Co-authored-by: Mike Morris <mikemorris@users.noreply.github.com>
Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>

* remove HTTPRouteRetry from CommonRetryPolicy

* remove From scoping, deferred to future memorandum GEP as a potential common pattern for policy attachment

* remove standalone RetryPolicy examples in favor of BackendTrafficPolicy

* Update geps/gep-3388/index.md

Co-authored-by: Flynn <kflynn@users.noreply.github.com>

* Update geps/gep-3388/index.md

Co-authored-by: Flynn <kflynn@users.noreply.github.com>

---------

Co-authored-by: Mike Morris <1149913+mikemorris@users.noreply.github.com>
Co-authored-by: Mike Morris <mikemorris@users.noreply.github.com>
Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>
Co-authored-by: Flynn <kflynn@users.noreply.github.com>
diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md
@@ -1,7 +1,7 @@
 # GEP-3388: Retry Budgets
 
 * Issue: [#3388](https://github.com/kubernetes-sigs/gateway-api/issues/3388)
-* Status: Provisional
+* Status: Implementable
 
 (See status definitions [here](/geps/overview/#gep-states).)
 
@@ -29,7 +29,7 @@ Multiple data plane proxies offer optional configuration for budgeted retries, i
 
 Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget" reduces the risk of a high number of retries across clients. It allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures.
 
-While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations creates a challenge for a consensus on the correct location for the configuration. This proposal aims to determine where retry budget's should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. 
+While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations create a challenge for reaching consensus on the correct location for the configuration. This proposal aims to determine where retry budgets should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification.
 
 ### Background on implementations
 
@@ -49,7 +49,9 @@ By default, Envoy uses a static threshold for retries. But when configured, Envo
 
 The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. In practice, this functions similarly to Envoy's retry budget implementation, as it is configured in a single location and measures the ratio of retry requests to original requests across all traffic destined for the service.
 
-Linkerd uses [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/) as the default configuration to specify retries to a service, but - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - supports counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload.
+(Note that budgeted retries have become less commonly used since Linkerd added support for counted retries in [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5): ServiceProfile operates at the level of a backend workload, meaning that it cannot configure anything at the level of a route, but counted retries can be configured using annotations on Service, HTTPRoute, and GRPCRoute.)
+
+For both counted retries and budgeted retries, the actual retry logic is implemented by the `linkerd2-proxy` making the request on behalf of an application workload. The receiving proxy is not aware of the retry configuration at all.
 
 Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. The `retryBudget` field of the ServiceProfile spec can be configured with the following optional parameters:
 
@@ -81,11 +83,144 @@ The implementation of a version of Linkerd's `ttl` parameter within Envoy might
 
 ### Go
 
-TODO
+```golang
+// BackendTrafficPolicy defines the configuration for how traffic to a
+// target backend should be handled.
+//
+// Support: Extended
+//
+// +optional
+// <gateway:experimental>
+//
+// Note: there is no Override or Default policy configuration.
+type BackendTrafficPolicy struct {
+    metav1.TypeMeta   `json:",inline"`
+    metav1.ObjectMeta `json:"metadata,omitempty"`
+
+    // Spec defines the desired state of BackendTrafficPolicy.
+    Spec BackendTrafficPolicySpec `json:"spec"`
+
+    // Status defines the current state of BackendTrafficPolicy.
+    Status PolicyStatus `json:"status,omitempty"`
+}
+
+type BackendTrafficPolicySpec struct {
+  // TargetRefs identifies the API objects to apply policy to.
+  // Currently, Backends (i.e. Service, ServiceImport, or any
+  // implementation-specific backendRef) are the only valid API
+  // target references.
+  // +listType=map
+  // +listMapKey=group
+  // +listMapKey=kind
+  // +listMapKey=name
+  // +kubebuilder:validation:MinItems=1
+  // +kubebuilder:validation:MaxItems=16
+  TargetRefs []LocalPolicyTargetReference `json:"targetRefs"`
+
+  // Retry defines the configuration for when to retry a request to a target backend.
+  //
+  // Implementations SHOULD retry on connection errors (disconnect, reset, timeout,
+  // TCP failure) if a retry stanza is configured.
+  //
+  // Support: Extended
+  //
+  // +optional
+  // <gateway:experimental>
+  Retry *CommonRetryPolicy `json:"retry,omitempty"`
+
+  // SessionPersistence defines and configures session persistence
+  // for the backend.
+  //
+  // Support: Extended
+  //
+  // +optional
+  SessionPersistence *SessionPersistence `json:"sessionPersistence,omitempty"`
+}
+
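+// CommonRetryPolicy defines the configuration for when to retry a request.
+type CommonRetryPolicy struct {
+    // BudgetPercent defines the maximum percentage of requests to the
+    // target backend that may be retries.
+    //
+    // Support: Extended
+    //
+    // +optional
+    BudgetPercent *int `json:"budgetPercent,omitempty"`
+
+    // BudgetInterval defines the window of time over which requests are
+    // counted when calculating the retry budget.
+    //
+    // Support: Extended
+    //
+    // +optional
+    BudgetInterval *Duration `json:"budgetInterval,omitempty"`
+
+    // MinRetryRate defines the minimum rate of retries that is always
+    // allowed, regardless of the budget percentage.
+    //
+    // Support: Extended
+    //
+    // +optional
+    MinRetryRate *RequestRate `json:"minRetryRate,omitempty"`
+}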
+
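+// RequestRate expresses a rate of requests over a given period of time.
+type RequestRate struct {
+    // Count specifies the number of requests per time interval.
+    //
+    // Support: Extended
+    //
+    // +optional
+    Count *int `json:"count,omitempty"`
+
+    // Interval specifies the period of time over which Count requests
+    // are measured.
+    //
+    // Support: Extended
+    //
+    // +optional
+    Interval *Duration `json:"interval,omitempty"`
+}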
+
+// Duration is a string value representing a duration in time. The format is
+// as specified in GEP-2257, a strict subset of the syntax parsed by Golang
+// time.ParseDuration.
+//
+// +kubebuilder:validation:Pattern=`^([0-9]{1,5}(h|m|s|ms)){1,4}$`
+type Duration string
+```
 
 ### YAML
 
-TODO
+```yaml
+apiVersion: gateway.networking.x-k8s.io/v1alpha1
+kind: BackendTrafficPolicy
+metadata:
+  name: traffic-policy-example
+spec:
+  targetRefs:
+    - group: ""
+      kind: Service
+      name: foo
+  retry:
+    budgetPercent: 20
+    budgetInterval: 10s
+    minRetryRate:
+      count: 3
+      interval: 1s
+  sessionPersistence:
+    ...
+status:
+  ancestors:
+  - ancestorRef:
+      kind: Mesh
+      namespace: istio-system
+      name: istio
+    controllerName: "istio.io/mesh-controller"
+    conditions:
+    - type: "Accepted"
+      status: "False"
+      reason: "Invalid"
+      message: "BackendTrafficPolicy field sessionPersistence is not supported for Istio mesh traffic."
+  - ancestorRef:
+      kind: Gateway
+      namespace: foo-ns
+      name: foo-ingress
+    controllerName: "istio.io/mesh-controller"
+    conditions:
+    - type: "Accepted"
+      status: "False"
+      reason: "Invalid"
+      message: "BackendTrafficPolicy fields retry.budgetPercent, retry.budgetInterval and retry.minRetryRate are not supported for Istio ingress gateways."
+  ...
+```
 
 ## Conformance Details
 
diff --git a/geps/gep-3388/metadata.yaml b/geps/gep-3388/metadata.yaml
@@ -2,7 +2,7 @@ apiVersion: internal.gateway.networking.k8s.io/v1alpha1
 kind: GEPDetails
 number: 3388
 name: Retry Budgets
-status: Provisional
+status: Implementable
 # Any authors who contribute to the GEP in any way should be listed here using
 # their Github handle.
 authors: