GEP 3388 Retry Budget API Implementation #3607

Merged
Changes from 24 commits
34 commits
4b4fd67
apis: add implementation for GEP-3388 HTTPRoute Retry Budget
ericdbishop Feb 10, 2025
1993417
fmt and add descriptions for parameters
ericdbishop Feb 10, 2025
81ee318
Move GEP 3388 to Experimental
ericdbishop Feb 10, 2025
7b61c97
make generate
ericdbishop Feb 10, 2025
6661e47
Minor change
ericdbishop Feb 10, 2025
e359c3e
Require both parameters of RequestRate
ericdbishop Feb 10, 2025
b166a68
Begin fixing Retry description. Add defaults, some validation, in Com…
ericdbishop Feb 11, 2025
1a122fa
Taking the liberty of renaming CommonRetryPolicy to RetryConstraint
ericdbishop Feb 12, 2025
5a02c8e
Shamelessly copying from backendlbpolicy and backendtlspolicy to conf…
ericdbishop Feb 12, 2025
5f2b55b
Fleshing out the description for RetryConstraint
ericdbishop Feb 12, 2025
d6bcae5
refactor codegen scripts to make it easier to generate two clients
dprotaso Feb 3, 2025
6f04b8b
Attempting to match the experimental API structure that dprotaso made…
ericdbishop Feb 13, 2025
dcc5729
Delete files that were generated before moving to apisx
ericdbishop Feb 13, 2025
ededf82
undo commenting
ericdbishop Feb 13, 2025
1f5d276
merge main
ericdbishop Feb 27, 2025
1c2d826
Working to fix code gen following merge
ericdbishop Feb 27, 2025
3af4d86
Fix api group name
ericdbishop Feb 27, 2025
49d3e5c
capitalize RFC 2119 keyword
ericdbishop Feb 27, 2025
b1c899f
Addressing comments from code review
ericdbishop Feb 27, 2025
7837c0a
missing file
ericdbishop Feb 27, 2025
5b61365
merge main
ericdbishop Feb 27, 2025
d3741e2
fix imports
ericdbishop Feb 28, 2025
c624de3
Modify descriptions; add greater validation
ericdbishop Mar 2, 2025
7afef1e
Update apisx/v1alpha2/backendtrafficpolicy.go
ericdbishop Mar 3, 2025
c4910de
Not complete, but adding CEL tests for backend traffic policy
ericdbishop Mar 3, 2025
ad75b71
fix missing retryConstraint in test struct; add tests for invalid config
ericdbishop Mar 3, 2025
4d7bc3f
move to apis/v1alpha1
ericdbishop Mar 3, 2025
7c8fe59
add main_test
ericdbishop Mar 4, 2025
6fc4796
rename file
ericdbishop Mar 4, 2025
bc35cad
fix missing quotes :)
ericdbishop Mar 4, 2025
9a1f63b
fix CEL condition
ericdbishop Mar 4, 2025
46431b3
move validation message to interval field directly
ericdbishop Mar 4, 2025
a7a263b
remove unused apisx/v1alpha2 file
ericdbishop Mar 4, 2025
9a1d005
remove more files from before merging experimental api versions
ericdbishop Mar 4, 2025
160 changes: 160 additions & 0 deletions apisx/v1alpha2/backendtrafficpolicy.go
@@ -0,0 +1,160 @@
/*
Copyright 2025 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package v1alpha2

import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:storageversion
// +kubebuilder:resource:categories=gateway-api,shortName=btrafficpolicy
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
//
// BackendTrafficPolicy is a Direct Attached Policy.
// +kubebuilder:metadata:labels="gateway.networking.k8s.io/policy=Direct"

// BackendTrafficPolicy defines the configuration for how traffic to a
// target backend should be handled.
type BackendTrafficPolicy struct {
// Support: Extended
//
// +optional
// <gateway:experimental>

metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

// Spec defines the desired state of BackendTrafficPolicy.
Spec BackendTrafficPolicySpec `json:"spec"`

// Status defines the current state of BackendTrafficPolicy.
Status PolicyStatus `json:"status,omitempty"`
}

// BackendTrafficPolicyList contains a list of BackendTrafficPolicies
// +kubebuilder:object:root=true
type BackendTrafficPolicyList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []BackendTrafficPolicy `json:"items"`
}

// BackendTrafficPolicySpec defines the desired state of BackendTrafficPolicy.
// Note: there is no Override or Default policy configuration.
type BackendTrafficPolicySpec struct {
// TargetRefs identifies API object(s) to apply this policy to.
// Currently, Backends (A grouping of like endpoints such as Service,
// ServiceImport, or any implementation-specific backendRef) are the only
// valid API target references.
//
// Currently, a TargetRef can not be scoped to a specific port on a
// Service.
//
// +listType=map
// +listMapKey=group
// +listMapKey=kind
// +listMapKey=name
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=16
TargetRefs []LocalPolicyTargetReference `json:"targetRefs"`

// RetryConstraint defines the configuration for when to allow or prevent
// further retries to a target backend, by dynamically calculating a 'retry
// budget'. This budget is calculated based on the percentage of incoming
// traffic composed of retries over a given time interval. Once the budget
// is exceeded, additional retries will be rejected.
//
// For example, if the retry budget interval is 10 seconds, there have been
// 1000 active requests in the past 10 seconds, and the allowed percentage
// of requests that can be retried is 20% (the default), then 200 of those
// requests may be composed of retries. Active requests will only be
// considered for the duration of the interval when calculating the retry
// budget. Retrying the same original request multiple times within the
// retry budget interval will lead to each retry being counted towards
// calculating the budget.
//
// Configuring a RetryConstraint in BackendTrafficPolicy is compatible with
// HTTPRoute Retry settings for each HTTPRouteRule that targets the same
// backend. While the HTTPRouteRule Retry stanza can specify whether a
// request will be retried, and the number of retry attempts each client
// may perform, RetryConstraint helps prevent cascading failures such as
// retry storms during periods of consistent failures.
//
// After the retry budget has been exceeded, additional retries to the
// backend MUST return a 503 response to the client.
//
// Additional configurations for defining a constraint on retries MAY be
// defined in the future.
//
// Support: Extended
//
// +optional
// <gateway:experimental>
RetryConstraint *RetryConstraint `json:"retry,omitempty"`

// SessionPersistence defines and configures session persistence
// for the backend.
//
// Support: Extended
//
// +optional
SessionPersistence *SessionPersistence `json:"sessionPersistence,omitempty"`
}
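
The worked example in the RetryConstraint comment above (1000 active requests, a 20% budget, 200 permitted retries) can be sketched as plain arithmetic. This is only an illustration, not code from the PR: `allowedRetries` is a hypothetical helper, and rounding down is an assumption, since the API comment does not say how fractional budgets are rounded.

```go
package main

import "fmt"

// allowedRetries is a hypothetical helper (not part of the PR). It returns
// how many of the active requests seen during the budget interval may be
// retries, given a budget percentage in [0, 100]. Rounding down via integer
// division is an assumption; the API comment does not specify rounding.
func allowedRetries(activeRequests, budgetPercent int) int {
	return activeRequests * budgetPercent / 100
}

func main() {
	// Values taken from the doc comment: 1000 active requests, 20% budget.
	fmt.Println(allowedRetries(1000, 20))
}
```

Under these assumptions, a budgetPercent of 0 permits no percentage-based retries at all, which is where the MinRetryRate floor becomes relevant.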

// RetryConstraint defines the configuration for when to retry a request.
type RetryConstraint struct {
// BudgetPercent defines the maximum percentage of active requests that may
// be made up of retries.
//
// Support: Extended
Member

We need some more work on this, but I'd argue that we should have a concept of "this field MUST be supported if you support this feature" (retry budgets in this case).

/cc @youngnick @mlavacca @shaneutt

//
// +optional
// +kubebuilder:default=20
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=100
BudgetPercent *int `json:"budgetPercent,omitempty"`

// BudgetInterval defines the duration in which requests will be considered
// for calculating the budget for retries.
//
// Support: Extended
//
// +optional
// +kubebuilder:default=10s
Member

Seems like something we'll want to define a min and max value for.

Contributor Author

I believe we were thinking that 0s could potentially be used as shorthand for the current behavior offered by Envoy. As far as a maximum I could pick some arbitrarily high value, but I'm not sure what would be appropriate. Thoughts @mikemorris?

Contributor

@mikemorris mikemorris Feb 28, 2025

@htuch @tonya11en Are y'all interested in enabling the existing "in-flight" Envoy functionality, or would a strict over-time measurement generally be preferable for most use cases? (Thinking if we actually just want to exclude 0s altogether)

Contributor

Prefer not to have magical 0s value and just skip interval if the implementation doesn't support that capability.

Contributor

I agree with @htuch (Hi, Harvey! 😂) that avoiding 0s being magical is a good idea.

Contributor

It doesn't seem unreasonable to support existing Envoy behavior to me and just say skip budget interval in that case.

Contributor

@mikemorris mikemorris Feb 28, 2025

I think the only tricky part with that would be that making interval an optional field makes it potentially easier to create invalid configurations for impls requiring an interval and not supporting "in-flight" - might need to add an additional conformance feature name for this case.

This is somewhat the inverse of how features would typically be implemented by using an optional field rather than omitting it (and having the optional field required for implementations not supporting the feature), not sure if that might cause any conformance difficulties.

Contributor Author

Do I need to change the kubebuilder annotations in any way in order for the Envoy implementation to be free to exclude budgetInterval? I don't need to remove the default, correct?

// +kubebuilder:validation:XValidation:message="budgetInterval can not be greater than one hour or less than one second",rule="!(duration(self) < duration('1s') || duration(self) > duration('1h'))"
BudgetInterval *Duration `json:"budgetInterval,omitempty"`

// MinRetryRate defines the minimum rate of retries that will be allowable
// over a specified duration of time.
//
// The effective overall minimum rate of retries targeting the backend
// service may be much higher, as there can be any number of clients which
// are applying this setting locally.
//
// This ensures that requests can still be retried during periods of low
// traffic, where the budget for retries may be calculated as a very low
// value.
//
// Support: Extended
//
// +optional
// +kubebuilder:default={count: 10, interval: 1s}
MinRetryRate *RequestRate `json:"minRetryRate,omitempty"`
}
24 changes: 24 additions & 0 deletions apisx/v1alpha2/doc.go
@@ -0,0 +1,24 @@
/*
Copyright 2025 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

// Package v1alpha2 contains API Schema definitions for the gateway.networking.x-k8s.io
// API group.
//
// +k8s:openapi-gen=true
// +kubebuilder:object:generate=true
// +groupName=gateway.networking.x-k8s.io
// +groupGoName=Experimental
package v1alpha2
51 changes: 51 additions & 0 deletions apisx/v1alpha2/shared_types.go
@@ -0,0 +1,51 @@
/*
Copyright 2025 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package v1alpha2

import (
v1 "sigs.k8s.io/gateway-api/apis/v1"
v1alpha2 "sigs.k8s.io/gateway-api/apis/v1alpha2"
)

type (
// +k8s:deepcopy-gen=false
SessionPersistence = v1.SessionPersistence
// +k8s:deepcopy-gen=false
Duration = v1.Duration
// +k8s:deepcopy-gen=false
PolicyStatus = v1alpha2.PolicyStatus
// +k8s:deepcopy-gen=false
LocalPolicyTargetReference = v1alpha2.LocalPolicyTargetReference
)

// RequestRate expresses a rate of requests over a given period of time.
//
// +kubebuilder:validation:XValidation:message="interval can not be zero or greater than one hour",rule="!(duration(self.interval) == duration('0s') || duration(self.interval) > duration('1h'))"
type RequestRate struct {
// Count specifies the number of requests per time interval.
//
// Support: Extended
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=1000000
Count *int `json:"count,omitempty"`
Member

We need to avoid unbounded values here, I'd recommend some kind of max here, even if that max is very high.

Contributor

@mikemorris mikemorris Feb 28, 2025

The context in which this is currently used is for MinRetryRate as "an override for very low volume traffic" as @htuch described in https://github.com/kubernetes-sigs/gateway-api/pull/3607/files#r1964340430 but this struct type is (intentionally) sufficiently generic that it could be re-used in the future for #326. As such, I would defer to @htuch on what a "safe" maximum might entail for this at scale with a potential denominator as short as 1ms (I'm assuming there may be integer type maximums to keep in mind for implementations too)

Contributor

Given this is going to have a variable denominator, e.g. hour, I think you're just going to have to go with some type based limit or something that would make sense over a long period.


// Interval specifies the divisor of the rate of requests, the amount of
// time during which the given count of requests occur.
//
// Support: Extended
Interval *Duration `json:"interval,omitempty"`
Member

This also feels like something that should have a min and max value. I'm assuming 1s is acceptable for a min, unclear what a good max is. Any ideas @htuch or @mikemorris?

Contributor

@mikemorris mikemorris Feb 28, 2025

Can we actually enforce min/max constraints on GEP-2257 Duration types?

If so, I think 1ms is the minimum expressible through that format (even if logically a per-second rate may be more common, expressing rates with a denominator in milliseconds may be helpful for higher-throughput use cases to work around a count max constraint), assuming we want to exclude the "divide by zero" shorthand for Envoy's current behavior.

The max expressible appears to be 99999h - in practice I'd expect even a rate per hour may be unlikely to be configured, instead preferring to enforce a more normal distribution over longer time spans, but maybe 1h, 12h or 24h is a reasonable max?

Member

Good question, I think we'd have to define min+max with CEL using > and < comparisons. A rough example is here: https://github.com/kubernetes-sigs/gateway-api/blob/main/apis/v1/httproute_types.go#L320.

I agree that any of 1h, 12h, and 24h would be reasonable here, but I'd prefer to start with the lower value 1h since it would be much harder to tighten this validation retroactively than to loosen it.

Contributor

@kflynn kflynn Feb 28, 2025

The maximum GEP-2257 Duration is actually 99999h99999m99999s99999ms, which I would agree is quite a bit larger than would probably make sense here. I'm honestly not quite sure what kind of math we can do in CEL here, but a really simple check might be to disallow anything with more than two digits before h, or more than four before m or s?

Member

Fortunately CEL supports a duration type that is compatible with our duration, so we should be able to have pretty reasonable comparisons here.

}