-
Notifications
You must be signed in to change notification settings - Fork 564
API for Default Gateways #3887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
API for Default Gateways #3887
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kflynn The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
#### 2. Controlling which Gateway is the Default | ||
|
||
Since Chihiro must be able to control which Gateway is the default, selecting | ||
the default Gateway must be an active configuration step taken by Chihiro, | ||
rather than any kind of implicit behavior. To that end, the Gateway resource | ||
will gain a new field, `spec.isDefault`: | ||
|
||
```go | ||
type GatewaySpec struct { | ||
// ... other fields ... | ||
IsDefault *bool `json:"isDefault,omitempty"` | ||
} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is roughly what I expected, I agree that there's no real way to do this other than a mechanism like this.
However, this option does not handle what happens when there are multiple Gateway-reconciling implementations in the cluster, each with a separate GatewayClass.
Does each GatewayClass have a separate default Gateway? Does each Route with no parentRefs
get claimed by each GatewayClass?
Or is there also a mechanism for choosing a default GatewayClass? If so, we'd need to define similar rules for that.
tbh I am not sure which option is better here, although I think I slightly lean towards letting each GatewayClass have its own default, since each implementation will probably have a separate datapath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had roughly similar thoughts. Is the scope of the default Gateway is only per implementation? If so, why not defining the default Gateway in the GatewayClass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to comments above. I'd suggest that each Gateway implementation can implement their own limitations on how many default Gateways they'll support, with a baseline expectation that that will be at most one per GatewayClass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started writing the API with the perspective that we would want to take the conservative route of supporting a single default Gateway per cluster, and changed my stance very very reluctantly as I gamed out the operational aspect of needing to change the default Gateway while the cluster is running, and the implementation aspect of enforcing such a limitation.
At this point, I do not consider it practical to limit the number of default Gateways at either the Gateway level or at the GatewayClass level. So, for example, if you have three Gateways marked as default, then if Ana creates a defaulted Route, it will bind to all three Gateways, irrespective of which GatewayClass the Gateways have.
I ended up here for a few reasons:
- If we enforce a single default Gateway, Chihiro cannot change the default Gateway without either incurring downtime, or having no way to make sure the new default Gateway will work. Enforcing one default per GatewayClass would seem to run into pretty much exactly the same set of implementation issues as enforcing one per cluster, and it doesn't actually improve the operational situation since it means that Chihiro only gets a safe way to do zero-downtime default Gateway swaps by involving a second GatewayClass, which is to say a second implementation that may or may not behave the same as the one they want to be using -- and I definitely don't think we know enough to declare that Chihiro will never want to swap to a different default Gateway in the same GatewayClass.
- Explaining one-per-GatewayClass to users of Gateway API is much more complex than explaining that any Gateway marked as a default will claim a defaulted Route. It's simply easier to reason about things without thinking about which GatewayClass owns a which Gateway.
- Finally, what we do really gain with a limit, per-cluster or otherwise? I was initially thinking of one per cluster because it sounds like it's the easiest to reason about... but really, it's not, because it introduces some ugly implementation details in an eventually-consistent world, and in turn those implementation details are likely to result in fragile enforcement that will leak. Ultimately, I don't think we gain anything: we shouldn't be imposing that kind of operational policy on Chihiro.
I'll see if I can express this better in the GEP. Always happy to hear suggestions for how, of course. 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed reasoning. I think you have mostly convinced me.
Few more questions/concerns that comes with multiple default gateways:
-
Whats the intersection of default Gateways and Namespaces? Is the route going to be registered to Gateways in the same namespace AND gateways that allow refs from a different namespaces?
-
Assuming an org structure of default gateway per namespace. (while some gateways, allow refs from other namespaces for a reason). When Ana registers another route, she believes it is registered in her namespace but apparently it also impacts gateways in different namespaces. Probably it is expected, but would be useful a) document that and b) make sure Ana understands the implication of what she did?
Sidenote: I think this is a case where templating/dev independence tooling/ are a better fit for default gateways. (worth calling it out in alternatives to the whole GEP)
- [perhaps not as significant concern] If we have X Default Gateways, Ana, when she registered her route to the default gateway(s), she has no idea where her route is registered. This can increase route conflicts and could potentially cause unexpected behavior (from Ana's perspective) as precedence rules are utilized more heavily.
#### 1. Binding a Route to the Default Gateway | ||
|
||
For Ana to indicate that a Route should use the default Gateway, she MUST | ||
leave `parentRefs` empty in the `spec` of the Route, for example: | ||
|
||
```yaml | ||
apiVersion: gateway.networking.k8s.io/v1 | ||
kind: HTTPRoute | ||
metadata: | ||
name: my-route | ||
spec: | ||
rules: | ||
- backendRefs: | ||
- name: my-service | ||
port: 80 | ||
``` | ||
|
||
would route _all_ HTTP traffic arriving at the default Gateway to `my-service` | ||
on port 80. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is really the only way to do this, but I think it may be important to be clear about what Ana is trading away here - the main thing is certainty. This defaulting behavior means that, if Ana does this, she is handing over control of where her app is advertised to Ian and Chihiro. It should absolutely not be used for anything that has specific security requirements, for example.
I say this, because I feel confident that at least someone will get screwed by that at some point in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, fascinating to read this and realize that I've been considering that as "obvious" -- so, yes, I'll add that, it's a great point to call out.
Reluctantly, we must therefore conclude that option 1 is the only viable | ||
choice. Therefore: Gateways MUST NOT attempt to enforce a single default | ||
Gateway, and MUST allow Routes with no `parentRefs` to bind to _all_ Gateways | ||
that have `spec.isDefault` set to `true`. This is simplest to implement, it | ||
permits zero-downtime changes to the default Gateway, and it allows for | ||
testing of the new default Gateway before the old one is deleted. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also doesn't change the security profile, since Ana is abrogating any responsibility for where her app is advertised by using this functionality.
|
||
#### 1. Binding a Route to the Default Gateway | ||
|
||
For Ana to indicate that a Route should use the default Gateway, she MUST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but I believe Ana lacks a method to verify the presence of a default gateway being deployed in the cluster or to check if there is at least one listener on the Gateway permitting attachmentent of her Route.
Consider a scenario where Ana creates a Route and expects the default gateway to be attached, but no status condition is propagated. How should she troubleshoot this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exactly the same problem as if Ana creates a route with a parentRef
naming a Gateway that doesn't exist: Ana will create the route, no status
will show up, Ana will probably give it another minute or so, it still won't show up, and at that point Ana will (probably) start doublechecking if she typo'd the name of the Gateway, or picked the wrong Gateway by accident, or looking to see what Gateways she sees deployed, or calling Chihiro to ask what's going on, etc.
This isn't to say that it's a great situation for Ana, it's just an already existing situation. This is actually one of the reasons I tend to argue that Ana should always have RBAC to see which Gateways are deployed: it can really help Ana to be able to see the Gateways, although in any given situation, yeah, she might actually need to talk to Chihiro about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's also important to note that Ana may not be the one checking this at all. See: #3385
Ana may be simply handing Chihiro their manifests in a "this is how you deploy, and expose my application" sort of manner, and knowledge of the Gateways on the cluster is irrelevant from their perspective.
In my mind, this is one of the key reasons to do default gateways.
Mesh traffic is defined by using a Service as a `parentRef` rather than a | ||
Gateway. As such, there is no case where a default Gateway would be used for | ||
mesh traffic. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens when parentRef
is not empty due to needing to attach the route to a service? Would it violate the default-gateway requirement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth being explicit in here, probably, but my expectation is that, if you are specifying parentRefs for mesh, you're specifying parentRefs, and so you have to specify all parentRefs.
It's not going to be possible to do defaulting for GAMMA parentRefs because we don't have a clean object to put that setting in, I think. So, if you want a HTTPRoute to do both, you have to pick both the Service and the Gateway to attach to. That seems reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youngnick did a fine job of summarizing my thought process here. I'll amplify this here, although it's discussed in the N/S section:
Note also that if Ana specifies any
parentRefs
, the default Gateway MUST
NOT claim the Route unless of theparentRefs
explicitly names the default
Gateway. To do otherwise makes it impossible for Ana to define mesh-only
Routes, or to specify a Route that is meant to use only a specific Gateway
that is not the default. This implies that for Ana to specify a Route intended
to serve both north/south and east/west roles, she MUST explicitly specify the
Gateway inparentRefs
, even if that Gateway happens to be the default
Gateway.
(ugh. s/unless of the/unless one of the/ in the second line...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to revisit this though. We all believe default gateway is going to be useful, but at the same time, the vision for mesh is that you can use the same httproute for mesh. This increases the bar a little to onboard to GAMMA if you need to now mention all your gateways explicitly (of course this is a forward looking concern where users heavily rely on default gateways).
I understand the problem here and why you rely on empty parentRefs, but would like to think about how we can allow "default Gateway PLUS this service". perhaps this is another +1 to not rely on empty parentRef for this? xRef: #3887 (comment)
would route _all_ HTTP traffic arriving at the default Gateway to `my-service` | ||
on port 80. | ||
|
||
Note that Ana MUST omit `parentRefs` entirely: specifying an empty array for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it's going to lead to a lot of accidental exposures of a Route. This is a huge change. Currently the absence of parentRefs
means that a Route is guaranteed to be internal only, and now we're changing it to mean "attached to any Gateway that claims to be default."
I would much rather have a way to intentionally attach to default Gateways. Maybe that's an entry in ParentRefs that is just kind: DefaultGateway
, or maybe that's an entirely separate field which is attachToDefaultGateways: True
. IMO, that would result in much clearer and more predictable behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems likely that this discussion will continue while I'm out on vacation; FTR I am okay with something like kind: DefaultGateway
as a pseudo type more than I am with a separate field in the spec
. But either would be okay if we have to do something more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed in the Gateway API meeting today, I'd love to get more opinions here. I tend to feel that Ana is unlikely to be using Routes with no parentRef
s because they don't anything, and as such I don't like things like
parentRefs:
- kind: DefaultGateway
because it becomes just extra lines to type.
HOWEVER, I had also thought that a Route with no parentRefs
would fail validation, and testing it this morning, nope, you can absolutely create such Routes. So I'm quite curious what others think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I thought about this some more, and I have a few thoughts why I think that adding a DefaultGateway
pseudo-Kind is probably better:
- Because you can currently add a route with no
parentRefs
, the current proposal is a behavior change with no API change. As @robscott says, this runs a high risk of screwing Ana over. If she has created HTTPRoutes and expected them not to be advertised yet, then Chihiro flips thedefault
switch totrue
on a Gatewyay, those HTTPRoutes will now be exposed with no action on Ana's part. - Not having any definite API change also makes this way harder to write an Extended feature for (since it seems reasonable that not every implementation will want to do this). Adding a
DefaultGateway
pseudo-Kind means that implementations can fail processing of thatparentRoute
and do the current behavior (which is to ignore the HTTPRoute because ownership reasons). - Using a pseudo-Kind also allows Ana to have a Route that binds to a GAMMA service and "whatever the default Gateway is", which doesn't seem like the best idea to me, but may be something that some Anas will want.
- Since one of the primary uses for this defaulting config is write-once-read-many use cases like adding HTTPRoutes into Helm charts, having two extra lines is a small cost for the writer, while giving significant benefits (as above) to the reader and user (that is, Ana).
So yes, on reflection, I think I vastly prefer the kind: DefaultGateway
solution.
With that said, I'm out on vacation in the next few hours until the end of the GEP refinement phase, so I should also be clear: I am not blocking this either way. I just think kind: DefaultGateway
is a better design. But I would be reluctantly okay with the current "empty parentRefs" approach as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would "pseudo-API" means? like no real API just a kind name? +1 if yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if instead of having an API you just had a convention?
# Use the cluster's default gateway
parentRefs:
- namespace: default
name: default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, would you block users from actually deploy such a gateway in the default namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"users" would not have permissions in the default
namespace. (If they did, then they could replace the "kubernetes" service...) Only a cluster-admin ought to be able to create a Gateway there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if instead of having an API you just had a convention?
# Use the cluster's default gateway parentRefs: - namespace: default name: default
I think we'd have to avoid using the default namespace here and instead use a namespace that's unlikely to have been used before, but that seems like an easily solvable problem. The convention based approach is definitely the least disruptive here as it means no API changes are needed.
With that said, I'd be worried that this kind of approach would make it impossible to have more than one default gateway at a time, which would be annoying in the case that an admin wanted to migrate from one default gateway to another for a cluster. If for example a user was using GatewayClass foo as default and wanted to move to a Gateway of class bar as default, they'd have to either manually switching everything over that specified the default Gateway, or do a disruptive replacement of the default Gateway.
If the relationship was a bit more abstract (ie attachToDefaultGateways
), you could have more than one default Gateway active at a time, and gradually swap DNS/whatever over after you'd verified that the new Gateway was meeting your needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An idea that came to mind, what if an alternative to parentRefs
was implemented as a selector combination (namespace and label).
A cluster administrator could set up (in the future) a MutatingAdmissionPolicy (or equivalent mutation/policy) that looked at Gateways being created and, if they had no parent refs, could default the namespace/label selector to any value they like.
Perhaps they choose some label that has a key/value that implies default, it wouldn't really matter.
This would allow them to create many different gateways that match their selector, and over time adjust that set.
In theory it would allow other use cases of selecting the attachment in more dynamic ways too, not sure if that's something that had come up?
/cc |
Signed-off-by: Flynn <[email protected]>
Signed-off-by: Flynn <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. I have a few comments.
The main thing that I need to think about is semantically how we seem to be pushing for "Default Gateway" meaning "one Gateway, which is the default", whereas I wonder if we should be looking at it more like "by default, this Gateway will accept unbound routes given a scope".
Do we really want DefaultGateway
or do we actually want AcceptRoutesWithNoParentRefsWithinScope<Scope>: true
?
I made some comments about this, and I'm interested in hearing your perspective 🤔
|
||
#### 1. Binding a Route to the Default Gateway | ||
|
||
For Ana to indicate that a Route should use the default Gateway, she MUST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's also important to note that Ana may not be the one checking this at all. See: #3385
Ana may be simply handing Chihiro their manifests in a "this is how you deploy, and expose my application" sort of manner, and knowledge of the Gateways on the cluster is irrelevant from their perspective.
In my mind, this is one of the key reasons to do default gateways.
would route _all_ HTTP traffic arriving at the default Gateway to `my-service` | ||
on port 80. | ||
|
||
Note that Ana MUST omit `parentRefs` entirely: specifying an empty array for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On first read, a DefaultGateway
has some appeal. This insinuates a new API type, which I think we would need to think through carefully amidst the advent of ListenerSets
, as I expect the desire to have a ListenerSet
be bound by default when no ParentRefs
are present is real.
Support for multiple default Gateways in a cluster is not one of the original | ||
goals of this GEP. However, allowing Chihiro to control which Gateway is the | ||
default - including being able to switch which Gateway is the default at | ||
runtime, without requiring downtime - is a goal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think semantically throughout this GEP so far I'm reading the meaning of "Default Gateway" to mean "A single Gateway which routes will apply to by default".
However, I'm wondering if what a "Default Gateway" should more definitively be positioned as is: "a configuration on a Gateway that indicates that it will automatically accept routes within a certain scope by default". Under that definition, I think we would make multiple "Default Gateways" a goal.
🤔 curious as to your thoughts.
- We could define the default Gateway as a Gateway with a magic name, e.g. | ||
"default". This doesn't actually make things that much simpler for Ana | ||
(she'd still have to specify `parentRefs`), and it raises questions about | ||
Chihiro's ability to control which Routes can bind to the default Gateway, | ||
as well as how namespacing would work -- it's especially unhelpful for Ana | ||
if she has to know the namespace of the default Gateway in order to use it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this sentiment, but I think it's worth noting here that we once discussed using API defaults to reduce the burden on Ana.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the other point against a change like this is that it would be a breaking change for anyone who already had a Gateway using whichever magic name we decided on here.
github won't let me comment on the pre-existing parts of the GEP, but:
|
on port 80. | ||
|
||
Note that Ana MUST omit `parentRefs` entirely: specifying an empty array for | ||
`parentRefs` MUST fail validation. If a Route with an empty array for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you know the difference between omitted and an empty array? I don't think you can know that by the time you get to the Go type.
I assume this is just implemented as a minimum length requirement on the openapi validation on the CRD?
AFAICT this isn't on the schema today so is potentially a breaking change. Have you checked if validation ratcheting would cover this case? Does the entire window of supported K8s versions for GW API enable validation ratcheting by default?
To enumerate Routes bound to the default Gateway, Ana can look for Routes with | ||
no `parentRefs` specified, and then check the `status.parents` of those Routes | ||
to see if the Route has been claimed. This will also tell Ana which Gateway is | ||
the default, even if she doesn't have RBAC to query Gateway resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that a form of privilege escalation?
Add API description to GEP-3793. I'm sure this'll be totally noncontroversial. 😇
/kind gep