Skip to content

Conversation

@jizhuozhi
Copy link

@jizhuozhi jizhuozhi commented Jul 20, 2025

Description

This PR introduces metadata-based node filtering for consul discovery, supporting Consul service discovery based upstreams

Motivation

Currently, APISIX selects upstream nodes based on service name from discovery without additional filtering logic. In real-world scenarios like canary release or swimlane routing, users often tag backend instances with custom metadata (e.g., version, env, lane, dc, and etc) and expect the gateway to route only to specific subsets.

This change allows users to define a metadata_match field in discovery_args configuration, which filters nodes before load balancing based on their metadata values.

Changes

  • Consul discovery:
  • Include Service.Meta in the node definition and respect its weight if available (aligned with Eureka).
  • Filter with discovery_args when fetch upstream nodes.
  • Tests: Add test cases to cover both:

    • discovery-based upstream (Consul) with metadata_match

Example Usage

upstream:
  type: roundrobin
  scheme: http
  discovery_type: eureka
  discovery_args:
    metadata_match:
      lane:
      - prod
      - canary
      dc:
      - us-east-1
      - us-east-2

Only nodes with metadata.lane in [prod, canary] and metadata.dc in [us-east-1, us-east-2] will be used for load balancing.


Fixes

Fixes #12464


Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (no breaking changes)

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 20, 2025
@Baoyuantop
Copy link
Contributor

Hi @jizhuozhi, thanks for your contribution.

I think it is useful for node filtering of Consul discovery. But I don't understand static upstream filtering. It seems that I need to mark metadata for each node in the upstream object, and then use metadata_match to configure filtering? Because each node is manually defined, if it is not needed, can I just add or delete the node?

@jizhuozhi
Copy link
Author

Hi @jizhuozhi, thanks for your contribution.

I think it is useful for node filtering of Consul discovery. But I don't understand static upstream filtering. It seems that I need to mark metadata for each node in the upstream object, and then use metadata_match to configure filtering? Because each node is manually defined, if it is not needed, can I just add or delete the node?

Here is a unified approach: I only determine whether there is a filtering rule, without distinguishing whether it is a service discovery or static list.

However, according to my previous experience as a gateway administrator, there will be corresponding business developers who temporarily add rules for some debugging considerations but do not want to change the original instance list (for quick adjustment)

@Baoyuantop
Copy link
Contributor

there will be corresponding business developers who temporarily add rules for some debugging considerations but do not want to change the original instance list

Thanks for your reply. Could you please describe the scenario in detail? Why can't the existing methods solve this problem?

@moonming moonming requested a review from Copilot July 21, 2025 08:31

This comment was marked as outdated.

@jizhuozhi
Copy link
Author

jizhuozhi commented Jul 21, 2025

Thanks for your reply. Could you please describe the scenario in detail? Why can't the existing methods solve this problem?

It is not a production environment, but it is common in the testing and verification phase. We need to specify specific instances frequently (for example, to capture flame graphs for performance analysis), but we need to add other instances back after deleting them, so we need to specify instances by filtering.

In fact, we also matched according to the dynamic colored metadata when loading balancing but not predefine the routes, similar to https://github.com/kitex-contrib/loadbalance-tagging (I am also using lua to implement the same capabilities, but this is not within the scope of this discussion).

@Baoyuantop
Copy link
Contributor

Baoyuantop commented Jul 22, 2025

Thanks for your reply. Could you please describe the scenario in detail? Why can't the existing methods solve this problem?

It is not a production environment, but it is common in the testing and verification phase. We need to specify specific instances frequently (for example, to capture flame graphs for performance analysis), but we need to add other instances back after deleting them, so we need to specify instances by filtering.

In fact, we also matched according to the dynamic colored metadata when loading balancing but not predefine the routes, similar to https://github.com/kitex-contrib/loadbalance-tagging (I am also using lua to implement the same capabilities, but this is not within the scope of this discussion).

I still have doubts about what is in Example Usage. Do you mean that if I need to adjust the nodes used, I don't need to change the content of the nodes list, but adjust metadata_match?
By the way, are you using static nodes or service discovery?

@jizhuozhi
Copy link
Author

jizhuozhi commented Jul 22, 2025

I still have doubts about what is in Example Usage. Do you mean that if I need to adjust the nodes used, I don't need to change the content of the nodes list, but adjust metadata_match?

Yes, just adjust metadata_match (but the discussion of this use case has been separated from this PR). For the runtime, it is a unified filtering rule for the service list that does not need to distinguish the source.

By the way, are you using static nodes or service discovery?

We are currently using Consul on kubernetes. When I was working in another company a few years ago, we were using cloud virtual machines (or EC2). The cloud platform did not provide an API interface, but we used scripts to synchronize static instance lists at regular intervals. At this time, the static list was also a kind of dynamic discovery. (why not filter in the script? Because we were lazy:)

@Baoyuantop
Copy link
Contributor

Hi @jizhuozhi, There is currently no modification to the upstream schema, which means that the current modifications in the upstream only serve consul. Is it more appropriate to put all these logics into the consul module?

@jizhuozhi
Copy link
Author

jizhuozhi commented Jul 25, 2025

Hello, @Baoyuantop, thanks for your reply.

Hi @jizhuozhi, There is currently no modification to the upstream schema, which means that the current modifications in the upstream only serve consul. Is it more appropriate to put all these logics into the consul module?

Not only consul, but also Eureka (which has already supported metadata in apisix) will inherit this function. In my forked dashboard has already supported configuring metadata_match for Consul and Eureka
https://github.com/jizhuozhi/apisix-dashboard/blob/9bd72c82e4fcbfa0d2bf34420280028c2ca853c8/web/src/components/Upstream/components/ServiceDiscovery.tsx#L30-L37

(The examples in the PR description are just examples, because this allows testing without the registry, and we don't need to care about service discovery or static nodes.)

And the current discovery package is responsible for pulling all instances, and the filtering in discovery is effective for all service names and upstreams, it means that I can only configure general filtering rules, but cannot configure differentiated matching for different routes and upstreams. This is our current online effect

image

@jizhuozhi
Copy link
Author

Hello @Baoyuantop , I see. Currently, upstream has passed the discovery args to nodes, so the loop can be closed in discovery. I will modify it.

        local new_nodes, err = dis.nodes(up_conf.service_name, up_conf.discovery_args)
        if not new_nodes then
            return HTTP_CODE_UPSTREAM_UNAVAILABLE, "no valid upstream node: " .. (err or "nil")
        end

@jizhuozhi jizhuozhi changed the title feat(upstream): filter nodes in upstream with metadata feat(consul): filter nodes in upstream with metadata Jul 25, 2025
@jizhuozhi
Copy link
Author

Hello @Baoyuantop , PTAL, thanks :)

We also have Spring Cloud applications with Eureka, but I have no time to write test case now, so I will create a new PR for Eureka later.

@Baoyuantop
Copy link
Contributor

Hi @jizhuozhi, we are still discussing whether to accept the feature of this PR, and we need to reach a consensus before we can start the review. Since there is no separate issue to discuss this issue, you need to clearly tell the maintainer what this feature does and why it is needed in the PR description (the current description already exists).

The examples in the PR description are just examples, because this allows testing without the registry, and we don't need to care about service discovery or static nodes.

This is inappropriate and you need to replace it with an example from a real scenario. The current example will confuse other maintainers. In the latest changes, I see that you have cancelled the upstream related code. The current PR seems to focus on the filtering of consul services. Please update the PR description to reflect this. Thanks again for your contribution.

@jizhuozhi
Copy link
Author

This is inappropriate and you need to replace it with an example from a real scenario. The current example will confuse other maintainers.

Thank you for your reminder, the PR content has been updated

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 27, 2025
@Baoyuantop
Copy link
Contributor

Please fix the failed CI.

GET /t
--- response_body
{"service_a":[{"host":"127.0.0.1","port":30511,"weight":1}],"service_b":[{"host":"127.0.0.1","port":8002,"weight":1}]}
{"service_a":[{"host":"127.0.0.1","metadata":{"service_a_version":"4.0"},"port":30511,"weight":1}],"service_b":[{"host":"127.0.0.1","metadata":{"service_b_version":"4.1"},"port":8002,"weight":1}]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it affect this place? I don't think this test should be changed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are now relying on metadata, we need to persist it when persisting.

Copy link
Member

@SkyeYoung SkyeYoung Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jizhuozhi Ok. Another question is, if this is added, do we now lack a test case for the situation where there is "no metadata"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add it :)

@Baoyuantop Baoyuantop added the wait for update wait for the author's response in this issue/PR label Aug 1, 2025
SkyeYoung
SkyeYoung previously approved these changes Aug 14, 2025
Copy link
Member

@SkyeYoung SkyeYoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SkyeYoung SkyeYoung requested a review from membphis August 14, 2025 06:30
Copy link
Member

@SkyeYoung SkyeYoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the same time, it seems that some explanations are missing, such as which tests are conducted without metadata. Including these comments would better assist others in reviewing your code. Thx.

@SkyeYoung SkyeYoung self-requested a review August 18, 2025 06:50
@jizhuozhi
Copy link
Author

I'm sorry, due to continuous business trips and illness, there has been no progress. I will continue to work on it next week.

@jizhuozhi
Copy link
Author

Hello @Baoyuantop , the test case t/plugin/mcp-bridge.t seems not effected by this PR, maybe need retry the CI, thanks :)

@Baoyuantop
Copy link
Contributor

The main branch should have corresponding optimizations, you can merge it.

@Baoyuantop
Copy link
Contributor

Hello @Baoyuantop , the test case t/plugin/mcp-bridge.t seems not effected by this PR, maybe need retry the CI, thanks :)

The main branch has fixed this bug.

@jizhuozhi
Copy link
Author

I will see the reason and fix the testcases

@silence8013
Copy link

This feature is fantastic, I can't wait!

Comment on lines 23 to 24
local function do_metadata(node, metadata)
local metadata = node.metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the second parameter, metadata?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the "rename metadata_match to metadata" commit partially overlaps with this one.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will fix that caused by batch renaming variables

@Baoyuantop
Copy link
Contributor

Hi @jizhuozhi, following up on the previous review comments. Let us know if you have an update. Thanks!

@Baoyuantop Baoyuantop added wait for update wait for the author's response in this issue/PR and removed user responded labels Dec 24, 2025
@jizhuozhi
Copy link
Author

I apologize that due to business demands, KPI pressure, and year-end performance reviews, my response to this request was delayed in Q4. I can now dedicate my full attention to this PR. I apologize again for the delay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files. user responded wait for update wait for the author's response in this issue/PR

Projects

Status: 👀 In review

Development

Successfully merging this pull request may close these issues.

feat: I hope we could filter nodes via metadata when config upstream

5 participants