
Extending support for rds_topology metadata handling #4992

Open

wants to merge 2 commits into v3.0
Conversation

nielsvogell

Context

ProxySQL currently supports discovery of AWS RDS Multi-AZ DB Cluster topologies by querying the mysql.rds_topology table on every nth read-only check (with n = monitor_aws_rds_topology_discovery_interval). The rds_topology table contains information about the cluster instances in the following format:

SELECT * FROM mysql.rds_topology;
+----+----------------------------------------------+------+
| id | endpoint                                     | port | 
+----+----------------------------------------------+------+
|  1 | <cluster-name>-instance-1...amazonaws.com    | 3306 |
|  2 | <cluster-name>-instance-2...amazonaws.com    | 3306 |
|  3 | <cluster-name>-instance-3...amazonaws.com    | 3306 |
+----+----------------------------------------------+------+

The same table is used for RDS databases in Blue/Green Deployments, where instead of cluster instances, it contains information about the blue and the green endpoints. Its schema is extended by three additional columns (role, status, and version), so that the data looks like this:

SELECT * FROM mysql.rds_topology;
+----+---------------------------------+------+------------------------------+-----------+---------+
| id | endpoint                        | port | role                         |  status   | version |
+----+---------------------------------+------+------------------------------+-----------+---------+
| 12 | <identifier>...com              | 3306 | BLUE_GREEN_DEPLOYMENT_SOURCE | AVAILABLE | 1.0     |
| 34 | <identifier>-green-<rand>...com | 3306 | BLUE_GREEN_DEPLOYMENT_TARGET | AVAILABLE | 1.0     |
+----+---------------------------------+------+------------------------------+-----------+---------+

Blue/Green Deployments

An RDS blue/green deployment is created for an existing (blue) RDS database (DB instance or cluster) and creates a read-only copy (green) of that database on a new server. The green environment may then be modified (e.g., by performing an engine version upgrade). When ready, a switchover can be performed, which promotes the green database to a standalone database and changes its endpoint to the original (blue) server's name.

Changes in this PR

Note: this PR only contains the first commit towards a full feature. The remaining commits are in this working branch. I thought that reviewing them step by step might make the process easier.

Also note that currently this feature is by default disabled (monitor_aws_rds_topology_discovery_interval = 0), which is recommended until all required changes for the full feature are merged.

This PR adds support for reading the extended rds_topology table for both the READ_ONLY and the INNODB_READ_ONLY check. To preserve backwards compatibility, the topology query remains unchanged by default; only when the discovered topology matches one found in blue/green deployments (an even number of entries, as opposed to the three found in Multi-AZ DB Clusters) does the monitor switch the query to include the additional columns.

High-level change list:

  • Three new monitoring task types (add-ons to MON_INNODB_READ_ONLY and MON_READ_ONLY)
  • An additional check in process_discovered_topology to see whether the query result matches expectations and whether to switch to querying for blue/green deployment metadata
  • Refactored processing of the discovered topology in preparation for handling the status and role columns (unlike Multi-AZ DB Clusters, the server records from a blue/green deployment will need to be updated dynamically during a switchover)

How was this tested

  1. Regression check: added a Multi-AZ DB Cluster instance endpoint as a server into a hostgroup with the read_only check enabled. Verified that the other two servers are discovered as expected.
  2. New functionality: used an RDS MySQL instance in a blue/green deployment that has the mysql.rds_topology table with records as in the example above. Added the blue endpoint as a server into a hostgroup with read_only check enabled. Verified that the green endpoint was discovered and added to the runtime server list as expected.

Future steps

To make the feature complete, we need to add a few more partial features. Roughly, these are:

  • Making sure that updates to servers happen for both reader and writer hostgroups (as these might change during a switchover and a server might appear in both)
  • Actively preventing traffic from going to green servers before the switchover
  • Temporarily increasing the read_only polling frequency during a switchover to minimize downtime (this allows ProxySQL to recognize when the green database is promoted, without waiting for DNS changes to take effect and be noticed)
  • Forcing a DNS cache update after a switchover
  • Extending support to RDS Aurora clusters and custom endpoints


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @nielsvogell, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for supporting AWS RDS Blue/Green deployments within the monitor. It extends the existing rds_topology discovery mechanism to recognize the Blue/Green table format, switch to a more detailed query, and process the results, although full handling of roles and statuses is left for future commits. The feature is disabled by default.

Highlights

  • Blue/Green Deployment Topology Discovery: I've added initial support for discovering servers involved in AWS RDS Blue/Green deployments by querying the extended mysql.rds_topology table which includes role, status, and version columns.
  • Dynamic Query Switching: The monitor now dynamically determines whether the rds_topology table contains Multi-AZ cluster information (3 rows) or Blue/Green deployment information (even number of rows) and switches the query to fetch the appropriate columns on subsequent checks.
  • Refactored Topology Processing: The logic for processing discovered topology has been refactored to handle the different structures of the rds_topology table for Multi-AZ and Blue/Green deployments, preparing for future handling of roles and statuses.
  • New Monitor Task Types: Introduced new monitor task types (MON_READ_ONLY__AND__AWS_RDS_BLUE_GREEN_TOPOLOGY_DISCOVERY, etc.) to represent the different topology queries.

@renecannao
Contributor

Can one of the admins verify this patch?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces foundational changes to support AWS RDS Blue/Green deployment topology discovery. It adds new monitoring task types, SQL queries for the extended mysql.rds_topology schema, and refactors the logic for processing discovered topologies. The changes appear to be a solid step towards the full feature, with good attention to detail in parsing and state management. Key improvements include more robust checks for Multi-AZ cluster identification and the use of unordered_set for better performance in process_discovered_topology.

A few areas to double-check include the updated logic for added_new_server in MySQL_HostGroups_Manager.cpp to ensure it aligns with the intended downstream effects, and the error logging in monitor_read_only_process_ready_tasks to avoid potentially misleading messages for non-topology tasks. The use of assert for runtime_mysql_servers should also be reviewed for production robustness.

@@ -3456,47 +3502,70 @@ void MySQL_Monitor::process_discovered_topology(const std::string& originating_s
char *query=(char *)"SELECT DISTINCT hostname FROM monitor_internal.mysql_servers ORDER BY hostname";
proxy_debug(PROXY_DEBUG_ADMIN, 4, "%s\n", query);
monitordb->execute_statement(query, &error, &cols, &affected_rows, &runtime_mysql_servers);
assert(runtime_mysql_servers);


Severity: high

Using assert(runtime_mysql_servers) is good for debugging, but it becomes a no-op in release builds. If monitordb->execute_statement can return nullptr for runtime_mysql_servers without setting the error flag (e.g., if the table is empty), this could lead to a dereference of a null pointer when accessing runtime_mysql_servers->rows.

Consider adding an explicit null check for runtime_mysql_servers after the execute_statement call, in addition to checking the error flag, to prevent potential crashes in production if the function contract allows *resultset to be nullptr on success with no rows.

Suggested change (the suggestion as originally posted had an unclosed else block and passed query to a format string without a matching %s; cleaned up here):

assert(runtime_mysql_servers);

becomes:

if (error) {
	proxy_error("Error on %s : %s\n", query, error);
} else if (runtime_mysql_servers == nullptr) {
	proxy_error("runtime_mysql_servers is NULL for query '%s', but no error was returned from execute_statement\n", query);
	return; // or some other appropriate error handling
}

Author


I concur. assert is used in combination with MyHG->execute_query, which is where I got it from but it is actually not required here. Will remove that line.

@@ -1077,7 +1077,7 @@ class MySQL_HostGroups_Manager : public Base_HostGroups_Manager<MyHGC> {
void set_Readyset_status(char *hostname, int port, enum MySerStatus status);
unsigned long long Get_Memory_Stats();

void add_discovered_servers_to_mysql_servers_and_replication_hostgroups(const vector<tuple<string, int, int>>& new_servers);
void add_discovered_servers_to_mysql_servers_and_replication_hostgroups(const vector<tuple<string, int, long, int>>& new_servers);


Severity: medium

The addition of long to the tuple suggests that hostgroup IDs can now be larger than int. Ensure that all related code (e.g., the database schema and other functions using hostgroup IDs) is updated to handle long values, to prevent potential truncation or overflow issues.


Author


I aligned the data type with the existing extraction. Reading further in the code, it seems that this long is actually converted to uint32_t implicitly right after. Did this have a reason in the original implementation? Happy to change both the new and the existing variable to use int or unsigned int instead.

Comment on lines +7375 to +7378
int res = create_new_server_in_hg(hostgroup_id, srv_info, srv_opts);
if (added_new_server < 0) {
added_new_server = res;
}


Severity: medium

The logic for added_new_server has changed. Previously, it would reflect the status of the last server addition attempt. Now, it reflects the status of the first attempt if it was successful (0), or the error code of the first failure if all prior attempts also failed.

For example:

  • If create_new_server_in_hg returns 0 for success, -1 for failure:
    • Old: S, S, F -> added_new_server = -1
    • New: S, S, F -> added_new_server = 0 (from the first S)
    • Old: F, S, S -> added_new_server = 0
    • New: F, S, S -> added_new_server = -1 (from the first F)

Confirm if this change in how added_new_server is determined aligns with the intended behavior. If the goal is to know if any server was successfully added, a boolean flag might be clearer.

Author


I might be missing something, but wouldn't F, S, S result in added_new_server = 0?
First server:
res = -1, added_new_server = -1 -> added_new_server = -1
Second server:
res = 0, added_new_server = -1 -> added_new_server = 0
Third server:
res = 0, added_new_server = 0 -> no change because added_new_server >= 0

The desired behavior is to update the internal structures if at least one new server was added. The previous implementation didn't do that.

Comment on lines 7701 to 7720
if (fields && is_aws_rds_topology_query_task(mmsd->get_task_type()) && mysql_row_matches_query_task(field_names, mmsd->get_task_type())) {
// Process the read_only field as above and store the first server
vector<MYSQL_ROW> discovered_servers;
discovered_servers.push_back(row);

// Store the remaining servers
int num_rows = mysql_num_rows(mmsd->result);
for (int i = 1; i < num_rows; i++) {
row = mysql_fetch_row(mmsd->result);
discovered_servers.push_back(row);
VALGRIND_DISABLE_ERROR_REPORTING;
if (row[j]) {
if (!strcmp(row[j], "0") || !strcasecmp(row[j], "OFF"))
read_only = 0;
}
VALGRIND_ENABLE_ERROR_REPORTING;
}
}

// Store the remaining servers
int num_rows = mysql_num_rows(mmsd->result);
for (int i = 1; i < num_rows; i++) {
MYSQL_ROW row = mysql_fetch_row(mmsd->result);
discovered_servers.push_back(row);
}

// Process the discovered servers and add them to 'runtime_mysql_servers' (process only for AWS RDS Multi-AZ DB Clusters)
if (!discovered_servers.empty() && is_aws_rds_multi_az_db_cluster_topology(discovered_servers)) {
process_discovered_topology(originating_server_hostname, discovered_servers, mmsd->reader_hostgroup);
// Process the discovered servers and add them to 'runtime_mysql_servers'
if (!discovered_servers.empty()) {
process_discovered_topology(originating_server_hostname, discovered_servers, mmsd, num_fields);
}
} else {
proxy_error("mysql_fetch_fields returns NULL, or mysql_num_fields is incorrect. Server %s:%d . See bug #1994\n", mmsd->hostname, mmsd->port);
rc = (*proxy_sqlite3_bind_null)(statement, 5); ASSERT_SQLITE_OK(rc, mmsd->mondb);
}


Severity: medium

The else block starting on line 7717 logs an error: proxy_error("mysql_fetch_fields returns NULL, or mysql_num_fields is incorrect. Server %s:%d . See bug #1994\n", mmsd->hostname, mmsd->port);.
This error message might be misleading if the task is a simple read_only check (i.e., is_aws_rds_topology_query_task(mmsd->get_task_type()) is false).
In such a case, fields might be valid and num_fields might be 1 (for read_only), which is correct for a non-topology task. The current structure implies this error is logged if it's not a topology task that also matches its expected fields.

Consider restructuring the conditional logic to ensure this error is only logged when it's genuinely an issue with field fetching/matching for an expected topology query, and not for standard read_only tasks.

Suggested change
if (fields && is_aws_rds_topology_query_task(mmsd->get_task_type()) && mysql_row_matches_query_task(field_names, mmsd->get_task_type())) {
// Process the read_only field as above and store the first server
vector<MYSQL_ROW> discovered_servers;
discovered_servers.push_back(row);
// Store the remaining servers
int num_rows = mysql_num_rows(mmsd->result);
for (int i = 1; i < num_rows; i++) {
row = mysql_fetch_row(mmsd->result);
discovered_servers.push_back(row);
VALGRIND_DISABLE_ERROR_REPORTING;
if (row[j]) {
if (!strcmp(row[j], "0") || !strcasecmp(row[j], "OFF"))
read_only = 0;
}
VALGRIND_ENABLE_ERROR_REPORTING;
}
}
// Store the remaining servers
int num_rows = mysql_num_rows(mmsd->result);
for (int i = 1; i < num_rows; i++) {
MYSQL_ROW row = mysql_fetch_row(mmsd->result);
discovered_servers.push_back(row);
}
// Process the discovered servers and add them to 'runtime_mysql_servers' (process only for AWS RDS Multi-AZ DB Clusters)
if (!discovered_servers.empty() && is_aws_rds_multi_az_db_cluster_topology(discovered_servers)) {
process_discovered_topology(originating_server_hostname, discovered_servers, mmsd->reader_hostgroup);
// Process the discovered servers and add them to 'runtime_mysql_servers'
if (!discovered_servers.empty()) {
process_discovered_topology(originating_server_hostname, discovered_servers, mmsd, num_fields);
}
} else {
proxy_error("mysql_fetch_fields returns NULL, or mysql_num_fields is incorrect. Server %s:%d . See bug #1994\n", mmsd->hostname, mmsd->port);
rc = (*proxy_sqlite3_bind_null)(statement, 5); ASSERT_SQLITE_OK(rc, mmsd->mondb);
}
if (fields && is_aws_rds_topology_query_task(mmsd->get_task_type()) && mysql_row_matches_query_task(field_names, mmsd->get_task_type())) {
// Process the read_only field as above and store the first server

Author


This error message slipped into the wrong if clause after refactoring. I'll move it one level up, to where it came from.
