Extending support for rds_topology metadata handling #4992
base: v3.0
Conversation
Summary of Changes
Hello @nielsvogell, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for supporting AWS RDS Blue/Green deployments within the monitor. It extends the existing `rds_topology` discovery mechanism to recognize the Blue/Green table format, switch to a more detailed query, and process the results, although full handling of roles and statuses is left for future commits. The feature is disabled by default.
Highlights

- Blue/Green Deployment Topology Discovery: I've added initial support for discovering servers involved in AWS RDS Blue/Green deployments by querying the extended `mysql.rds_topology` table, which includes `role`, `status`, and `version` columns.
- Dynamic Query Switching: The monitor now dynamically determines whether the `rds_topology` table contains Multi-AZ cluster information (3 rows) or Blue/Green deployment information (even number of rows) and switches the query to fetch the appropriate columns on subsequent checks.
- Refactored Topology Processing: The logic for processing discovered topology has been refactored to handle the different structures of the `rds_topology` table for Multi-AZ and Blue/Green deployments, preparing for future handling of roles and statuses.
- New Monitor Task Types: Introduced new monitor task types (`MON_READ_ONLY__AND__AWS_RDS_BLUE_GREEN_TOPOLOGY_DISCOVERY`, etc.) to represent the different topology queries.
Can one of the admins verify this patch?
Code Review

This pull request introduces foundational changes to support AWS RDS Blue/Green deployment topology discovery. It adds new monitoring task types, SQL queries for the extended `mysql.rds_topology` schema, and refactors the logic for processing discovered topologies. The changes appear to be a solid step towards the full feature, with good attention to detail in parsing and state management. Key improvements include more robust checks for Multi-AZ cluster identification and the use of `unordered_set` for better performance in `process_discovered_topology`.

A few areas to double-check include the updated logic for `added_new_server` in `MySQL_HostGroups_Manager.cpp`, to ensure it aligns with the intended downstream effects, and the error logging in `monitor_read_only_process_ready_tasks`, to avoid potentially misleading messages for non-topology tasks. The use of `assert` for `runtime_mysql_servers` should also be reviewed for production robustness.
lib/MySQL_Monitor.cpp (Outdated)

```cpp
@@ -3456,47 +3502,70 @@ void MySQL_Monitor::process_discovered_topology(const std::string& originating_s
	char *query=(char *)"SELECT DISTINCT hostname FROM monitor_internal.mysql_servers ORDER BY hostname";
	proxy_debug(PROXY_DEBUG_ADMIN, 4, "%s\n", query);
	monitordb->execute_statement(query, &error, &cols, &affected_rows, &runtime_mysql_servers);
	assert(runtime_mysql_servers);
```
Using `assert(runtime_mysql_servers)` is good for debugging, but it becomes a no-op in release builds. If `monitordb->execute_statement` can return `nullptr` for `runtime_mysql_servers` without setting the `error` flag (e.g., if the table is empty), this could lead to a dereference of a null pointer when accessing `runtime_mysql_servers->rows`.

Consider adding an explicit null check for `runtime_mysql_servers` after the `execute_statement` call, in addition to checking the `error` flag, to prevent potential crashes in production if the function contract allows `*resultset` to be `nullptr` on success with no rows.
Suggested change (replacing the bare `assert`):

```cpp
if (error) {
	proxy_error("Error on %s : %s\n", query, error);
} else {
	if (runtime_mysql_servers == nullptr) {
		proxy_error("runtime_mysql_servers is NULL but no error was returned from execute_statement for query %s. Possible memory corruption?\n", query);
		return; // Or some other appropriate error handling
	}
	assert(runtime_mysql_servers);
}
```
I concur. `assert` is used in combination with `MyHG->execute_query`, which is where I got it from, but it is actually not required here. Will remove that line.
```diff
@@ -1077,7 +1077,7 @@ class MySQL_HostGroups_Manager : public Base_HostGroups_Manager<MyHGC> {
 	void set_Readyset_status(char *hostname, int port, enum MySerStatus status);
 	unsigned long long Get_Memory_Stats();
 
-	void add_discovered_servers_to_mysql_servers_and_replication_hostgroups(const vector<tuple<string, int, int>>& new_servers);
+	void add_discovered_servers_to_mysql_servers_and_replication_hostgroups(const vector<tuple<string, int, long, int>>& new_servers);
```
The addition of `long` to the tuple suggests that hostgroup IDs can now be larger than `int`. Ensure that all related code (e.g., database schema, other functions using hostgroup IDs) is updated to handle `long` values to prevent potential truncation or overflow issues.

```cpp
void add_discovered_servers_to_mysql_servers_and_replication_hostgroups(const vector<tuple<string, int, long, int>>& new_servers);
```
I aligned the data type with the existing extraction. Reading further in the code, it seems that this `long` is actually converted to `uint_32` implicitly right after. Did this have a reason in the original implementation? Happy to change both the new and the existing variable to use `int` or `unsigned int` instead.
```cpp
int res = create_new_server_in_hg(hostgroup_id, srv_info, srv_opts);
if (added_new_server < 0) {
	added_new_server = res;
}
```
The logic for `added_new_server` has changed. Previously, it would reflect the status of the last server addition attempt. Now, it reflects the status of the first attempt if it was successful (0), or the error code of the first failure if all prior attempts also failed.

For example, if `create_new_server_in_hg` returns 0 for success (S), -1 for failure (F):

- Old: `S, S, F` -> `added_new_server = -1`
- New: `S, S, F` -> `added_new_server = 0` (from the first S)
- Old: `F, S, S` -> `added_new_server = 0`
- New: `F, S, S` -> `added_new_server = -1` (from the first F)

Confirm if this change in how `added_new_server` is determined aligns with the intended behavior. If the goal is to know if any server was successfully added, a boolean flag might be clearer.
I might be missing something, but wouldn't `F, S, S` result in `added_new_server = 0`?

- First server: `res = -1`, `added_new_server = -1` -> `added_new_server = -1`
- Second server: `res = 0`, `added_new_server = -1` -> `added_new_server = 0`
- Third server: `res = 0`, `added_new_server = 0` -> no change because `added_new_server >= 0`

The desired behavior is to update the internal structures if at least one new server was added. The previous implementation didn't do that.
lib/MySQL_Monitor.cpp (Outdated)
```diff
 if (fields && is_aws_rds_topology_query_task(mmsd->get_task_type()) && mysql_row_matches_query_task(field_names, mmsd->get_task_type())) {
 	// Process the read_only field as above and store the first server
 	vector<MYSQL_ROW> discovered_servers;
 	discovered_servers.push_back(row);
 
-	// Store the remaining servers
-	int num_rows = mysql_num_rows(mmsd->result);
-	for (int i = 1; i < num_rows; i++) {
-		row = mysql_fetch_row(mmsd->result);
-		discovered_servers.push_back(row);
-		VALGRIND_DISABLE_ERROR_REPORTING;
-		if (row[j]) {
-			if (!strcmp(row[j], "0") || !strcasecmp(row[j], "OFF"))
-				read_only = 0;
-		}
-		VALGRIND_ENABLE_ERROR_REPORTING;
-	}
+	// Store the remaining servers
+	int num_rows = mysql_num_rows(mmsd->result);
+	for (int i = 1; i < num_rows; i++) {
+		MYSQL_ROW row = mysql_fetch_row(mmsd->result);
+		discovered_servers.push_back(row);
+	}
 
-	// Process the discovered servers and add them to 'runtime_mysql_servers' (process only for AWS RDS Multi-AZ DB Clusters)
-	if (!discovered_servers.empty() && is_aws_rds_multi_az_db_cluster_topology(discovered_servers)) {
-		process_discovered_topology(originating_server_hostname, discovered_servers, mmsd->reader_hostgroup);
+	// Process the discovered servers and add them to 'runtime_mysql_servers'
+	if (!discovered_servers.empty()) {
+		process_discovered_topology(originating_server_hostname, discovered_servers, mmsd, num_fields);
 	}
 } else {
 	proxy_error("mysql_fetch_fields returns NULL, or mysql_num_fields is incorrect. Server %s:%d . See bug #1994\n", mmsd->hostname, mmsd->port);
 	rc = (*proxy_sqlite3_bind_null)(statement, 5); ASSERT_SQLITE_OK(rc, mmsd->mondb);
 }
```
The `else` block starting on line 7717 logs an error: `proxy_error("mysql_fetch_fields returns NULL, or mysql_num_fields is incorrect. Server %s:%d . See bug #1994\n", mmsd->hostname, mmsd->port);`.

This error message might be misleading if the task is a simple read_only check (i.e., `is_aws_rds_topology_query_task(mmsd->get_task_type())` is false). In such a case, `fields` might be valid and `num_fields` might be 1 (for `read_only`), which is correct for a non-topology task. The current structure implies this error is logged whenever the task is not a topology task that also matches its expected fields.

Consider restructuring the conditional logic to ensure this error is only logged when it's genuinely an issue with field fetching/matching for an expected topology query, and not for standard read_only tasks.
This error message slipped into the wrong `if` clause after refactoring. I'll move it one level up, where it came from.
Context

ProxySQL currently supports the discovery of AWS RDS Multi-AZ Cluster topologies by querying the `mysql.rds_topology` table on every nth (with `n = monitor_aws_rds_topology_discovery_interval`) read-only check. The `rds_topology` table contains information about the cluster instances, one row per instance. The same table is used for RDS databases in Blue/Green Deployments, where instead of cluster instances, it contains information about the blue and the green endpoints. Its schema is extended by three additional columns (`role`, `status`, and `version`).

Blue/Green Deployments

An RDS blue/green deployment is created for an existing (blue) RDS database (DB instance or cluster) and creates a read-only copy (green) of that database on a new server. The green environment may be modified (e.g., by performing an engine version upgrade). When ready, a switchover can be performed, which will promote the green database to a standalone database and change its endpoint to the original (blue) server's name.
Changes in this PR

Note: this PR only contains the first commit towards the full feature. The remaining commits are in this working branch. I thought that reviewing them step by step might make the process easier.

Also note that currently this feature is disabled by default (`monitor_aws_rds_topology_discovery_interval = 0`), which is recommended until all required changes for the full feature are merged.

This PR adds support for reading the extended `rds_topology` table for both the `READ_ONLY` and the `INNODB_READ_ONLY` check. To preserve backwards compatibility, by default the topology query remains the same; only when the discovered topology matches one found in blue/green deployments (an even number of entries, as opposed to the three found in Multi-AZ DB Clusters) do we switch the query to include the additional columns.

High-level change list:
- Extended the read-only check task types (`MON_INNODB_READ_ONLY` and `MON_READ_ONLY`)
- Added a check in `process_discovered_topology` to see if the query result matches the expectations and whether to switch to querying for blue/green deployment metadata
- Prepared handling of the `status` and `role` columns (different to Multi-AZ DB Clusters, the server records from a blue/green deployment will need to be dynamically updated during a switchover)

How was this tested
- Multi-AZ DB Cluster: added one server of a cluster to a hostgroup with the `read_only` check. Verified that the other two servers are discovered as expected.
- Blue/green deployment: created a `mysql.rds_topology` table with records as in the example above. Added the blue endpoint as a server into a hostgroup with the `read_only` check enabled. Verified that the green endpoint was discovered and added to the runtime server list as expected.

Future steps
To make the feature complete, we need to add a few more partial features. Roughly these are:

- Increasing the `read_only` polling frequency during a switchover to minimize downtime (this allows recognizing when the green database is promoted, without waiting for DNS changes to take effect and be noticed by ProxySQL)