Description
Environment:
- Node.js:
  - Failing: v22.7.0
  - Working: v20.x LTS [Add exact version tested, e.g., v20.11.1]
- mysql2: [Add exact version used, e.g., 3.9.x - check via `npm list mysql2`]
- Sequelize: [Add exact version used, e.g., 6.35.x - check via `npm list sequelize`]
- Database: Tested against both an old MySQL Server (version unknown, but confirmed working correctly with PHP/Symfony using `utf8mb4`) and a new/current MySQL Server (version [Add version if known, e.g., 8.x]). The issue occurs identically with both servers. Database, tables, and columns are configured for `utf8mb4` / `utf8mb4_general_ci`.
- OS: macOS [Add version if known, e.g., Sonoma 14.x]
- Other: Using Express.js framework.
Bug Description:
When the Node.js application (using Sequelize with `mysql2`) is subjected to high-frequency concurrent requests (approx. 100+ requests in a short burst, triggered programmatically from a React client), character encoding corruption occurs when reading `utf8mb4` data from the MySQL database.
Specifically, UTF-8 multi-byte characters representing accented letters (e.g., 'ó' as bytes `c3 b3`, 'á' as bytes `c3 a1`) are returned by the driver/Sequelize layer as the single bytes of their apparent ISO-8859-1 (Latin-1) representation (e.g., byte `f3` instead of `c3 b3`, byte `e1` instead of `c3 a1`).
This happens progressively. Initial requests under load return the correct UTF-8 bytes, but subsequent requests start returning the incorrect single bytes. This corrupted data, when sent to a client expecting UTF-8, results in replacement characters (e.g., ``) being displayed, turning "Jimmy Róbár" into "Jimmy Rbr".
The issue is consistently reproducible under this specific high-frequency/concurrent load pattern originating from a programmatic client (React app). It does not occur with single or sequential requests (e.g., from Postman, manual browser refreshes), nor does it depend on the MySQL server version (occurs with both old and new servers).
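The byte values quoted in this description can be checked with Node's built-in `Buffer`. This is a minimal sketch that only illustrates the two encodings of the same characters; it does not involve the database or the driver.

```javascript
// Encode the same characters as UTF-8 and as Latin-1 and compare the bytes.
// '\u00f3' is 'ó' and '\u00e1' is 'á'; escapes avoid source-file encoding issues.
const oUtf8 = Buffer.from('\u00f3', 'utf8').toString('hex');
const oLatin1 = Buffer.from('\u00f3', 'latin1').toString('hex');
const aUtf8 = Buffer.from('\u00e1', 'utf8').toString('hex');
const aLatin1 = Buffer.from('\u00e1', 'latin1').toString('hex');

console.log(oUtf8, oLatin1); // c3b3 f3
console.log(aUtf8, aLatin1); // c3a1 e1
```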
Steps to Reproduce (Conceptual):
- Set up a Node.js v22.7.0 application using the specified versions of Sequelize and `mysql2`.
- Connect to a MySQL database containing `utf8mb4` data with multi-byte accented characters (e.g., Hungarian names like 'Róbert', 'Zámbó'). Ensure Sequelize `dialectOptions: { charset: 'utf8mb4' }` is configured.
- Create a simple API endpoint that reads this data via Sequelize (e.g., using `Model.findAll` or `Model.findByPk`).
- Use a client application or load testing script to send a high number of concurrent or very rapid sequential requests (e.g., 100+ within a few seconds) to this endpoint.
- Log the retrieved string data and its hex representation on the server-side immediately after the Sequelize query returns.
- Observe the hex values in the logs: initial requests should show correct UTF-8 bytes (e.g., `c3b3` for 'ó'), while later requests under continued load will show incorrect bytes (e.g., `f3`).
- Note: Providing a minimal, self-contained reproducible code example is difficult due to the load-dependent nature.
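The load step above could be approximated with a small script like the following. This is only a sketch: the URL and route are hypothetical, the function names are made up for illustration, and it assumes Node 18+ (global `fetch`). It counts response bodies that contain the replacement character U+FFFD, which is what a UTF-8 decoder produces for invalid bytes.

```javascript
// Hypothetical endpoint; replace with the route that reads via Sequelize.
const TARGET_URL = 'http://localhost:3000/api/userdata/20213746';

// Fire `count` requests concurrently and count the bodies containing U+FFFD.
// `fetchImpl` is injectable so the function can be exercised without a server.
async function sendBurst(url, count = 100, fetchImpl = globalThis.fetch) {
  const bodies = await Promise.all(
    Array.from({ length: count }, async () => {
      const res = await fetchImpl(url);
      return res.text(); // text() always decodes the body as UTF-8
    })
  );
  const corrupted = bodies.filter((body) => body.includes('\ufffd')).length;
  return { total: bodies.length, corrupted };
}

// Usage (requires the server to be running):
// sendBurst(TARGET_URL, 100).then(console.log);
```

Since the logs in this report show the corruption appearing only after continued load, a single burst may not be enough; repeating the call several times is probably necessary.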
Expected Behavior:
The data retrieved via Sequelize/`mysql2` should consistently maintain the correct `utf8mb4` encoding, preserving accented characters (e.g., "Róbert" with bytes `c3 b3` for 'ó').
Actual Behavior:
Under high-frequency/concurrent load on Node.js v22.7.0, the application starts receiving corrupted data from the Sequelize/`mysql2` layer where UTF-8 bytes are replaced by their Latin-1 equivalents.
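Log Evidence (from server-side logging immediately after `Sequelize.findAll` for `userId: 20213746`; the Hungarian log text "Adatok KÖZVETLENÜL Sequelize után" means "Data DIRECTLY after Sequelize"):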
Example showing data before corruption occurs (initial requests under load):
[DEBUG][2025-03-27T20:54:50.515Z] listUserdataByUserId userId: 20213746. Adatok KÖZVETLENÜL Sequelize után:
-> Userdata ID 44799: First='Jimmy Róbár' (Hex: 4a696d6d792052c3b362c3a172), Family='Zámbó' (Hex: 5ac3a16d62c3b3)
-> Userdata ID 105317: First='papa' (Hex: 70617061), Family='zámbó ' (Hex: 7ac3a16d62c3b320)
Note the correct UTF-8 hex bytes: `c3b3` for 'ó', `c3a1` for 'á'.
Example showing data after corruption occurs (later requests under continued load, approx. 1m 16s later):
[DEBUG][2025-03-27T20:56:06.449Z] listUserdataByUserId userId: 20213746. Adatok KÖZVETLENÜL Sequelize után:
-> Userdata ID 44799: First='Jimmy Róbár' (Hex: 4a696d6d792052f362e172), Family='Zámbó' (Hex: 5ae16d62f3)
-> Userdata ID 105317: First='papa' (Hex: 70617061), Family='zámbó ' (Hex: 7ae16d62f320)
Note the incorrect bytes: `f3` appears instead of `c3b3`, and `e1` appears instead of `c3a1`. These correspond to the Latin-1 representations.
Client-side result (corresponding to the corrupted data): The client receives strings like "Jimmy Rbr", where the accented characters are lost or replaced by replacement characters (``), because the bytes f3
and `e1` are invalid in a UTF-8 context.
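The hex strings from the two log excerpts can be checked offline. The sketch below takes the `First` value of Userdata ID 44799 before and after corruption, tests each for UTF-8 validity with a strict `TextDecoder`, and then decodes the corrupted bytes both as Latin-1 and as lenient UTF-8. The helper name `isValidUtf8` is made up for illustration.

```javascript
// Hex strings copied from the log excerpts (Userdata ID 44799, First).
const beforeHex = '4a696d6d792052c3b362c3a172';
const afterHex = '4a696d6d792052f362e172';

// A TextDecoder created with fatal: true throws on invalid UTF-8 input.
function isValidUtf8(hex) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(Buffer.from(hex, 'hex'));
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(beforeHex)); // true
console.log(isValidUtf8(afterHex));  // false

// Read as Latin-1, the corrupted bytes map back to the intended text.
const asLatin1 = Buffer.from(afterHex, 'hex').toString('latin1');
console.log(asLatin1); // Jimmy Róbár

// Read as lenient UTF-8 (what a typical client does), each invalid byte
// becomes U+FFFD, the replacement character.
const asUtf8 = Buffer.from(afterHex, 'hex').toString('utf8');
console.log(asUtf8 === 'Jimmy R\ufffdb\ufffdr'); // true
```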
Diagnostic Steps Taken:
- Confirmed database, table, column, and connection character sets are `utf8mb4`.
- Confirmed `dialectOptions: { charset: 'utf8mb4' }` is used in Sequelize.
- Adding an explicit `SET NAMES utf8mb4 COLLATE utf8mb4_general_ci` via a Sequelize `afterConnect` hook did not resolve the issue.
- Logging data immediately after the Sequelize query returns confirmed that the byte corruption happens before further application code processing.
- Switching the Node.js version from v22.7.0 to v20.x LTS completely resolved the issue. The corruption no longer occurs under the identical load pattern when running on Node 20.
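For reference, the hex logging used in the diagnostics can be sketched roughly as follows. The helper names and the row field names (`id`, `first`, `family`) are assumptions for illustration and are not taken from the actual application code. Note that the hex here is computed by encoding the JavaScript string with `Buffer.from(..., 'utf8')`, so it reflects Node's own string-to-bytes encoding as well as whatever the driver returned.

```javascript
// Hex of a value's UTF-8 encoding, as produced by Node's Buffer.
function hexOf(value) {
  return Buffer.from(String(value), 'utf8').toString('hex');
}

// Format one row in the same layout as the log lines in this report.
// The field names are hypothetical.
function formatRow(row) {
  return (
    `-> Userdata ID ${row.id}: ` +
    `First='${row.first}' (Hex: ${hexOf(row.first)}), ` +
    `Family='${row.family}' (Hex: ${hexOf(row.family)})`
  );
}

// Sample row mirroring Userdata ID 105317 from the logs.
const line = formatRow({ id: 105317, first: 'papa', family: 'z\u00e1mb\u00f3 ' });
console.log(line);
```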
Possible Cause:
This pattern suggests a bug within the `mysql2` driver itself, or a problematic interaction between `mysql2` and Node.js v22.7.0, specifically in parsing or handling the MySQL data stream encoding under high concurrency. The resolution upon switching to Node 20 LTS reinforces this suspicion.
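One way to help narrow down which layer is responsible is to repeat the string-to-bytes encoding in a tight loop with no database or driver involved, and check whether the bytes ever change. This is only a sketch of such a check, not a confirmed diagnosis; the function name and iteration count are arbitrary choices.

```javascript
// Encode the same string repeatedly and return the index of the first
// iteration whose bytes differ from the expected UTF-8 hex, or -1 if none do.
function findFirstEncodingMismatch(text, expectedHex, iterations = 200000) {
  for (let i = 0; i < iterations; i++) {
    if (Buffer.from(text, 'utf8').toString('hex') !== expectedHex) {
      return i;
    }
  }
  return -1;
}

// 'R\u00f3bert' is 'Róbert'; its UTF-8 encoding is 52 c3 b3 62 65 72 74.
const mismatchAt = findFirstEncodingMismatch('R\u00f3bert', '52c3b362657274');
console.log(mismatchAt); // -1 means the encoding stayed stable for every iteration
```

If this reports a mismatch on v22.7.0 but not on v20.x, the problem would likely sit below `mysql2`; if it never does, the driver or its interaction with the runtime remains the more likely suspect.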