UTF-8 Encoding Corruption (e.g., c3b3 -> f3) under High Concurrent Load with Node.js v22.7.0 #3504

**Environment:**

* **Node.js:**
    * Failing: **v22.7.0**
    * Working: **v20.x LTS** [Add exact version tested, e.g., v20.11.1]
* **mysql2:** [Add exact version used, e.g., 3.9.x - check via `npm list mysql2`]
* **Sequelize:** [Add exact version used, e.g., 6.35.x - check via `npm list sequelize`]
* **Database:** Tested against both an **old MySQL Server** (version unknown, but confirmed working correctly with PHP/Symfony using `utf8mb4`) AND a **new/current MySQL Server** (version [Add version if known, e.g., 8.x]) - the issue occurs identically with both servers. Database, tables, and columns are configured for `utf8mb4` / `utf8mb4_general_ci`.
* **OS:** macOS [Add version if known, e.g., Sonoma 14.x]
* **Other:** Using Express.js framework.

**Bug Description:**

When the Node.js application (using Sequelize with `mysql2`) is subjected to high-frequency concurrent requests (approx. 100+ requests in a short burst, triggered programmatically from a React client), character encoding corruption occurs when reading `utf8mb4` data from the MySQL database.

Specifically, UTF-8 multi-byte characters representing accented letters (e.g., 'ó' - bytes `c3 b3`, 'á' - bytes `c3 a1`) are incorrectly returned by the driver/Sequelize layer as single bytes corresponding to their apparent ISO-8859-1 (Latin-1) representation (e.g., byte `f3` is returned instead of `c3 b3`, byte `e1` instead of `c3 a1`).
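
The byte relationship described here can be checked directly in Node.js. The following is a minimal sketch using only the built-in `Buffer` API; the helper name `hexOf` is illustrative and not part of the application code.

```javascript
// Return the hex dump of `text` when encoded with the given encoding.
function hexOf(text, encoding) {
  return Buffer.from(text, encoding).toString('hex');
}

// '\u00f3' is 'ó' and '\u00e1' is 'á'; escapes keep the source file's own
// encoding out of the picture.
for (const ch of ['\u00f3', '\u00e1']) {
  console.log(`${ch}  utf8=${hexOf(ch, 'utf8')}  latin1=${hexOf(ch, 'latin1')}`);
}
```

On a correctly behaving runtime this reports `c3b3` / `f3` for 'ó' and `c3a1` / `e1` for 'á', which are exactly the two byte patterns seen in the logs.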

This happens progressively. Initial requests under load return the correct UTF-8 bytes, but subsequent requests start returning the incorrect single bytes. This corrupted data, when sent to a client expecting UTF-8, results in replacement characters (e.g., ``) being displayed, turning "Jimmy Róbár" into "Jimmy Rbr".

The issue is consistently reproducible under this specific high-frequency/concurrent load pattern originating from a programmatic client (React app). It does **not** occur with single or sequential requests (e.g., from Postman, manual browser refreshes), nor does it depend on the MySQL server version (occurs with both old and new servers).

**Steps to Reproduce (Conceptual):**

1.  Set up a Node.js **v22.7.0** application using the specified versions of Sequelize and `mysql2`.
2.  Connect to a MySQL database containing `utf8mb4` data with multi-byte accented characters (e.g., Hungarian names like 'Róbert', 'Zámbó'). Ensure Sequelize `dialectOptions: { charset: 'utf8mb4' }` is configured.
3.  Create a simple API endpoint that reads this data via Sequelize (e.g., using `Model.findAll` or `Model.findByPk`).
4.  Use a client application or load testing script to send a high number of concurrent or very rapid sequential requests (e.g., 100+ within a few seconds) to this endpoint.
5.  Log the retrieved string data and its hex representation on the server-side immediately after the Sequelize query returns.
6.  Observe the hex values in the logs – initial requests should show correct UTF-8 bytes (e.g., `c3b3` for 'ó'), while later requests under continued load will show incorrect bytes (e.g., `f3`).
7.  *Note:* Providing a minimal, self-contained reproducible code example is difficult due to the load-dependent nature.
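
Although a full reproduction is hard to package, the load pattern in step 4 could be approximated with a small script. This is only a sketch: the function names `burst` and `countDivergent` and the endpoint URL in the comment are hypothetical, and it compares raw response bytes on the client rather than the server-side hex logging described in step 5. It assumes the first response is a correct baseline, which matches the observation that early requests return valid data.

```javascript
// Fire `count` requests at once and collect the hex of each response body.
// `fetchImpl` defaults to the global fetch (available in Node 18+); it is a
// parameter so that a stub can be injected.
async function burst(url, count, fetchImpl = fetch) {
  const requests = Array.from({ length: count }, async (_, i) => {
    const res = await fetchImpl(url);
    const bytes = Buffer.from(await res.arrayBuffer());
    return { i, status: res.status, hex: bytes.toString('hex') };
  });
  return Promise.all(requests);
}

// Count the responses whose bytes differ from the first response.
function countDivergent(results) {
  if (results.length === 0) return 0;
  const baseline = results[0].hex;
  return results.filter((r) => r.hex !== baseline).length;
}

// Example with a hypothetical endpoint (left commented out so the file
// loads without a running server):
// burst('http://localhost:3000/userdata/20213746', 150)
//   .then((results) => console.log('divergent:', countDivergent(results)));
```

A non-zero count would indicate that identical requests started returning different bytes during the burst.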

**Expected Behavior:**

The data retrieved via Sequelize/`mysql2` should consistently maintain the correct `utf8mb4` encoding, preserving accented characters (e.g., "Róbert" with bytes `c3 b3`).

**Actual Behavior:**

Under high-frequency/concurrent load on Node.js v22.7.0, the application starts receiving corrupted data from the Sequelize/`mysql2` layer where UTF-8 bytes are replaced by their Latin-1 equivalents.

**Log Evidence (from server-side logging immediately after the Sequelize `findAll` call for `userId: 20213746`):**

Example showing data **before** corruption occurs (initial requests under load):
```
[DEBUG][2025-03-27T20:54:50.515Z] listUserdataByUserId userId: 20213746. Adatok KÖZVETLENÜL Sequelize után:
 -> Userdata ID 44799: First='Jimmy Róbár' (Hex: 4a696d6d792052c3b362c3a172), Family='Zámbó' (Hex: 5ac3a16d62c3b3)
 -> Userdata ID 105317: First='papa' (Hex: 70617061), Family='zámbó ' (Hex: 7ac3a16d62c3b320)
```
*Note the correct UTF-8 hex bytes: `c3b3` for 'ó', `c3a1` for 'á'. The Hungarian log text "Adatok KÖZVETLENÜL Sequelize után" means "Data DIRECTLY after Sequelize".*

Example showing data **after** corruption occurs (later requests under continued load, approx. 1m 16s later):
```
[DEBUG][2025-03-27T20:56:06.449Z] listUserdataByUserId userId: 20213746. Adatok KÖZVETLENÜL Sequelize után:
 -> Userdata ID 44799: First='Jimmy Róbár' (Hex: 4a696d6d792052f362e172), Family='Zámbó' (Hex: 5ae16d62f3)
 -> Userdata ID 105317: First='papa' (Hex: 70617061), Family='zámbó ' (Hex: 7ae16d62f320)
```
*Note the incorrect bytes: `f3` appears instead of `c3b3`, `e1` appears instead of `c3a1`. These correspond to Latin-1 representations.*
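
The two hex dumps above can be compared against both encodings of the expected string. A small sketch, assuming only the built-in `Buffer` API; `classifyHex` is an illustrative helper name, not something from the application.

```javascript
// Report whether an observed hex dump matches the UTF-8 or the Latin-1
// encoding of the expected text. For pure-ASCII text both encodings are
// identical, so 'utf8' is returned.
function classifyHex(expectedText, observedHex) {
  if (observedHex === Buffer.from(expectedText, 'utf8').toString('hex')) return 'utf8';
  if (observedHex === Buffer.from(expectedText, 'latin1').toString('hex')) return 'latin1';
  return 'unknown';
}

const first = 'Jimmy R\u00f3b\u00e1r'; // 'Jimmy Róbár'
console.log(classifyHex(first, '4a696d6d792052c3b362c3a172')); // hex from the first log
console.log(classifyHex(first, '4a696d6d792052f362e172'));     // hex from the second log
```

The first log's hex classifies as `utf8` and the second as `latin1`, consistent with the interpretation given in the notes.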

*Client-side result (corresponding to the corrupted data):* The client receives strings like "Jimmy Rbr", where the accented characters are lost or replaced by replacement characters (``), because the bytes `f3` and `e1` are invalid in a UTF-8 context.
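
The client-side effect can be reproduced by decoding the corrupted bytes from the log as UTF-8. This sketch uses the standard `TextDecoder`, which is assumed here to approximate how a browser treats the response body; the variable names are illustrative.

```javascript
// Bytes taken from the corrupted log entry for 'Jimmy Róbár'.
const corrupted = Buffer.from('4a696d6d792052f362e172', 'hex');

// Decoding as UTF-8 replaces each invalid sequence with U+FFFD.
const asUtf8 = new TextDecoder('utf-8').decode(corrupted);

// Decoding as Latin-1 recovers the intended text.
const asLatin1 = corrupted.toString('latin1');

console.log(JSON.stringify(asUtf8));
console.log(asLatin1);
```

Decoded as UTF-8, each of the two bytes becomes U+FFFD; decoded as Latin-1, the original name comes back intact.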

**Diagnostic Steps Taken:**

* Confirmed database, table, column, and connection character sets are `utf8mb4`.
* Confirmed `dialectOptions: { charset: 'utf8mb4' }` is used in Sequelize.
* Adding an explicit `SET NAMES utf8mb4 COLLATE utf8mb4_general_ci` via a Sequelize `afterConnect` hook did **not** resolve the issue.
* Logging data immediately after the Sequelize query returns confirmed that the byte corruption happens *before* further application code processing.
* **Switching the Node.js version from v22.7.0 to v20.x LTS completely resolved the issue.** The corruption no longer occurs under the identical load pattern when running on Node 20.

**Possible Cause:**

This pattern points to either a bug within the `mysql2` driver itself or a problematic interaction between `mysql2` and Node.js v22.7.0, specifically in how the encoding of the MySQL data stream is parsed or handled under high concurrency. The fact that switching to Node 20 LTS resolves the issue supports this suspicion.
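
One further check that might help separate the driver from the runtime is to encode a known string repeatedly without any database involved. This was not part of the diagnostic steps above. It assumes the hex values in the logs were produced with something like `Buffer.from(value).toString('hex')`, which the report does not show, and `findEncodingDrift` is a hypothetical helper name.

```javascript
// Re-encode one known string many times and report the first iteration
// whose bytes differ from the expected UTF-8 hex, or null if none does.
function findEncodingDrift(text, expectedHex, iterations) {
  for (let i = 0; i < iterations; i++) {
    const hex = Buffer.from(text, 'utf8').toString('hex');
    if (hex !== expectedHex) return { iteration: i, hex };
  }
  return null;
}

// 'Z\u00e1mb\u00f3' is 'Zámbó'; the expected hex is taken from the first log.
const drift = findEncodingDrift('Z\u00e1mb\u00f3', '5ac3a16d62c3b3', 1_000_000);
console.log(drift === null ? 'no drift observed' : drift);
```

If this ever reported a drift on v22.7.0 but not on v20.x, the encoding step itself, rather than the database read, would be worth investigating first.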
