UTF-8 Encoding Corruption (e.g., c3b3 -> f3) under High Concurrent Load with Node.js v22.7.0 #3504

Open
@rudulf71

Environment:

  • Node.js:
    • Failing: v22.7.0
    • Working: v20.x LTS [Add exact version tested, e.g., v20.11.1]
  • mysql2: [Add exact version used, e.g., 3.9.x - check via npm list mysql2]
  • Sequelize: [Add exact version used, e.g., 6.35.x - check via npm list sequelize]
  • Database: Tested against both an old MySQL Server (version unknown, but confirmed working correctly with PHP/Symfony using utf8mb4) AND a new/current MySQL Server (version [Add version if known, e.g., 8.x]) - the issue occurs identically with both servers. Database, tables, and columns are configured for utf8mb4 / utf8mb4_general_ci.
  • OS: macOS [Add version if known, e.g., Sonoma 14.x]
  • Other: Using Express.js framework.

Bug Description:

When the Node.js application (using Sequelize with mysql2) is subjected to high-frequency concurrent requests (approx. 100+ requests in a short burst, triggered programmatically from a React client), character encoding corruption occurs when reading utf8mb4 data from the MySQL database.

Specifically, UTF-8 multi-byte characters representing accented letters (e.g., 'ó' - bytes c3 b3, 'á' - bytes c3 a1) are incorrectly returned by the driver/Sequelize layer as single bytes corresponding to their apparent ISO-8859-1 (Latin-1) representation (e.g., byte f3 is returned instead of c3 b3, byte e1 instead of c3 a1).
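The byte values quoted above can be checked with Node's built-in Buffer API. This is only an illustrative sketch; the helper name `hexIn` is hypothetical and does not come from the application in this report.

```javascript
// Hypothetical helper: hex of `text` when encoded with the given Node.js encoding.
function hexIn(text, encoding) {
  return Buffer.from(text, encoding).toString('hex');
}

console.log(hexIn('ó', 'utf8'));   // c3b3 (two-byte UTF-8 sequence)
console.log(hexIn('ó', 'latin1')); // f3   (single Latin-1 byte)
console.log(hexIn('á', 'utf8'));   // c3a1
console.log(hexIn('á', 'latin1')); // e1
```

The single bytes seen in the corrupted output match the `latin1` column exactly, which is why the corruption looks like a UTF-8 to Latin-1 collapse.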

This happens progressively. Initial requests under load return the correct UTF-8 bytes, but subsequent requests start returning the incorrect single bytes. This corrupted data, when sent to a client expecting UTF-8, results in replacement characters (e.g., ``) being displayed, turning "Jimmy Róbár" into "Jimmy Rbr".

The issue is consistently reproducible under this specific high-frequency/concurrent load pattern originating from a programmatic client (React app). It does not occur with single requests or slow, manually issued sequential requests (e.g., from Postman, manual browser refreshes), nor does it depend on the MySQL server version (occurs with both old and new servers).

Steps to Reproduce (Conceptual):

  1. Set up a Node.js v22.7.0 application using the specified versions of Sequelize and mysql2.
  2. Connect to a MySQL database containing utf8mb4 data with multi-byte accented characters (e.g., Hungarian names like 'Róbert', 'Zámbó'). Ensure Sequelize dialectOptions: { charset: 'utf8mb4' } is configured.
  3. Create a simple API endpoint that reads this data via Sequelize (e.g., using Model.findAll or Model.findByPk).
  4. Use a client application or load testing script to send a high number of concurrent or very rapid sequential requests (e.g., 100+ within a few seconds) to this endpoint.
  5. Log the retrieved string data and its hex representation on the server-side immediately after the Sequelize query returns.
  6. Observe the hex values in the logs – initial requests should show correct UTF-8 bytes (e.g., c3b3 for 'ó'), while later requests under continued load will show incorrect bytes (e.g., f3).
  7. Note: Providing a minimal, self-contained reproducible code example is difficult due to the load-dependent nature.
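Steps 4 and 5 could be approximated with something like the sketch below. It is not the reporter's code: `runBurst` and `withHex` are hypothetical helpers, the URL in the commented wiring is an assumed placeholder, and computing the hex from the UTF-8 encoding of the string is an assumption about how the logged hex values were produced.

```javascript
// Hypothetical helper for step 4: start `count` calls of `task` at once
// and wait for all of them to finish.
async function runBurst(task, count) {
  const pending = Array.from({ length: count }, (_, index) => task(index));
  return Promise.all(pending);
}

// Hypothetical helper for step 5: format a value together with the hex of
// its UTF-8 encoding, in the same shape as the log excerpts in this report.
function withHex(label, value) {
  const hex = Buffer.from(value, 'utf8').toString('hex');
  return `${label}='${value}' (Hex: ${hex})`;
}

// Example wiring (assumed URL; adjust to the real endpoint):
// const bodies = await runBurst(
//   () => fetch('http://localhost:3000/api/userdata/20213746').then((res) => res.text()),
//   100,
// );
// bodies.forEach((body, i) => console.log(i, withHex('Body', body)));
```

On the server side, `withHex` would be called on each string field right after the Sequelize query resolves, so that the first and later requests of a burst can be compared line by line.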

Expected Behavior:

The data retrieved via Sequelize/mysql2 should consistently maintain the correct utf8mb4 encoding, preserving accented characters (e.g., "Róbert" with bytes c3 b3).

Actual Behavior:

Under high-frequency/concurrent load on Node.js v22.7.0, the application starts receiving corrupted data from the Sequelize/mysql2 layer where UTF-8 bytes are replaced by their Latin-1 equivalents.

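Log Evidence (from server-side logging immediately after Sequelize.findAll for userId: 20213746; the Hungarian phrase "Adatok KÖZVETLENÜL Sequelize után" in the log lines means "Data DIRECTLY after Sequelize"):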

Example showing data before corruption occurs (initial requests under load):

[DEBUG][2025-03-27T20:54:50.515Z] listUserdataByUserId userId: 20213746. Adatok KÖZVETLENÜL Sequelize után:
 -> Userdata ID 44799: First='Jimmy Róbár' (Hex: 4a696d6d792052c3b362c3a172), Family='Zámbó' (Hex: 5ac3a16d62c3b3)
 -> Userdata ID 105317: First='papa' (Hex: 70617061), Family='zámbó ' (Hex: 7ac3a16d62c3b320)

Note the correct UTF-8 hex bytes: c3b3 for 'ó', c3a1 for 'á'.

Example showing data after corruption occurs (later requests under continued load, approx. 1m 16s later):

[DEBUG][2025-03-27T20:56:06.449Z] listUserdataByUserId userId: 20213746. Adatok KÖZVETLENÜL Sequelize után:
 -> Userdata ID 44799: First='Jimmy Róbár' (Hex: 4a696d6d792052f362e172), Family='Zámbó' (Hex: 5ae16d62f3)
 -> Userdata ID 105317: First='papa' (Hex: 70617061), Family='zámbó ' (Hex: 7ae16d62f320)

Note the incorrect bytes: f3 appears instead of c3b3, and e1 appears instead of c3a1. These are the single-byte Latin-1 (ISO-8859-1) encodings of 'ó' (U+00F3) and 'á' (U+00E1).

Client-side result (corresponding to the corrupted data): The client receives strings like "Jimmy Rbr", where the accented characters are lost or replaced by replacement characters (``), because the bytes f3 and `e1` are invalid in a UTF-8 context.
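The client-side effect can be reproduced offline from the logged hex values alone. The sketch below decodes the two hex strings for Userdata ID 44799 (First) with Node's Buffer API; `decodeHex` is a hypothetical helper name.

```javascript
// Hex strings copied from the two log excerpts for Userdata ID 44799 (First).
const correctHex = '4a696d6d792052c3b362c3a172';
const corruptedHex = '4a696d6d792052f362e172';

// Hypothetical helper: decode a hex string using the given Node.js encoding.
function decodeHex(hex, encoding) {
  return Buffer.from(hex, 'hex').toString(encoding);
}

console.log(decodeHex(correctHex, 'utf8'));     // Jimmy Róbár
console.log(decodeHex(corruptedHex, 'utf8'));   // "Jimmy R", U+FFFD, "b", U+FFFD, "r"
console.log(decodeHex(corruptedHex, 'latin1')); // Jimmy Róbár
```

Decoding the corrupted bytes as Latin-1 recovers the original name, which is consistent with the bytes being Latin-1 encoded rather than randomly damaged.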

Diagnostic Steps Taken:

  • Confirmed database, table, column, and connection character sets are utf8mb4.
  • Confirmed dialectOptions: { charset: 'utf8mb4' } is used in Sequelize.
  • Adding an explicit SET NAMES utf8mb4 COLLATE utf8mb4_general_ci via a Sequelize afterConnect hook did not resolve the issue.
  • Logging data immediately after the Sequelize query returns confirmed that the byte corruption happens before further application code processing.
  • Switching the Node.js version from v22.7.0 to v20.x LTS completely resolved the issue. The corruption no longer occurs under the identical load pattern when running on Node 20.
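As a possible further diagnostic, note that in the corrupted log excerpt the printed string still reads 'Jimmy Róbár' while the hex shows f3 and e1. Logging the string's code points next to two independently computed UTF-8 encodings might help narrow down whether the string itself or the byte-encoding step differs under load. This is a sketch only: `describeString` is a hypothetical helper, and it assumes the logged hex came from a UTF-8 encode of the returned string.

```javascript
// Hypothetical diagnostic: describe a string three ways so the outputs can be compared.
function describeString(value) {
  return {
    // Unicode code points of the string itself, in hex (e.g. "f3" for 'ó').
    codePoints: Array.from(value, (ch) => ch.codePointAt(0).toString(16)).join(' '),
    // UTF-8 bytes as produced by Buffer.from.
    bufferHex: Buffer.from(value, 'utf8').toString('hex'),
    // UTF-8 bytes as produced by the WHATWG TextEncoder.
    textEncoderHex: Buffer.from(new TextEncoder().encode(value)).toString('hex'),
  };
}

console.log(describeString('ó'));
```

For a correctly decoded 'ó', the code point is f3 and both hex fields are expected to be c3b3. If the code points stay correct while one of the hex fields changes during a burst, the problem would lie in the encoding step rather than in the data returned by the query.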

Possible Cause:

This pattern suggests a bug within the mysql2 driver itself, or a problematic interaction between mysql2 and Node.js v22.7.0, specifically in how the MySQL data stream is parsed or decoded under high concurrency. The fact that switching to Node 20 LTS resolves the issue supports this suspicion.
