MONGODB SYSTEM ALERT:

Node.js v22.7.0 runtime defect can lead to data encoding issues

On August 22, 2024, Node.js v22.7.0 introduced an incorrect optimization for buffer.write that can result in strings being encoded as ISO-8859-1 rather than UTF-8.

This issue is not within any of MongoDB’s products, but developers using MongoDB’s Node.js driver with Node.js v22.7.0 could experience data integrity issues. The use of the fast API for buffer.write is disabled in Node.js v22.8.0.

This issue only manifests if the following conditions are true:

  • Node.js v22.7.0 is being used
  • Documents are being written to MongoDB
  • Those documents contain characters (for example, é) that are encodable with ISO-8859-1 but not with ASCII. Chinese and Japanese text is therefore not impacted, because it is not encodable using ISO-8859-1
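
The third condition can be screened for mechanically. The following is a minimal sketch, not part of any MongoDB tooling, that flags strings containing at least one code point between U+0080 and U+00FF, which is the range ISO-8859-1 can encode but ASCII cannot. The function name `hasLatin1OnlyChars` is hypothetical, and the check is a conservative screen: it does not model the runtime internals that decide whether a given write is actually affected.

```javascript
// Hypothetical helper: true when `text` contains a character that
// ISO-8859-1 can encode but ASCII cannot (U+0080 through U+00FF).
function hasLatin1OnlyChars(text) {
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (codePoint >= 0x80 && codePoint <= 0xff) {
      return true;
    }
  }
  return false;
}

console.log(hasLatin1OnlyChars("cafe"));      // false: pure ASCII
console.log(hasLatin1OnlyChars("caf\u00e9")); // true: é is U+00E9
console.log(hasLatin1OnlyChars("\u65e5\u672c\u8a9e")); // false: Japanese text is outside ISO-8859-1
```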

As of September 3, 2024, Node.js v22.8.0 is available and contains a fix for the UTF-8 encoding issue present in Node.js v22.7.0.

Mitigating the Issue

To avoid potential data integrity issues caused by this bug in the Node.js runtime, do not use Node.js v22.7.0 at all.

MongoDB recommends only using Node.js runtime versions documented as compatible in production environments. At the time of writing, Node.js v22.x is not considered a compatible runtime for use with the MongoDB Node.js driver.
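
One way to enforce this is a guard that runs before the application opens any database connection. The sketch below is an assumption about how such a guard might look, not an official MongoDB recommendation; `isAffectedNodeVersion` and `AFFECTED_NODE_VERSION` are hypothetical names. It relies only on `process.versions.node`, which Node.js exposes as a version string without a leading "v".

```javascript
// Hypothetical startup guard: refuse to run on the affected runtime.
const AFFECTED_NODE_VERSION = "22.7.0";

function isAffectedNodeVersion(version) {
  // process.versions.node has no leading "v", but accept one anyway.
  return version.replace(/^v/, "") === AFFECTED_NODE_VERSION;
}

if (isAffectedNodeVersion(process.versions.node)) {
  console.error(
    `Node.js v${AFFECTED_NODE_VERSION} has a known buffer.write encoding defect; refusing to start.`
  );
  process.exit(1);
}
```

Placing the check at the top of the entry point means the process exits before any document can be written.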

Understanding the Issue

To illustrate how this can occur, consider the following reproduction:

 

When run using a previous version of Node.js, the Buffer length is evaluated consistently across 20K iterations, and a document is inserted into a MongoDB collection and then successfully retrieved.

 

When the same reproduction is run using Node.js v22.7.0, however, invalid UTF-8 string data can be produced and inserted into the MongoDB collection, causing subsequent retrieval attempts to fail.
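
The encoding half of this behavior can be sketched without a database. The following is not the original reproduction; it is a minimal illustration that calls buffer.write repeatedly and counts writes whose output is not the expected UTF-8 bytes. The function name `countEncodingMismatches` is hypothetical, the 20,000 iteration count mirrors the figure quoted above, and whether the loop triggers the defect on Node.js v22.7.0 depends on when the runtime optimizes the call.

```javascript
// Sketch only: exercises buffer.write in a hot loop, without MongoDB.
// "\u00e9" (é) needs 2 bytes in UTF-8 but only 1 byte in ISO-8859-1.
const sample = "\u00e9";
const expectedBytes = Buffer.byteLength(sample, "utf8");

function countEncodingMismatches(iterations) {
  const scratch = Buffer.alloc(8);
  let mismatches = 0;
  for (let i = 0; i < iterations; i++) {
    scratch.fill(0);
    const written = scratch.write(sample, 0, "utf8");
    // Correct UTF-8 output for é is the byte pair 0xC3 0xA9.
    const ok =
      written === expectedBytes && scratch[0] === 0xc3 && scratch[1] === 0xa9;
    if (!ok) mismatches++;
  }
  return mismatches;
}

const mismatches = countEncodingMismatches(20000);
console.log(`${process.version}: ${mismatches} mismatched writes out of 20000`);
```

On an unaffected runtime the count should be zero; any non-zero count means some writes did not produce UTF-8.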

 

Though MongoDB’s Node.js driver supports UTF-8 validation, that feature applies to decoding BSON strings that are being received from the MongoDB server. As the bug in Node.js v22.7.0 occurs when encoding strings as UTF-8, the invalid data can still be serialized to BSON and written to the database.
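
Why decode-side validation catches the problem only after the fact can be illustrated with the standard `TextDecoder` API. This sketch does not use the driver's own validator; `isValidUtf8` is a hypothetical helper, and the corrupted bytes are produced deliberately with the "latin1" encoding to stand in for what the defective write would emit.

```javascript
// Illustration with the standard TextDecoder, not the driver's own validator.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

const correct = Buffer.from("caf\u00e9", "utf8");     // 63 61 66 c3 a9
const corrupted = Buffer.from("caf\u00e9", "latin1"); // 63 61 66 e9

console.log(isValidUtf8(correct));   // true
console.log(isValidUtf8(corrupted)); // false: a lone 0xE9 is not valid UTF-8
```

Both buffers are equally acceptable as raw bytes when written; the difference only surfaces when something later tries to read them back as UTF-8.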

Note that if you’ve installed mongosh via Homebrew for macOS, the underlying Node.js runtime may be Node.js v22.7.0, as Homebrew auto-upgrades to the latest Node.js version by default. If any data was written to the database from these mongosh instances, and that data contained non-ASCII characters, this encoding issue may have occurred.

Identifying Data Integrity Issues in Active Clusters

MongoDB’s Commercial Support team maintains replica set consistency and remediation tools which can be used in the event of data corruption.

To determine whether your data is impacted, the validate.js script can be used with MongoDB 7.0 or greater as follows:

 

If the output contains entries similar to the following, invalid BSON documents have been detected in those collections.

 

This failure is logged in mongod.log as follows and can be found by grepping the logs:

 

For releases prior to MongoDB 7.0, a script can be used to identify documents containing strings that were incorrectly encoded due to the Node.js issue. An example Python script that uses PyMongo is available as detection.py and has been tested against MongoDB 4.2.25, 4.4.29, 5.0.28, and 6.0.16. Its output includes the _id, database, and collection of each affected document.

 

Running the detection script also produces a CSV file called to_fix.csv containing the same information as its output.

Impacts to Backups

If a backup was taken while incorrectly encoded data was present, future restores of that backup will contain this issue. When restoring a backup taken from this period, you must execute the detection and remediation procedure to ensure that the invalid encoding is repaired.

Remediation

Running the detection.py script produces a to_fix.csv file that can then be used to remediate the issue manually or by running the fix.py script. It is recommended that you treat the script as an example and adapt it to your environment.
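
For manual remediation, the core repair step can be sketched as follows. This is not the fix.py script; it is a hypothetical illustration that assumes the raw bytes of an affected string are available and that invalid UTF-8 was produced by ISO-8859-1 encoding, as described above. The function name `recoverText` is an assumption.

```javascript
// Hypothetical repair step, not the fix.py script: given the raw bytes of a
// string field, return the text they were meant to represent.
function recoverText(bytes) {
  try {
    // Valid UTF-8 is assumed to be unaffected and is returned unchanged.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // Otherwise assume the bytes are ISO-8859-1, per the defect description.
    return Buffer.from(bytes).toString("latin1");
  }
}

const damaged = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" as ISO-8859-1
const repaired = recoverText(damaged);

console.log(repaired);                      // café
console.log(Buffer.from(repaired, "utf8")); // <Buffer 63 61 66 c3 a9>
```

The approach has a known limit: an ISO-8859-1 byte sequence that happens to also be valid UTF-8 is returned as decoded UTF-8 and is not repaired, which is one reason to cross-reference repaired values against other records.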

If you are concerned about the impact of this issue, we recommend that you cross reference your data with any other records that can help verify your data integrity. For any further questions, please open a support case or start a chat with the Atlas Support team.
