MONGODB SYSTEM ALERT:

Node.js v22.7.0 runtime defect can lead to data encoding issues

On August 22, 2024, Node.js v22.7.0 introduced an incorrect optimization for buffer.write that can result in strings being encoded using ISO-8859-1 (Latin-1) rather than UTF-8.

Though this issue does not originate within any of MongoDB’s products, developers using MongoDB’s Node.js driver on Node.js v22.7.0 could experience data integrity issues. The fast API for buffer.write has been disabled in Node.js v22.8.0.
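
The difference between the two encodings is visible at the byte level. As an illustration (ours, not part of the advisory’s reproduction), the following shows what the defect amounts to: bytes written as if ISO-8859-1 had been requested where UTF-8 was intended:

// Compare the byte output of the two encodings for a non-ASCII character.
const utf8 = Buffer.alloc(2);
utf8.write('é', 'utf-8');    // writes 0xC3 0xA9 (2 bytes)
console.log(utf8);           // <Buffer c3 a9>

const latin1 = Buffer.alloc(1);
latin1.write('é', 'latin1'); // writes 0xE9 (1 byte, ISO-8859-1)
console.log(latin1);         // <Buffer e9>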

This issue manifests only if all of the following conditions are true:

  • Node.js v22.7.0 is being used
  • Documents are being written to MongoDB
  • Those documents contain certain characters (e.g., é) that are encodable with ISO-8859-1, but not ASCII. As a result, Chinese and Japanese text is not impacted, as it is not encodable using ISO-8859-1

As of September 3, 2024, Node.js v22.8.0 is available and contains a fix for the UTF-8 encoding issue present in Node.js v22.7.0.

Mitigating the Issue

To avoid potential data integrity issues due to this bug in the Node.js runtime, it is recommended that Node.js v22.7.0 not be used at all.

MongoDB recommends only using Node.js runtime versions documented as compatible in production environments. At the time of writing, Node.js v22.x is not considered a compatible runtime for use with the MongoDB Node.js driver.
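
If upgrading cannot happen immediately, a defensive startup check can at least keep the affected version from running. The following is a minimal sketch of ours, not an official mitigation:

// Refuse to start on the affected Node.js release.
if (process.versions.node === '22.7.0') {
  throw new Error(
    'Node.js v22.7.0 has a known UTF-8 encoding defect (fixed in v22.8.0); ' +
    'refusing to start to avoid potential data corruption.'
  );
}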

Understanding the Issue

To illustrate how this can occur, consider the following reproduction:

import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017/test");
const value = 'bébé'; // 6 bytes in UTF-8, 4 in ISO-8859-1

async function run() {
  try {
    console.log(`Running Node ${process.versions.node}`);
    const coll = client.db("test").collection("foo");
    await coll.drop();

    let i = 0;
    // Exercise the UTF-8 encoder repeatedly so V8's fast-path optimization
    // for buffer.write kicks in; on v22.7.0 the optimized path mis-encodes
    // the string, its byte length drops from 6 to 4, and the loop exits early.
    while (Buffer.from(value).length === 6 && i < 20000) { i++ }

    await coll.insertOne({ _id: 1, message: value });
    const doc = await coll.findOne({ _id: 1 });
    console.log(`Found doc ${JSON.stringify(doc)}`);
  } finally {
    await client.close();
  }
}
run().catch(console.dir);

When run using a previous version of Node.js, the Buffer length consistently evaluates to 6 bytes for all 20,000 iterations; the document is then inserted into the MongoDB collection and successfully retrieved.

Running Node 22.6.0
Found doc {"_id":1,"message":"bébé"}

When the same reproduction is run using Node.js v22.7.0, however, invalid UTF-8 string data can be produced and inserted into the MongoDB collection, causing subsequent retrieval attempts to fail.

Running Node 22.7.0
BSONError: Invalid UTF-8 string in BSON document
    at parseUtf8 (/Users/alex/temp/test-node/node_modules/bson/lib/bson.cjs:148:19)
    at Object.toUTF8 (/Users/alex/temp/test-node/node_modules/bson/lib/bson.cjs:273:21)
    ... 6 lines matching cause stack trace ...
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async Collection.findOne (/Users/alex/temp/test-node/node_modules/mongodb/lib/collection.js:274:21) {
  [cause]: TypeError: The encoded data was not valid for encoding utf-8
      at TextDecoder.decode (node:internal/encoding:443:16)
      at parseUtf8 (/Users/alex/temp/test-node/node_modules/bson/lib/bson.cjs:145:37)
      at Object.toUTF8 (/Users/alex/temp/test-node/node_modules/bson/lib/bson.cjs:273:21)
      at deserializeObject (/Users/alex/temp/test-node/node_modules/bson/lib/bson.cjs:2952:31)
      at internalDeserialize (/Users/alex/temp/test-node/node_modules/bson/lib/bson.cjs:2863:12)
      at Object.deserialize (/Users/alex/temp/test-node/node_modules/bson/lib/bson.cjs:4335:12)
      at OnDemandDocument.toObject (/Users/alex/temp/test-node/node_modules/mongodb/lib/cmap/wire_protocol/on_demand/document.js:208:28)
      at CursorResponse.shift (/Users/alex/temp/test-node/node_modules/mongodb/lib/cmap/wire_protocol/responses.js:207:35)
      at FindCursor.next (/Users/alex/temp/test-node/node_modules/mongodb/lib/cursor/abstract_cursor.js:222:41)
      at process.processTicksAndRejections (node:internal/process/task_queues:105:5) {
    code: 'ERR_ENCODING_INVALID_ENCODED_DATA'
  }
}

Though MongoDB’s Node.js driver supports UTF-8 validation, that feature applies to decoding BSON strings received from the MongoDB server. Because the bug in Node.js v22.7.0 occurs when encoding strings as UTF-8, the invalid data can still be serialized to BSON and written to the database.
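
That decode-time validation is what raises the BSONError shown in the trace above. In recent driver versions it is controlled by the enableUtf8Validation option (on by default); for example:

import { MongoClient } from "mongodb";

// UTF-8 validation applies when decoding BSON received from the server;
// it is enabled by default and can be set explicitly at the client level.
const client = new MongoClient("mongodb://localhost:27017/test", {
  enableUtf8Validation: true,
});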

Note that if you’ve installed mongosh via Homebrew on macOS, the underlying Node.js runtime may be v22.7.0, as Homebrew upgrades to the latest Node.js version by default. If any data containing non-ASCII characters was written to the database from such mongosh instances, this encoding issue may have occurred.
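
One way to check which Node.js runtime a mongosh installation embeds is to query the Node.js process global, which mongosh scripts can access; a quick check (assuming a recent mongosh):

mongosh --nodb --eval "process.versions.node"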

Identifying Data Integrity Issues in Active Clusters

MongoDB’s Commercial Support team maintains replica set consistency and remediation tools which can be used in the event of data corruption.

To determine whether your data is impacted on MongoDB 7.0 or greater, the validate.js script can be used as follows:

curl -O "https://raw.githubusercontent.com/mongodb/support-tools/master/replset-consistency/validate.js" && \
  mongosh "mongodb://localhost:27017/test" validate.js --eval "validateFull=true" 2>&1 | \
  tee results.json | grep -v "warnings\":\[\]"

If the output contains entries similar to the following (note the nonzero nNonCompliantDocuments count and the accompanying warning), then invalid BSON documents have been detected in those collections.

{"ns":"test.foo","uuid":{"$binary":{"base64":"b4//q+dUQrKN5LAQha+FIg==","subType":"04"}},"nInvalidDocuments":0,"nNonCompliantDocuments":1,"nrecords":1,"nIndexes":1,"keysPerIndex":{"_id_":1},"indexDetails":{"_id_":{"valid":true}},"valid":true,"repaired":false,"warnings":["Detected one or more documents in this collection not conformant to BSON specifications. For more info, see logs with log id 6825900"],"errors":[],"extraIndexEntries":[],"missingIndexEntries":[],"corruptRecords":[],"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1724865657,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"00"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1724865657,"i":1}}}

The corresponding failure is logged in mongod.log as follows and can be identified by grepping the logs for log id 6825900:

{"t":{"$date":"2024-08-30T04:37:46.769+00:00"},"s":"W", "c":"STORAGE", "id":6825900, "ctx":"conn7","msg":"Document is not conformant to BSON specifications","attr":{"recordId":"1","reason":{"code":378,"codeName":"NonConformantBSON","errmsg":"Found string that doesn't follow UTF-8 encoding. in element with field name 'validString' in object with _id: ObjectId('66d1221af66b999c954cd518')"}}}

For MongoDB releases prior to 7.0, a script can be used to identify documents containing strings that were incorrectly encoded due to the Node.js issue. An example Python script that leverages PyMongo is available as detection.py and has been tested against 4.2.25, 4.4.29, 5.0.28, and 6.0.16. Its output includes each document's _id along with its database and collection.

> python3 detection.py
Document with _id 66d53e17cd331a1c16e58d1b in myProject.documents needs fixing
Document with _id 66d54857211816027147a245 in myProject.documents needs fixing
Document with _id 66d54857211816027147a246 in myProject.documents needs fixing
Document with _id 66d54857211816027147a247 in myProject.documents needs fixing
Document with _id 66d54857211816027147a248 in myProject.documents needs fixing
Document with _id 66d54857211816027147a249 in myProject.documents needs fixing

The detection script also produces a CSV file called to_fix.csv containing the same information as the output above:

collection,_id
myProject.documents,"{""$oid"": ""66d53e17cd331a1c16e58d1b""}"
myProject.documents,"{""$oid"": ""66d54857211816027147a245""}"
myProject.documents,"{""$oid"": ""66d54857211816027147a246""}"
myProject.documents,"{""$oid"": ""66d54857211816027147a247""}"
myProject.documents,"{""$oid"": ""66d54857211816027147a248""}"
myProject.documents,"{""$oid"": ""66d54857211816027147a249""}"

Impacts to Backups

If a backup was taken while the incorrect encoding was in use, future restores of that backup will contain this issue. When restoring a backup taken from this period, you must execute the detection and remediation procedure to ensure the invalid encoding is repaired.

Remediation

Running the detection.py script produces a to_fix.csv file that can then be used to remediate the issue manually or by running the fix.py script. It is recommended that you treat these scripts as examples and adapt them to your environment.
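
For the manual route, the CSV can be turned into per-document queries for review. The following is a minimal sketch assuming the to_fix.csv layout shown above; both the parsing and the generated mongosh queries are our illustration, not part of the support tools:

import { readFileSync } from "node:fs";

// Read to_fix.csv (layout shown above) and print a mongosh query per entry
// so each flagged document can be inspected by hand.
const rows = readFileSync("to_fix.csv", "utf-8").trim().split("\n").slice(1);
for (const row of rows) {
  const comma = row.indexOf(",");
  const ns = row.slice(0, comma);        // e.g. myProject.documents
  const idField = row.slice(comma + 1);  // quoted Extended JSON _id
  const { $oid } = JSON.parse(idField.slice(1, -1).replaceAll('""', '"'));
  const [db, ...coll] = ns.split(".");
  console.log(`db.getSiblingDB("${db}")["${coll.join(".")}"].find({ _id: ObjectId("${$oid}") })`);
}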

If you are concerned about the impact of this issue, we recommend that you cross-reference your data with any other records that can help verify your data integrity. For any further questions, please open a support case or start a chat with the Atlas Support team.