Problem:
The mongo documentation explains schema validation and the ability to validate existing documents in two places:
To perform validation checks on existing documents, use the validate command or the db.collection.validate() shell helper.
db.collection.validate() also validates any documents that violate the collection’s schema validation rules.
The output of validate() should then, in theory, report the number of documents that are invalid under the collection's current validator settings. This, however, is not what I observe. After adding a validator to a collection that already contains invalid documents, validate() still reports that all documents are valid.
In v5 and later it does report a warning and says to check the logs, but in v4 no information is reported at all.
Reproduce:
- Start a mongo instance
docker run \
--name mongo5_0_validator_test \
-d \
--env=MONGO_INITDB_ROOT_USERNAME=admin \
--env=MONGO_INITDB_ROOT_PASSWORD=password \
mongo:5.0
- Connect to instance via CLI
docker exec -it mongo5_0_validator_test mongo --username admin --password password
- Create test data set, validator, and run validate
use test
db.col1.insert({ name: "joe" });
db.col1.insert({ namE: "bre" });
db.col1.insert({ test: "poe" });
db.runCommand({
  collMod: "col1",
  validator: {
    $jsonSchema: {
      bsonType: "object",
      properties: {
        "_id": { bsonType: "objectId" },
        name: { bsonType: "string", description: "test" }
      },
      additionalProperties: false
    }
  },
  validationLevel: "strict",
  validationAction: "error"
});
db.col1.insert({ name: "jack" });
db.col1.insert({ namE: "this throws an error because 'namE' is not a defined property, we still only have 4 documents now, 2 existing are invalid to the schema pre validator addition." });
db.col1.validate()
- Observe validate output
mongo v5 output
In v5 of mongo we at least get a warning telling us that some documents are actually invalid, but nInvalidDocuments is still 0.
{
  "ns" : "test.col1",
  "nInvalidDocuments" : 0,
  "nrecords" : 4,
  "nIndexes" : 1,
  "keysPerIndex" : {
    "_id_" : 4
  },
  "indexDetails" : {
    "_id_" : {
      "valid" : true
    }
  },
  "valid" : true,
  "repaired" : false,
  "warnings" : [
    "Detected one or more documents not compliant with the collection's schema. See logs."
  ],
  "errors" : [ ],
  "extraIndexEntries" : [ ],
  "missingIndexEntries" : [ ],
  "corruptRecords" : [ ],
  "ok" : 1
}
mongo v4 output
You’ll notice in v4 there isn’t even any hint that anything may be wrong with the documents.
{
  "ns" : "test.col1",
  "nInvalidDocuments" : 0,
  "nrecords" : 4,
  "nIndexes" : 1,
  "keysPerIndex" : {
    "_id_" : 4
  },
  "indexDetails" : {
    "_id_" : {
      "valid" : true
    }
  },
  "valid" : true,
  "warnings" : [ ],
  "errors" : [ ],
  "extraIndexEntries" : [ ],
  "missingIndexEntries" : [ ],
  "ok" : 1
}
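To make the difference between the two outputs concrete, here is a small sketch of checking a validate() result from a script. The helper name hasSchemaWarning is made up for illustration and is not part of the shell, and matching on the word "schema" is an assumption based only on the v5 warning text shown above.

```javascript
// Hypothetical helper: reports whether a validate() result hints at
// schema-noncompliant documents. Matching on the word "schema" is an
// assumption based on the v5 warning text quoted above.
function hasSchemaWarning(validateResult) {
  const warnings = validateResult.warnings || [];
  return warnings.some((w) => /schema/i.test(w));
}

// Trimmed copies of the two outputs above.
const v5Result = {
  nInvalidDocuments: 0,
  valid: true,
  warnings: [
    "Detected one or more documents not compliant with the collection's schema. See logs.",
  ],
};
const v4Result = { nInvalidDocuments: 0, valid: true, warnings: [] };

const v5Flagged = hasSchemaWarning(v5Result); // true: only the warning text gives it away
const v4Flagged = hasSchemaWarning(v4Result); // false: nothing in the output to go on
```

In the shell this would be called as hasSchemaWarning(db.col1.validate()). Either way it only says that something is wrong, not which documents or how many.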
Expected Behavior
I expect, based on the mongo documentation, that nInvalidDocuments would report the number of documents that failed validation. There does not appear to be any good way to identify the invalid documents without looking at the database logs, which is not very useful.
The Question to Answer
How can you determine all existing invalid documents in a collection that has had a validator added or updated?
Is there any way to iterate over the collection and validate each document in lieu of this behavior, especially if this behavior is actually expected?
Possible solutions
The only idea we had in the discord conversation was to create an entirely new collection with the new validator on it and bulk insert the old collection into it to see what fails. This is obviously not an ideal scenario.
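For reference, that copy-and-see-what-fails idea might look roughly like the sketch below. The function name findRejectedByCopy and the scratch collection name are hypothetical. It inserts one document at a time instead of a true bulk insert so each failure can be tied to an _id, and it assumes insertOne throws when the validator rejects a document.

```javascript
// Sketch of the "copy into a scratch collection" idea. `db` is the shell's
// database handle; the function name and `scratchName` are hypothetical.
function findRejectedByCopy(db, sourceName, scratchName, validator) {
  // Create an empty scratch collection that enforces the validator strictly.
  db.createCollection(scratchName, {
    validator: validator,
    validationLevel: "strict",
    validationAction: "error",
  });
  const scratch = db.getCollection(scratchName);
  const rejected = [];

  db.getCollection(sourceName).find().forEach((doc) => {
    try {
      // Assumption: insertOne throws when the validator rejects the document.
      scratch.insertOne(doc);
    } catch (e) {
      // Note: any other insert failure would also end up in this list.
      rejected.push({ _id: doc._id, reason: e.errmsg || e.message });
    }
  });

  scratch.drop(); // remove the scratch copy again
  return rejected;
}
```

Usage would be something like findRejectedByCopy(db, "col1", "col1_scratch", validator), with validator being the same object passed to collMod. It copies the whole collection just to find the failures, which is why it is not ideal.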
It is also not great or appropriate for application logic to do the validation since there is no straightforward way to go from mongo’s custom bson jsonSchema to whatever language/driver you’re using.
I also did not find anything on validating a single document. Only the full-collection validate method exists, which would rule out some kind of pipeline/match to figure out all invalid documents.
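One avenue that may still be worth testing: $jsonSchema is documented as a query operator as well as a validator expression, so negating the validator with $nor in a find() might serve as a per-document check. A minimal sketch under that assumption follows. The helper name invalidDocsFilter is hypothetical, and the query has not been run against the v4 and v5 instances above.

```javascript
// Hypothetical helper: wraps a validator expression so that a query matches
// the documents that do NOT satisfy it.
function invalidDocsFilter(validator) {
  return { $nor: [validator] };
}

// The same validator as in the collMod command above.
const validator = {
  $jsonSchema: {
    bsonType: "object",
    properties: {
      _id: { bsonType: "objectId" },
      name: { bsonType: "string", description: "test" },
    },
    additionalProperties: false,
  },
};

const filter = invalidDocsFilter(validator);
// In the shell (assumption, untested here): db.col1.find(filter)
// If the assumption holds, this should list the documents that violate
// the validator, including ones inserted before it was added.
```

The validator currently set on a collection could be read back with db.getCollectionInfos({ name: "col1" })[0].options.validator rather than being retyped.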
Previous Conversations:
(I can’t put valid links here because the forums block me; I figured linking the docs above was a better use of my 2 possible links…)
- I originally asked in the community discord, some discussion and confirmation from another member experiencing the same behavior: https (colon) // discord (dot) com/channels/714857985389625415/714857985943535659/979451499409190912
- An old forum post that looks like a similar issue, asking what is expected from the validate call: https (colon) // www.mongodb (dot) com/community/forums/t/performing-document-validation-from-a-driver-and-db-collection-validate-return-issues/128228