Hi,
I have a list of _id values and I need to check whether they already exist in a collection. Specifically, I need the _id values that are not in the database. Is there any way I can achieve this without fetching the list from the database and comparing it with my list in backend code?
Hello @sandeep_s1,
Refer to this similar question,
Thanks for the suggestion
I had another approach…
Basically, we create a new collection called Dual containing a single dummy record; all it does is provide a way to inject a record for each of the IDs we're searching for. In the code below, replace ['D', 'C'] with your array of IDs to search for:
var matchList = ['D', 'C']
db.getCollection("Test").aggregate([
  // Keep only the _id of every document in the main collection
  { $project: { _id: 1 } },
  // Append one document per ID in the search list
  {
    $unionWith: {
      coll: 'Dual',
      pipeline: [
        { $project: { _id: matchList } },
        { $unwind: '$_id' }
      ]
    }
  },
  // Count how many times each ID appears across both streams
  { $group: { _id: '$_id', total: { $sum: 1 } } },
  // IDs seen exactly once were not matched
  { $match: { total: 1 } }
])
What this does is get all the IDs from your main collection, add in the IDs from the list you want to search for, and then group them up; anything with a count of more than one is a match, so we can filter those out to return a simple list of IDs that do not exist in the collection.
This way you avoid any complex processing or lookups.
I’ve not tested this on a large data set…but perhaps worth a try to compare to other approaches.
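To make the counting trick concrete, here is a small in-memory sketch in plain JavaScript (not part of the pipeline; `missingByCount` is a hypothetical name) that mirrors what the $unionWith/$group stages do:

```javascript
// In-memory illustration of the $unionWith + $group counting trick.
// collectionIds stands in for the _ids projected from "Test";
// matchList is the array of IDs we are searching for.
function missingByCount(collectionIds, matchList) {
  // $unionWith + $unwind: one combined stream of IDs
  const combined = [...collectionIds, ...matchList];
  // $group: { _id: '$_id', total: { $sum: 1 } }
  const counts = new Map();
  for (const id of combined) counts.set(id, (counts.get(id) || 0) + 1);
  // $match: { total: 1 } — keep IDs seen exactly once
  return [...counts].filter(([, n]) => n === 1).map(([id]) => id);
}
```

Note that a count of 1 covers both directions: it keeps IDs that are only in the search list and also IDs that are only in the collection.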
While you're at it, for the benefit of all users, could you please provide closure on your other thread?
@John_Sewell, thanks for the reply.
I just ran a test with 10M records in a collection, checking whether 4 IDs exist; it took about 40s on my workstation.
Each record was just a basic document with an ID, and the collection took up about 140MB of storage; I was trying to push a lot of data through the pipeline.
I had another play and came up with this:
db.getCollection("Test").aggregate([
  // Keep only the _id of every document in the main collection
  { $project: { _id: 1 } },
  // Flag documents that came from the collection with +1
  { $addFields: { flag: 1 } },
  // Append one document per ID in the search list, flagged with -1
  {
    $unionWith: {
      coll: 'Dual',
      pipeline: [
        { $project: { _id: [5, 6, 'C', 'B'] } },
        { $unwind: '$_id' },
        { $addFields: { flag: -1 } }
      ]
    }
  },
  // Sum the flags per ID: 1 = collection only, -1 = search list only, 0 = both
  { $group: { _id: '$_id', total: { $sum: '$flag' } } },
  // Count how many IDs fall into each category
  { $group: { _id: '$total', total: { $sum: 1 } } }
])
So this categorises the IDs into one of 3 types:
- Only in collection: 1
- Only in search list: -1
- In both search and collection: 0
So you could have a $match as the last stage (instead of the second $group) to filter down to what you want to return.
(Note the extra $addFields stage rather than setting the value in the $project: if you set the field value to 1 there, it is treated as a projection inclusion and the field is not actually set to 1. One alternative, if I'm not mistaken, is $literal, e.g. { $project: { _id: 1, flag: { $literal: 1 } } }.)
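As a quick sanity check of the flag arithmetic, here is an in-memory sketch in plain JavaScript (`categoriseIds` is a hypothetical name) of the per-ID flag sum:

```javascript
// Sum a +1 flag for collection IDs and a -1 flag for search-list IDs,
// mirroring $group: { _id: '$_id', total: { $sum: '$flag' } }.
function categoriseIds(collectionIds, searchList) {
  const sums = new Map();
  for (const id of collectionIds) sums.set(id, (sums.get(id) || 0) + 1); // flag: 1
  for (const id of searchList) sums.set(id, (sums.get(id) || 0) - 1);    // flag: -1
  // 1 = only in collection, -1 = only in search list, 0 = in both
  return sums;
}
```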
So, to find the items that are in your passed-in list but not in the collection:
db.getCollection("Test").aggregate([
  { $project: { _id: 1 } },
  { $addFields: { flag: 1 } },
  {
    $unionWith: {
      coll: 'Dual',
      pipeline: [
        { $project: { _id: [5, 6, 'C', 'B'] } },
        { $unwind: '$_id' },
        { $addFields: { flag: -1 } }
      ]
    }
  },
  { $group: { _id: '$_id', total: { $sum: '$flag' } } },
  // Keep only IDs that appear in the search list but not in the collection
  { $match: { total: -1 } }
])
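The end-to-end result of that pipeline can be sketched in memory as well (plain JavaScript; `idsNotInCollection` is a hypothetical name):

```javascript
// In-memory equivalent of the full pipeline ending in $match: { total: -1 }:
// return the IDs that are in the search list but not in the collection.
function idsNotInCollection(collectionIds, searchList) {
  const sums = new Map();
  for (const id of collectionIds) sums.set(id, (sums.get(id) || 0) + 1);
  for (const id of searchList) sums.set(id, (sums.get(id) || 0) - 1);
  return [...sums].filter(([, s]) => s === -1).map(([id]) => id);
}
```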
Here is an adaptation of the solution from the thread shared by turivishal.
I used the grades collection from the sample_training database from the sample dataset of Atlas.
/* first you insert your _id in a temporary collection, let's name it Dual like John_Sewell did */
lookup = {
  "$lookup" : {
    "from" : "grades",
    "localField" : "_id",
    "foreignField" : "_id",
    "as" : "_tmp.found" ,
    "pipeline" : [
      { "$limit" : 1 } , /* I still $limit:1 just to make sure the $lookup stops fast */
      { "$project" : { "_id" : 1 } }
    ]
  }
}
match = { "$match" : { "_tmp.found.0" : { "$exists" : false } } }
unset = { "$unset" : "_tmp" }
out = { "$out" : "result" }
db.Dual.aggregate( [ lookup , match , unset , out ] )
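For intuition, this approach amounts to probing the _id index once per candidate ID rather than scanning the whole collection. A minimal in-memory analogue in plain JavaScript (the Set stands in for the index on grades._id; the function name is hypothetical):

```javascript
// Each candidate ID is looked up against an index (a Set here); keeping the
// misses mirrors the $match on an empty _tmp.found array.
function idsMissingViaLookup(candidateIds, idIndex) {
  return candidateIds.filter(id => !idIndex.has(id));
}
```

Because each probe is an index hit, the cost scales with the number of candidate IDs rather than the size of the collection.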
The above took around 2s on an Atlas M0 with 1.7M documents in the grades collection (I duplicated the original grades documents a few times to get to 1.7M), with 10,000 _id values present and 10,000 _id values not present.
That’s much better! Order of magnitude faster!
The main reason is that $group is expensive: it has to process all incoming documents before outputting the first one, and in some cases it will need to hit the disk if the $group stages do not fit in memory. Yes, my $lookup is also kind of expensive, but it uses the index on _id.
The $addFields is also expensive since it is applied to all 10M docs.
If you still have your 10M and 4 IDs, it would be nice to have numbers with the same dataset.
Running it was the first thing I did after seeing your solution, and it completed in a fraction of a second! I'm away from my PC at the moment but shall post the actual time tomorrow.
As you say, not grouping all that data and instead hitting an index is so much more performant, especially as the index is already there!