Hi, for some context, I am trying to build a feed for my application, where data posted by other users is retrieved in random order (not chronological, so you can load up a post from a long time ago, that’s ok) and you can scroll down to load more (pagination).
I have some things I’d like to clarify with this implementation.
- How do I randomly select X posts from a collection without repeats (and none made by that specific user)?
- How do I then paginate this, i.e. fetch more with no repeats?
Suppose I have a collection of Posts with documents like this:
{
  _id: "someID",
  title: "someTitle",
  content: "someContent",
  posted_by: "userID"
}
How do I retrieve 10 random posts where posted_by is NOT equal to the current userID, say "userA"?
And how should I be implementing pagination so that 10 different random posts are queried each time, again where posted_by !== "userA"?
I understand that I can use _id (where it’s a Mongo ObjectID) together with limit for simple chronological pagination, but how do I incorporate randomness into this as well?
Appreciate any help, thank you!
Hi @Ajay_Pillay ,
One of the built-in ways to do this is to use the $sample stage, with size set to the number of documents you want to retrieve.
db.Posts.aggregate(
[ { $sample: { size: 10 } } ]
)
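To also cover the posted_by condition from the question, you could put a $match in front of the $sample, along these lines (a sketch; "userA" stands in for the current user’s id):
db.Posts.aggregate([
  // filter out posts made by the current user first
  { $match: { posted_by: { $ne: "userA" } } },
  // then pick 10 random documents from what remains
  { $sample: { size: 10 } }
])
One caveat: when $sample is not the first stage it cannot use the optimized random-cursor path and instead does a random sort over the matching documents, which can be slower on very large collections.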
Thanks
Pavel
Hi @Pavel_Duchovny, yes I did come across this, but the documentation says:
$sample may output the same document more than once in its result set.
Is there any way to prevent duplicates?
What do you mean by duplicates?
Do you want to group them?
Oh no, what I mean is that on the $random documentation page it says the selector may return an item more than once (i.e. items with the same _id).
But I can’t have that for what I need to do, I can’t have duplicate items being returned by the query.
Is that not what the documentation implies?
EDIT: I meant to refer to $sample, not $random.
$sample is not $random, it’s a different stage, and it returns documents once.
Sorry, I made a typo, I meant to refer to $sample.
What does this bit mean when it says $sample may return a document more than once? Doesn’t that mean there could be duplicates? Or am I not understanding that correctly?
Hi @Ajay_Pillay ,
OK, I never paid attention to this section. I’ve never noticed a duplicate document, and I believe it’s a super rare condition if you pick just 10 documents.
However, if you really need to ensure uniqueness, you can group by _id and project a new root from the first document only, which guarantees no document is returned twice.
[
  { $sample: { size: 10 } },
  { $group: { _id: "$_id", result: { $push: "$$ROOT" } } },
  { $replaceRoot: { newRoot: { $first: "$result" } } }
]
Thanks,
Pavel
Thanks for the clarification!
I need to account for uniqueness because I will be querying for 10 documents at first, and when the user scrolls down the page, I need to query 10 more unique documents, and as this grows, the chance of duplicates in subsequent queries increases.
How should I be approaching this? I understand how to make a single query for 10 unique random documents but how should this be used together with pagination?
@Ajay_Pillay ,
In such a case I suggest that you add a random number to your documents, index it, and pick 10 random numbers to be queried in an $in query, then pick 10 more, making sure they were not already picked before.
Otherwise, just pick a way to sort the documents randomly and paginate them.
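A rough sketch of that second approach, assuming you add a numeric rand field to every post and index it (the field name and lastSeenRand are placeholders; $rand in an update pipeline needs MongoDB 4.4.2+, otherwise generate the value in application code):
// one-time backfill: give every post a random sort key
db.Posts.updateMany({}, [ { $set: { rand: { $rand: {} } } } ])
db.Posts.createIndex({ rand: 1 })

// first page: 10 posts in the shuffled order, excluding the current user
db.Posts.find({ posted_by: { $ne: "userA" } }).sort({ rand: 1 }).limit(10)

// next page: continue after the last rand value seen on the previous page
db.Posts.find({ posted_by: { $ne: "userA" }, rand: { $gt: lastSeenRand } }).sort({ rand: 1 }).limit(10)
The order stays the same until you regenerate the rand values, so every user pages through the same shuffled order; re-randomizing periodically (or per user) is an option if that matters.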
Thanks
Pavel
Another option is to run the aggregation on 3000 samples and batch them 10 documents at a time…
If a user presses "next" 300 times, run a new query… There is no way they will notice a repeated result.
Hi @Pavel_Duchovny, this is an interesting idea, and honestly I don’t think I even need it at 3000, I think 100 is enough. A sample of 100, batched into 10 documents at a time.
So once I get this random sample, I paginate 10 documents at a time. Once I reach the end of this 100, I will run another sample of 100, and batch as per before.
How exactly am I supposed to be doing this batching, and saving the $sample aggregation results between queries? For context, I’m running a Meteor/Apollo/React stack, so my GraphQL queries will include an offset and limit argument in my resolvers, and I will use that to handle the batching logic.
Aggregation has a batchSize as part of the aggregation command.
I don’t know the specifics of your driver architecture, but you can just query and fetch from the cursor.
No need for skip and limit.
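To illustrate, here is a rough sketch with the Node.js driver (the collection name, currentUserId, and the page size of 10 are assumptions carried over from earlier in the thread):
// one cursor over a 100-document random sample, handed out 10 documents at a time
const cursor = db.collection("Posts").aggregate(
  [
    { $match: { posted_by: { $ne: currentUserId } } }, // placeholder for the logged-in user
    { $sample: { size: 100 } }
  ],
  { batchSize: 10 } // the driver fetches 10 documents from the server at a time
);

// each call drains the next 10 documents from the same sample
async function nextPage() {
  const page = [];
  while (page.length < 10 && (await cursor.hasNext())) {
    page.push(await cursor.next());
  }
  return page;
}
The $sample runs once and every page comes from the same cursor, so there is no skip/limit and no chance of a new sample between pages; the catch is that the cursor has to stay alive between requests (idle cursors time out after about 10 minutes by default).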
So I tried the following query:
db.getCollection('myDB').aggregate(
  [ { $sample: { size: 100 } } ],
  { cursor: { batchSize: 10 } }
)
But I still see all 100 samples being returned.
Also, I don’t quite understand why I don’t need the idea of skip and limit here. Correct me if I am wrong, but from what I understand I run an aggregation for a sample size of 100 documents once. Then, I can choose which batch to send back, based on the offset and limit.
So my first query would be an offset of 0 and limit of 10, so I return the first batch, and on the second query I return the second batch (offset 10, limit 10). But if I run this command a second time, wouldn’t a different $sample be chosen?
I think that the shell overrides anything below 101 results.
Try this with your driver …
Now, skip and limit are not related to the batch; they offset and limit the entire query and are applied on the server side…
Skip is a highly unoptimized operation, as it needs to first scan past all the skipped documents before returning results, while limit just stops after x returned docs.
I am using aggregation to fetch a single random document from a collection of over 0.3 million documents. It’s performing very slowly, taking 500ms-1000ms for each query.
Sample query:
db.COLLECTION_NAME.aggregate([
  { $match: { isAvailable: true } },
  { $sample: { size: 1 } }
]);
Did anyone else face performance issues while using aggregation to fetch a random document?
I think you should sort based on some number, creating a relevance rank for each post using some metric.
I’m not sure if this is easily applied, to be honest.