[Performance] Best Practice for large arrays per document

Hi,

I’m pretty new to working with MongoDB and I have a question that I cannot find a good answer for.
These are the current circumstances:

  • I have to use a quite old MongoDB version (4.0.28)
  • I also have to use a quite old Java (JDK 8) plus Spring Framework 5.1.9

For my test case I use the following simple setup: one document with
an ID, name, description, owner, creationDate, updatedDate and a permissionsSet array.

Here is an example:

{
  "_id": {
    "$oid": "64e31163e4b086f9c5a4fe6a"
  },
  "_class": "de.dmi.dmixcloud.core.model.TestObject",
  "name": "RandomName: 0",
  "description": "Random Description 0",
  "owner": "5f0707690cf241886f45d125",
  "creationDate": {
    "$date": "2023-08-21T07:25:23.407Z"
  },
  "updatedDate": {
    "$date": "2023-08-21T07:25:23.407Z"
  },
  "permissionsSet": []
}

Now to the permissionsSet array. It holds the IDs of all users who have access to this document. Several thousand users may have access, which means this array can easily contain thousands of strings.
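For illustration, granting or revoking access in this model boils down to array updates like the following (plain JS objects mirroring the mongo 4.0 shell syntax; the user ID is just the example value from above):

```javascript
// Sketch of a grant/revoke on the array-based model.
// $addToSet avoids duplicate entries, unlike a plain $push.
const userId = "5f0707690cf241886f45d125";

const grant = { $addToSet: { permissionsSet: userId } }; // add user
const revoke = { $pull: { permissionsSet: userId } };    // remove user

// applied e.g. via db.items.updateOne({ _id: someItemId }, grant)
```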

I have noticed that the bigger this array gets, the slower a $push call becomes. I have not yet tested how querying the array performs.

I’m aware that I’m trying to model relations in a document-based database, but that is something I cannot change right now. So my question is: is there a better way to model this than keeping all users who have access in one long list? I also thought about having some kind of relation collection, with user ID and document ID as a compound key, and then working with aggregations when I want to find all documents that user XY has access to.
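To make the relation-collection idea concrete, this is the rough shape I have in mind (collection and field names are placeholders, nothing tested yet):

```javascript
// One tiny document per (user, item) pair, like an SQL relation table.
const accessDoc = {
  user: "5f0707690cf241886f45d125",  // user ID
  item: "64e31163e4b086f9c5a4fe6a",  // item/document ID
};

// "All documents user XY has access to" as an aggregation over it:
const pipeline = [
  { $match: { user: "5f0707690cf241886f45d125" } },
  { $lookup: { from: "items", localField: "item", foreignField: "_id", as: "itemDoc" } },
];
// db.access.aggregate(pipeline)
```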

It looks like a relatively common problem, so I’m not sure why I have found so little about it; maybe I’m just searching with the wrong terms. Perhaps someone can tell me which approach is best in terms of runtime.

Thanks and best
Andreas

Did you consider the reverse approach?

You can store all of the user’s permissions in the user model instead of the other way around.

So, for example, you can create a permissions array in the user model and put all of the permissions that the user has into it. That way it will scale even if you have a lot of users (as long as a single user does not have thousands of permissions).
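As a sketch, with field names of my own choosing, the user model would look roughly like this, and the permission check becomes a plain in-memory lookup:

```javascript
// User document carrying its own permissions (illustrative shape).
const user = {
  _id: "5f0707690cf241886f45d125",
  name: "Some User",
  permissions: ["64e31163e4b086f9c5a4fe6a", "64e31163e4b086f9c5a4fe6b"],
};

// No extra database call needed once the user document is loaded:
const hasPermission = (u, id) => u.permissions.includes(id);
```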

In your current approach, you should keep the MongoDB document size limit of 16 MB in mind. If you had hundreds of thousands of users and wanted to store the _id of all of them in the permission document, you could reach that limit.

Hi Nenad,

thanks for the idea, that is indeed something I can consider. But if I have thousands of permissions, I will face the same issue again, right? In that case, would a relation collection mapping document IDs to user IDs be the way to go? Sure, this collection would contain a lot of documents, but each one should be very small.

Best
Andreas

You can store up to roughly 1M ObjectIds in an array without exceeding the 16 MB document limit.
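Back-of-envelope for that figure: each BSON array element stores a type byte and its decimal index as the key, plus the 12-byte ObjectId itself (a rough sketch, not exact BSON accounting):

```javascript
// Approximate per-element cost of an ObjectId in a BSON array:
// 1 type byte + key ("999999\0" is about 8 bytes near 1M entries) + 12-byte value
const bytesPerElement = 1 + 8 + 12;      // about 21 bytes
const maxDocBytes = 16 * 1024 * 1024;    // 16 MiB document size limit
const maxIds = Math.floor(maxDocBytes / bytesPerElement);
console.log(maxIds);                     // roughly 800k, i.e. on the order of 1M
```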

Storing user IDs in the permission document can easily exceed 1M entries if more than 1M users should have a given permission.

Storing permission IDs in the user document will probably stay below the 16 MB limit (as long as you don’t have more than ~1M different permissions). In addition, this approach has another benefit:

  • You will probably need to fetch the user information in the backend anyway before running the business logic. Having all of the permissions already inside the user document is better, because you can immediately validate whether the user has the specific permission for the backend action.
  • If you created a separate collection to store the relationships between users and permissions, you would always have to perform an additional database call to check whether the user has permission for the backend action.

Totally agree, which is why I tried to store users on the documents (which represent items in the platform): I assumed we would have fewer users than items, for example 10 million items but just 5k users.

As a test case I created a collection with 10k documents representing my items, plus 2k user IDs, and I used $push with $each to add all 2k user IDs to all 10k items. That took a while, about 20 s on my test machine. If I then add another 2k users, it takes longer than 60 s. Sure, this is not an operation that happens very often, but it should not scale that badly, should it?
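The update I ran was essentially of this shape, one batch of IDs per item document (shortened placeholder values):

```javascript
// Add a whole batch of user IDs in one update per document,
// rather than one $push per user.
const newUserIds = ["u1", "u2", "u3"];   // in the real test: 2,000 IDs
const update = { $push: { permissionsSet: { $each: newUserIds } } };
// applied via db.items.updateMany({}, update)
```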

Maybe I have to think about another approach; this seems to be too much for the database right now.

Hi,

me again. I’m currently not sure how to continue with my use case; maybe I’m doing something totally wrong. My problem is the massive amount of data and the many-to-many relation that comes with it.

In my system I have an items collection with 15 million documents and roughly 5k users. The worst case is that every user has access to every item.

First I thought I could add to each of the 15 million item documents an array with the IDs of the users who have access to that item. If I then wanted to find all items a given user has access to, I could do it with a single query on the items collection. But having an array with hundreds or even thousands of entries on each document takes an eternity to create, and even a $push or $pull on these arrays is slow.
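For reference, the single query I mean just matches on the array field, since MongoDB matches array fields element by element; it can be supported by a multikey index on permissionsSet:

```javascript
// Find all items whose permissionsSet contains the given user ID.
const query = { permissionsSet: "5f0707690cf241886f45d125" };
// supporting multikey index: db.items.createIndex({ permissionsSet: 1 })
```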

Of course, an array per user listing the items he can access is not recommended either, because in the worst case that array would contain 15 million entries.

So next I tried, and also found suggested as a solution to this problem: if I cannot reduce the many-to-many relation to a one-to-many relation, I can go for an in-between collection, the equivalent of a relation table in SQL databases.

Here comes the next problem: one collection is not big enough for my use case, because I would easily exceed the 2^32 limit on documents per collection. If 2,000 users each have access to 1 million items, I’m already close to the maximum.
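The arithmetic behind that concern, for reference:

```javascript
// Relation documents needed in the worst case (every user sees every item):
const worstCase = 5000 * 15000000;         // users * items = 75 billion
const limit = Math.pow(2, 32);             // about 4.29 billion
console.log(worstCase > limit);            // true
```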

So I thought, what about giving each user his own access collection? But then I found out that this is not recommended and is considered an anti-pattern. So at this point my question is: how can I handle this with MongoDB? Sadly, switching to an SQL-based database is not an option.

Nevertheless, I tried it out, and for my test case it works: creating the access relation collection on my machine (a notebook running MongoDB 4 in Docker) took roughly 6 minutes, which is OK, since this task will not happen often in production and it represents the worst case of creating 15 million documents.

The next question is how to speed up querying this large collection. Each document has only two IDs, the user ID and the object ID, and together they form a compound index. If I now query for 10k items whether a specific user has access, it takes around 30 s. That might be related to my machine, but 30 s is quite long. Is there a way to reduce this time?
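For clarity, the check I’m running looks roughly like this (field names as in my relation documents, values shortened). One thing I am experimenting with: restricting the projection to the indexed fields, so that with the compound index { user: 1, item: 1 } the query can be covered, i.e. answered from the index alone:

```javascript
// Check in one query which of a batch of items a user can access.
const itemIds = ["i1", "i2", "i3"];      // in my test: 10,000 IDs
const query = { user: "5f0707690cf241886f45d125", item: { $in: itemIds } };

// Projection restricted to indexed fields, so the query can be covered:
const projection = { _id: 0, item: 1 };
// db.access.find(query, projection)  -- verify the plan with .explain()
```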

I know that this is a really tough problem and that the given circumstances (MongoDB version 4, MongoDB as the database to handle relations, many-to-many relations) are not optimal, but I just wanted to know whether I missed something here, or whether it is what it is right now.

Thanks and best

@Andreas_Dahm It’s almost a year later but I would love to know how this project turned out. Did you stick with the relation map in production?

As I read this thread, I keep wondering if your documents or your users have categories. For example, if documents have categories like [ “sensitive”, “private”, “public” ] then you can grant users permission to access a category of documents. Categories may also be things like topic, business unit, publish date, content type, etc.

Likewise, if users have categories like [ “user”, “employee”, “manager”, “admin” ] then documents can be designated as accessible to categories of users. User categories could also be things like company rank, job function, business unit, team assignment, etc.
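As a sketch (purely illustrative field names), the access check then compares two small category lists instead of huge ID arrays:

```javascript
// Grant access per category instead of per (user, item) pair.
const user = { _id: "u1", roles: ["employee", "manager"] };

// All documents readable by any of the user's roles:
const query = { allowedRoles: { $in: user.roles } };
```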

@Reland_Boyle nice, thanks for asking, I didn’t expect that :smile: Sadly, I left the company in the meantime, so I cannot say how it has been solved since. My last status was that the plan was to go with the relation table.