I read the antipattern about massive numbers of collections, with the recommendation not to exceed 10,000 collections in a replica set. If this threshold is crossed, it was mentioned that you are likely to experience a performance loss. The impact of more and more indexes was mentioned as one reason; the other is that WiredTiger stores a file for each collection and each index.
My question is, what performance is being reduced?
Once I am connected to a collection and performing my queries on it, I don’t think it should matter how many other collections exist. The second reason in particular sounds to me like this performance restriction comes into play when I want to get a handle on a collection. Or am I missing something here?
If the handles of my collections don’t change often, it shouldn’t matter how many collections I have, should it?
I think you hit an important point here. Short answer is: 10,000 is a “ballpark figure”, and the actual ceiling of your specific workload could be higher or lower than this.
Long answer: as you have alluded to, the performance hit greatly depends on what you do with the database. You are correct that you can have a lot of collections, but if you barely use any of them, you shouldn’t see the same hit as when you use all of them consistently. For example, given the same number of collections, a database backing a multi-tenant system (where all collections are constantly in use) would behave very differently from a database backing accounting software (where you keep a lot of historical data but only work on a small number of collections consistently). Of course this also depends on the actual hardware you’re using; larger hardware may give you more ceiling.
Typically “performance” in this topic means that queries are getting slower. In extreme cases, workloads can stall while the database is too busy trying to process the incoming work.
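If you want to see this on your own deployment, one way is the database profiler. Here is a minimal pymongo sketch (the connection string, database name, and the 100 ms threshold are just placeholders):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]  # hypothetical database name

# Profile operations slower than 100 ms (level 1 = capture slow ops only).
db.command("profile", 1, slowms=100)

# Later, inspect the slowest captured operations.
for op in db["system.profile"].find().sort("millis", -1).limit(5):
    print(op.get("ns"), op.get("millis"), "ms")
```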
Thank you for your answer, that already clarified some aspects.
Now, you mentioned two interesting things that make me wonder.
First, you mention that this ceiling is hardware dependent. Does that mean that this performance drop results from a constant, ever-present overhead that comes with every collection, adding up to a point where even opening a single collection might be slowed down by the mere existence of a large number (i.e. >10,000) of collections?
Second, you are referring to the different uses of databases, while the limit description was referring to replica sets. Would it make a difference if I spread the 10,000 collections amongst multiple databases? I was under the impression that a database is just a hierarchy layer above the collection.
Yes. In the simplest terms, MongoDB can only operate on data in memory. If your working set can fit in the RAM you have, you’ll have the best performance. Once MongoDB needs to fetch data from disk, this is where things start to slow down, as disk is frequently the slowest part of a computer.
There is certainly overhead involved in all operations (from MongoDB, the OS, or the hardware), so this will exacerbate any issue when the workload hits the hardware’s limitations. Put more workload on it, and the server will spend significant time just loading/unloading data to/from disk.
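One way to watch for that pressure is the WiredTiger cache section of serverStatus. A rough pymongo sketch (the connection string is a placeholder; the metric names come from WiredTiger’s serverStatus output and may vary between versions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string

# serverStatus reports WiredTiger cache usage under "wiredTiger.cache".
status = client.admin.command("serverStatus")
cache = status["wiredTiger"]["cache"]

used = cache["bytes currently in the cache"]
limit = cache["maximum bytes configured"]
print(f"WiredTiger cache: {used / limit:.0%} of {limit / 2**30:.1f} GiB in use")
```

If that ratio sits near 100% while the server is busy, the working set is likely larger than the cache and data is being cycled to/from disk.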
You are correct. In MongoDB, the collection is the basic unit, and a “database” basically acts as a namespace for collections. So the total number of collections is what matters, regardless of how they are spread across databases.
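If you want a rough idea of where a deployment stands, you could count namespaces across all databases. A hedged pymongo sketch (the connection string is a placeholder; it approximates WiredTiger’s file count as collections plus indexes and skips views):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string

total_collections = 0
total_indexes = 0
for db_name in client.list_database_names():
    db = client[db_name]
    # Filter to real collections so views don't error on list_indexes().
    for coll_name in db.list_collection_names(filter={"type": "collection"}):
        total_collections += 1
        total_indexes += len(list(db[coll_name].list_indexes()))

# Under WiredTiger, each collection and each index is roughly one file on disk.
print(f"{total_collections} collections, {total_indexes} indexes "
      f"(~{total_collections + total_indexes} WiredTiger files)")
```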