Hi There
If a write-heavy single database split between 119 collections has an accumulation of 955 indexes, with an average of 14 indexes per collection, what would the performance impact be?
What we have noticed is that, since it is a write-heavy deployment (a real-time metadata system) and popular collections grow quite quickly, we start to see write queues, increasing replication lag, and disk latency of 999 ms (7200 RPM disks), even though those collections have only reached 400 GB.
What we have been doing to alleviate the pressure that builds up is renaming collections and archiving the old data quite often. Are there other ways to let the data keep growing without reaching this write performance degradation? We have ample terabytes of storage available.
Hi,
From what I understood, with help from some Stack Overflow threads, the online MongoDB docs, and some AI assistance, here is what I came up with:
The performance issues you're seeing (write queues, replication lag, and high disk latency) are most likely caused by the combination of a write-heavy workload, a large number of collections each carrying many indexes, and slow disks. Here's an analysis of the potential performance impacts and strategies to mitigate them.
Performance Impact of Many Indexes in a Write-Heavy Environment
- Write Amplification:
- MongoDB must update every relevant index each time a write operation occurs. With an average of 14 indexes per collection, every write operation triggers updates to each of those indexes, significantly increasing the overall write workload.
- In a write-heavy environment, maintaining multiple indexes can lead to write amplification, where each insert, update, or delete involves modifying multiple structures (the data itself and the corresponding indexes), which drastically increases disk I/O.
- Replication Lag:
- In replica sets, secondaries need to replicate not only the data but also all the index updates. If the primary is bogged down by index updates, the secondaries will struggle to keep up, leading to replication lag.
- As your collections grow larger and have more indexes, the replication lag can become more pronounced, especially with a high number of write operations.
- Disk Latency:
- The 7200 RPM disks you are using may not be fast enough to handle the high level of random I/O generated by the write-heavy operations and the need to maintain many indexes. The latency you’re seeing (999ms) indicates that the disk is under significant strain.
- As collection sizes increase, disk seeks and I/O wait times will compound, leading to slower writes and an accumulation of operations in the write queue.
- Collection Renaming and Archiving:
- Renaming and archiving collections helps temporarily relieve pressure by reducing the active dataset and the corresponding index maintenance. However, this is a reactive strategy and requires frequent manual intervention, which may not scale well as data continues to grow.
Strategies to Mitigate Write Performance Degradation
1. Evaluate and Optimize Indexes:
- Index Analysis: Audit your indexes using explain() output and the $indexStats aggregation stage. Identify unused or underutilized indexes and remove them; fewer indexes reduce write amplification and improve write performance.
- Compound Indexes: Replace single-field indexes with compound indexes where possible. For example, if multiple queries often use the same fields, a compound index can be more efficient than maintaining several separate indexes.
Commands to check index usage:
db.collection.getIndexes()                        // view indexes
db.collection.aggregate([ { $indexStats: {} } ])  // get index usage stats
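A minimal sketch of finding drop candidates, assuming a hypothetical events collection and a hypothetical index name (note that $indexStats access counters reset on server restart, so sample over a representative period):
db.events.aggregate([
  { $indexStats: {} },
  { $match: { "accesses.ops": 0 } },        // indexes never used since the last restart
  { $project: { name: 1, accesses: 1 } }
])
db.events.dropIndex("unused_index_name")    // hypothetical index name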
2. Sharding:
- Since you’re working with a write-heavy system, consider implementing sharding to distribute the data and workload across multiple nodes.
- With sharding, you can split your collections across different shards based on a sharding key, allowing MongoDB to handle more writes in parallel. This can significantly reduce replication lag and write queues.
- Choose a shard key that distributes the write load evenly across shards (e.g., a high-cardinality field such as a user ID). Avoid a monotonically increasing key like a raw timestamp, which funnels all inserts to a single shard; a hashed shard key avoids that hotspot, as sketched below.
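A minimal sketch, assuming a hypothetical metadata database with an events collection and a userId field:
sh.enableSharding("metadata")                                 // hypothetical database name
sh.shardCollection("metadata.events", { userId: "hashed" })   // hashed key spreads inserts across shards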
3. Storage Upgrades:
- Upgrade to SSDs: If possible, upgrading to SSDs (Solid-State Drives) would provide a massive performance boost, especially for write-heavy workloads. SSDs are much faster than 7200 RPM disks and can handle the random I/O patterns that MongoDB generates.
- RAID Configuration: If upgrading to SSDs is not possible, consider using RAID 10 (a combination of RAID 1 and RAID 0) to improve both read and write performance on HDDs by increasing redundancy and striping data across multiple drives.
4. Write Concern and Acknowledgement:
- Tune your write concern settings. If you don't need strong write guarantees for every operation, consider lowering the write concern level. For example, changing the write concern from w: "majority" to w: 1 will reduce the replication overhead.
- This can reduce the wait time for write acknowledgment and improve throughput, although it comes with the risk of losing some data in the event of a failure.
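A minimal sketch of a per-operation write concern override, assuming a hypothetical events collection and document shape:
db.events.insertOne(
  { deviceId: "sensor-42", value: 17.3, ts: new Date() },  // hypothetical document
  { writeConcern: { w: 1 } }                               // acknowledge from the primary only
)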
5. Batching Writes:
- Batch inserts or updates: If possible, batch writes together instead of performing single document writes. This reduces the number of write operations MongoDB needs to handle, reducing the load on indexes and I/O.
- Use bulk operations in MongoDB (bulkWrite) to handle multiple write operations in a single request, which can optimize performance, as sketched below.
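A minimal sketch of an unordered bulk write, again assuming a hypothetical events collection; ordered: false lets the server apply the operations in parallel and continue past individual failures:
db.events.bulkWrite([
  { insertOne: { document: { deviceId: "sensor-42", value: 17.3, ts: new Date() } } },
  { insertOne: { document: { deviceId: "sensor-43", value: 18.1, ts: new Date() } } },
  { updateOne: { filter: { deviceId: "sensor-44" }, update: { $set: { value: 19.0 } } } }
], { ordered: false })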
6. Indexing Strategy for Large Collections:
- Consider using partial indexes or sparse indexes for certain fields where possible. These index only a subset of the documents in a collection, reducing the overhead on writes; a sketch of both index types follows below.
- TTL indexes can be useful for expiring old data automatically if you only need to retain recent data in real time.
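A minimal sketch, assuming hypothetical status and ts fields on the events collection:
db.events.createIndex(
  { deviceId: 1 },
  { partialFilterExpression: { status: "active" } }  // index only documents still marked active
)
db.events.createIndex(
  { ts: 1 },
  { expireAfterSeconds: 2592000 }                    // TTL: delete documents ~30 days after ts
)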
7. Limit Collection Size Using Capped Collections (Where Possible):
- For collections that only need to store real-time data and don’t require long-term persistence, use capped collections. Capped collections automatically remove the oldest documents as new ones are added, reducing the need for manual archiving.
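A minimal sketch, with hypothetical size limits you would tune to your retention needs:
db.createCollection("recentEvents", {
  capped: true,
  size: 10 * 1024 * 1024 * 1024,  // cap the collection at ~10 GB on disk
  max: 50000000                   // optional cap on document count
})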
8. Optimize Replication Settings:
- Replication Oplog Size: Increase the size of the oplog (the replication log). A larger oplog does not speed replication up, but it gives lagging secondaries a longer window to catch up before they fall off the end of the oplog and need a full resync (a resize sketch follows this list).
- Adjust priority and hidden settings in your replica set to better control the behavior of secondaries and how they handle replication in write-heavy environments.
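A minimal sketch of resizing the oplog (run against each member; size is in megabytes, and the ~50 GB here is an assumed value you should tune to your write volume):
db.adminCommand({ replSetResizeOplog: 1, size: 51200 })
rs.printReplicationInfo()  // confirm the new oplog window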
Conclusion:
The combination of a high number of indexes, a write-heavy workload, and slow disks is causing performance bottlenecks in your system. Here’s a prioritized plan to alleviate the pressure:
- Audit and optimize your indexes—remove unused or redundant indexes and switch to compound or partial indexes where applicable.
- Upgrade storage—consider SSDs or RAID configurations to improve I/O performance.
- Implement sharding to distribute write operations across multiple nodes.
- Tune write concern and use batch operations to reduce write overhead.
- Consider advanced indexing strategies like partial and TTL indexes.
Thanks
Mayank
Hi @Gareth_Furnell,
I’ll link you to an article on why such a large number of collections is an anti-pattern:
As for having such a high number of indexes, it too can slow down writes, since roughly 14 B-tree indexes per collection have to be updated on every write. You should also check whether compound indexes can cover the queries currently served by single-field indexes, as sketched below.
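For example (a sketch with hypothetical field names): a query on a prefix of a compound index can use that index, so a compound index can make a single-field index on its leading field redundant:
db.events.createIndex({ deviceId: 1, ts: -1 })  // also serves queries filtering on deviceId alone
db.events.dropIndex({ deviceId: 1 })            // the single-field index is now redundant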
It really is a significant amount of work that this database does!
Keep me posted on your strategies as it is a very interesting use case!
Best Regards
Hi @Gareth_Furnell ,
You may face a dhandle (data handle) issue given the count of your collections and indexes, assuming you have other databases on this server apart from the one mentioned above. Around 4 months ago we faced the same issue in one of our on-premises replica set deployments; it resulted in delayed checkpointing, occasional operation failures, and slow operation times.
You can find dhandle details in the document below:
dhandle details
We reached out to the MongoDB support team and found out that the ideal dhandle count should be around 20,000; however, we had around 300,000 (3 lakh). You can use the command below to get the count of active dhandles on your server.
db.serverStatus()['wiredTiger']['data-handle']['connection data handles currently active']
Moreover, as @Gareth_Furnell noted, it can slow down write operations, since incoming data may require updates to all the existing indexes. On top of data archiving, try implementing TTL indexes, and try deleting unwanted collections and indexes if they are not needed.