I am storing multi-tenant data in collections. Each collection has a tenantId field. Sometimes I will need to delete a tenant, which means running a query such as:
db.records.deleteMany({ tenantId: "abc" })
However, a single tenant can have up to half a billion records in some collections, so this is very slow, and I cannot easily track progress either. tenantId is indexed. What is the best way to delete data in cases like this?
I am wondering whether any of these approaches are recommended:
- Query X rows at a time, build an array of their _ids, delete by those _ids, and repeat until no rows remain (rough sketch after this list).
- Store each tenant's data in a separate collection? But that would complicate things whenever I need to query across tenants.
- Bulk-update the documents to set a deletedAt field that has a TTL index and let them expire? Not sure whether that is any faster (also sketched after this list).
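For the first idea, something like this is what I have in mind, a rough and untested sketch in the mongo shell (the batch size of 10000 is arbitrary):

var deletedSoFar = 0;
while (true) {
  // Grab the next batch of _ids for the tenant; the projection keeps the fetched documents small.
  var ids = db.records.find({ tenantId: "abc" }, { _id: 1 })
    .limit(10000)
    .toArray()
    .map(function (doc) { return doc._id; });
  if (ids.length === 0) break; // no rows left for this tenant
  db.records.deleteMany({ _id: { $in: ids } });
  deletedSoFar += ids.length;
  print("deleted so far: " + deletedSoFar); // crude progress indicator
}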
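And for the TTL idea, roughly this (again only a sketch; deletedAt is just the field name I would pick, and expireAfterSeconds: 0 makes documents eligible for removal as soon as that date is in the past):

// TTL index on the soft-delete timestamp.
db.records.createIndex({ deletedAt: 1 }, { expireAfterSeconds: 0 });

// Mark every document of the tenant; the background TTL monitor then removes them over time.
db.records.updateMany(
  { tenantId: "abc", deletedAt: { $exists: false } },
  { $set: { deletedAt: new Date() } }
);

My understanding is that the TTL monitor only runs about once a minute and deletes in the background, so this spreads the load out, but the updateMany itself still has to touch every matching document.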
Will try this. But the docs don't say much about how this works under the hood, or how it differs apart from putting less load on the DB?
Bulk operations might be of help here. An unordered bulk.find(queryDoc).remove() is basically a version of db.collection.remove(queryDoc) optimized for large numbers of operations. Its usage is pretty straightforward:
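A minimal example of what that can look like in the mongo shell, using the records collection and tenantId value from the question:

// Unordered bulk builder; the remove is queued and only sent to the server on execute().
var bulk = db.records.initializeUnorderedBulkOp();
bulk.find({ tenantId: "abc" }).remove(); // removes all documents matching the filter
bulk.execute();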
The idea behind this approach is not to speed up the removal, but to produce less load. In my tests, the load was roughly halved and it took slightly less time than a db.collection.remove(query).
Another thing to keep in mind is that a large number of deletes may cause high network bandwidth usage and longer replication lag (many deployments alert on replication lag).