Atlas archiving

Hello everyone, our team currently uses Atlas as our main DB, and we have a lot of useless data. I am considering different options and have some questions:

  1. Do I understand correctly that the Online Archive approach will be cheaper than the approach where we move useless data to a cold cluster?
  2. We already have a huge amount of useless data; would it be better if we run some process that archives it first and then set up a scheduled online archive?
  3. Does Online Archive use S3 Glacier, or does it just use standard S3 storage?

Hi @Bohdan_Chystiakov,

Great questions! Here are the responses:

  1. Yes, Online Archive is easier and cheaper, as archives are fully managed, so you don’t need to configure and maintain separate cloud object storage. You can create date-based or custom archival rules that run indefinitely, so you don’t have to manually migrate or delete data (a sketch of creating such a rule appears after this list). In addition, your archives are online and easily accessible alongside your Atlas cluster data.
  2. Great question! For the first-time archival, it is better to run for longer; if you can dedicate exclusive time, it is better to run 24 hours a day first and then set up scheduled archives. When you are running for the first time, you will likely want to archive a massive chunk of data. The OA speed limit is 2 GB per 5 minutes, which is around 0.5 TB/day. So, if you have 2 TB of data to archive, you can let the archive run continuously and it will finish in about 4 days. After the first-time archival, your daily archival workloads can be scheduled to run during off-peak hours with the “Scheduled archiving” feature.
  3. Online Archive uses standard S3 storage (not Glacier).
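
As a rough sketch of the rule creation mentioned in point 1: a date-based archive can also be created programmatically through the Atlas Admin API’s online-archive endpoint. All names and IDs below are placeholders, and the payload shape follows my reading of the API reference, so please double-check it against the current docs.

```python
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v1.0"
GROUP_ID = "<project-id>"      # placeholder: your Atlas project ID
CLUSTER = "<cluster-name>"     # placeholder: the cluster to archive from

# A date-based rule: archive documents whose "created" field is older than
# 60 days. Database, collection, and field names are hypothetical examples.
archive_rule = {
    "dbName": "sample_db",
    "collName": "orders",
    "criteria": {
        "type": "DATE",
        "dateField": "created",
        "dateFormat": "ISODATE",
        "expireAfterDays": 60,
    },
    "partitionFields": [
        {"fieldName": "created", "order": 0},
    ],
}

resp = requests.post(
    f"{BASE}/groups/{GROUP_ID}/clusters/{CLUSTER}/onlineArchives",
    auth=HTTPDigestAuth("<public-key>", "<private-key>"),  # Atlas programmatic API key
    json=archive_rule,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```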

Here are a few additional questions, if I may:

  1. Is it possible to set up our own S3 bucket and use it with the Online Archive approach, or is it completely managed by Atlas?
  2. If we want to abandon Atlas, can we keep the archive data?
  3. Is there any way to return the data from Online Archive to the old cluster?

Hi @Bohdan_Chystiakov,

  1. Online Archive is completely managed by Atlas. If you want to set up your own S3 bucket and query it, you can definitely do this using Data Federation: Atlas Data Federation | MongoDB
  2. No, Online Archive is tied to Atlas.
  3. Yes, if you want to restore the data back from Online Archive to your cluster, you can use $merge (see the sketch after this list): https://www.mongodb.com/docs/atlas/online-archive/restore-archived-data/
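
Here is a minimal sketch of the $merge restore from point 3, assuming pymongo and the archive-only connection string from the cluster’s Connect dialog. The `atlas` target specification follows my reading of the Data Federation $merge docs, and every name in it is a placeholder:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Placeholder: the archive-only connection string exposed by Atlas for this cluster.
client = MongoClient("mongodb://<user>:<password>@<archive-endpoint>/?tls=true")

# Copy archived documents back into the live cluster.
client["sample_db"]["orders"].aggregate([
    # Optionally narrow what you restore; field name and date are examples.
    {"$match": {"created": {"$gte": datetime(2023, 1, 1, tzinfo=timezone.utc)}}},
    {"$merge": {
        "into": {"atlas": {"clusterName": "<cluster-name>",
                           "db": "sample_db",
                           "coll": "orders"}},
        "on": "_id",
        "whenMatched": "keepExisting",   # don't overwrite documents still on the cluster
        "whenNotMatched": "insert",
    }},
])
```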

Hi @Prem_PK_Krishna, thanks for answering these questions. My team is also exploring solutions for MongoDB data archival, and Online Archive seems a great option for it. But before making a final decision I have some queries, and I would love to get some response/feedback from an expert.

  1. The limitations say that at most 50 OAs can be created per cluster, and that out of those, 20 can be active at a time. Can you explain this in a bit more detail? Does that mean that if someone retrieves data from the inactive ones (the other 30 OAs), the query will fail or return inaccurate data? Also, is this limit customisable? Can we request more active OAs, or the creation of more OAs, as our requirements grow?
  2. How much latency can we expect in data retrieval? If it is minimal, then it’s fine.
  3. Also, can we configure online archival for a collection while defining the schema at the codebase level? I mean, is there any way I can provide a field specifying that this collection should be archived with the given rules?
  4. Can you please point to some more detailed documentation on working with OAs through the CLI and APIs? In particular, how can I set up OAs for multiple collections within a cluster with exactly the same rules, using a script or something else, at application startup (something like the sketch after this list)?
  5. The notes mention multiple times that data will only be archived if there is more than 5 MiB of it for the last 7 days; what exactly does this mean? Suppose my rule is set for 60 days and we don’t have more than 5 MiB of data in this cycle; will this data be archived in the next cycle?
  6. What are the best practices to manage archived data? Eventually it can grow huge, and deleting this archived data is not an option. Is there something to solve this, like maybe partitioning one OA into multiple ones, etc.?
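
To make question 4 concrete, this is the kind of startup script I have in mind. It is only a sketch: the endpoints mirror the Admin API example earlier in the thread, and every name below is a placeholder.

```python
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v1.0"
GROUP_ID = "<project-id>"
CLUSTER = "<cluster-name>"
AUTH = HTTPDigestAuth("<public-key>", "<private-key>")

# Hypothetical: apply the same 60-day date rule to several collections at startup.
COLLECTIONS = [("sample_db", "orders"), ("sample_db", "invoices"), ("sample_db", "events")]

# Fetch existing archives once so the script stays idempotent across restarts.
existing = requests.get(
    f"{BASE}/groups/{GROUP_ID}/clusters/{CLUSTER}/onlineArchives",
    auth=AUTH, timeout=30,
).json().get("results", [])
existing_ns = {(a["dbName"], a["collName"]) for a in existing}

for db_name, coll_name in COLLECTIONS:
    if (db_name, coll_name) in existing_ns:
        continue  # an archive for this namespace already exists
    rule = {
        "dbName": db_name,
        "collName": coll_name,
        "criteria": {"type": "DATE", "dateField": "created",
                     "dateFormat": "ISODATE", "expireAfterDays": 60},
    }
    requests.post(
        f"{BASE}/groups/{GROUP_ID}/clusters/{CLUSTER}/onlineArchives",
        auth=AUTH, json=rule, timeout=30,
    ).raise_for_status()
```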

Lots of questions :smile:, but this will help us be clear and accurate about a solution that can be used in the long term.

For anyone else with the same questions, they have responded with answers here: