How to Archive Data to Cloud Object Storage with MongoDB Online Archive

Maxime Beugnet • 9 min read • Published Feb 07, 2022 • Updated Sep 09, 2024

Atlas Data Federation JavaScript

MongoDB Atlas Online Archive is a new feature of the MongoDB Cloud Data Platform. It allows you to set a rule to automatically archive data off of your Atlas cluster to fully-managed cloud object storage. In this blog post, I'll demonstrate how you can use Online Archive to tier your data for a cost-effective data management strategy.

The MongoDB Cloud data platform with Atlas Data Federation provides a serverless and scalable Federated Database Instance which allows you to natively query your data across cloud object storage and MongoDB Atlas clusters in-place.

In this blog post, I will use one of the MongoDB Open Data COVID-19 time series collections to demonstrate how you can combine Online Archive and Atlas Data Federation to save on storage costs while retaining easy access to query all of your data.

Prerequisites

For this tutorial, you will need:

a MongoDB Atlas M10 cluster or higher as Online Archive is currently not available for the shared tiers,
MongoDB Compass or MongoDB Shell to access your cluster.

Let's get some data

To begin with, let's retrieve a time series collection. For this tutorial, I will use one of the time series collections that I built for the MongoDB Open Data COVID19 project.

The covid19.global_and_us collection is the most complete COVID-19 time series in our open data cluster as it combines all the data that JHU keeps in separate CSV files.

As I would like to retrieve the entire collection and its indexes, I will use mongodump.

mongodump --uri="mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19" --collection='global_and_us'

This will create a dump folder in your current directory. Let's now import this collection in our cluster.

mongorestore --uri="mongodb+srv://<USER>:<PASSWORD>@clustername.1a2bc.mongodb.net"

Now that our time series collection is here, let's see what a document looks like:

{
  "_id": {
    "$oid": "5f077868c3bda701aca1a3a7"
  },
  "uid": 175,
  "country_iso2": "YT",
  "country_iso3": "MYT",
  "country_code": 175,
  "state": "Mayotte",
  "country": "France",
  "combined_name": "Mayotte, France",
  "population": 272813,
  "loc": {
    "type": "Point",
    "coordinates": [
      45.1662,
      -12.8275
    ]
  },
  "date": {
    "$date": "2020-06-03T00:00:00.000Z"
  },
  "confirmed": 1993,
  "deaths": 24,
  "recovered": 1523
}

Note here that the date field is an ISODate, shown in relaxed Extended JSON notation.
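As a small illustration of that notation, the sketch below turns the `date` value from the document above into a native JavaScript `Date`, using only `JSON.parse` and a reviver function. The helper name `reviveDates` is made up for this example and is not part of any MongoDB library; in a real application, the `EJSON` helper from the `bson` package would normally do this job.

```javascript
// Hypothetical helper: convert {"$date": "<ISO string>"} wrappers into Date objects.
function reviveDates(key, value) {
  const isDateWrapper =
    value !== null &&
    typeof value === "object" &&
    Object.keys(value).length === 1 &&
    typeof value.$date === "string";
  return isDateWrapper ? new Date(value.$date) : value;
}

// A trimmed-down version of the sample document, as relaxed Extended JSON text.
const raw =
  '{"state": "Mayotte", "date": {"$date": "2020-06-03T00:00:00.000Z"}, "confirmed": 1993}';
const doc = JSON.parse(raw, reviveDates);

console.log(doc.date instanceof Date); // true
console.log(doc.date.toISOString());   // 2020-06-03T00:00:00.000Z
```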

This time series collection is fairly simple. For each day and each place, we have a measurement of the number of confirmed cases, deaths and recoveries, when available. More details in our documentation.

What's the problem?

Problem is, it's a time series! So each day, we add a new entry for each place in the world and our collection will get bigger and bigger every single day. But as time goes on, it's likely that the older data is less important and less frequently accessed so we could benefit from archiving it off of our Atlas cluster.

Today, July 10th 2020, this collection contains 599760 documents, which corresponds to 3528 places times 170 days, and it's only 181.5 MB thanks to the WiredTiger compression algorithm.

While this is not really an issue with this trivial example, it would definitely force you to upgrade your MongoDB Atlas cluster to a higher tier if an extra GB of data were added to your cluster each day.

Upgrading to a higher tier would cost more money and maybe you don't need to keep all this cold data in your cluster.

Online Archive to the Rescue!

Manually archiving a subset of this dataset is tedious. I actually wrote a blog post about this.

It works, but you will need to extract and remove the documents from your MongoDB Atlas cluster yourself and then use the new $out stage to write your documents to cloud object storage (Amazon S3 or Microsoft Azure Blob Storage).

Lucky for you, MongoDB Atlas Online Archive does this for you automatically!

Let's head to MongoDB Atlas and click on our cluster to access our cluster details. Currently, Online Archive is not set up on this cluster.

Now let's click on Online Archive then Configure Online Archive.

The next page will give you some information and documentation about MongoDB Atlas Online Archive and in the next step you will have to configure your archiving rule.

In our case, it will look like this:

As you can see, I'm using the date field I mentioned above and if this document is more than 60 days old, it will be automatically moved to my cloud object storage for me.

Now, for the next step, I need to think about my access pattern. Currently, I'm using this dataset to create awesome COVID-19 charts.

And each time, I have to first filter by date to reduce the size of my chart and then optionally I filter by country then state if I want to zoom on a particular country or region.

As these fields will be converted into folder names in my cloud object storage, they need to exist in all the documents. That's not the case for the "state" field because some countries don't have sub-divisions in this dataset.

As the date is always my first filter, I make sure it's at the top. Folders will be named and organised this way in my cloud object storage and folders that don't need to be explored will be eliminated automatically to speed up the data retrieval process.
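To make the idea of folder elimination concrete, here is a toy model in plain JavaScript. The sample documents are made up, and the `date=.../country=...` folder naming is an assumption for illustration only; the actual layout Online Archive uses in cloud object storage is not shown in this post.

```javascript
// Made-up sample documents. Note that one of them has no "state" field.
const docs = [
  { date: "2020-01-22", country: "France", state: "Mayotte" },
  { date: "2020-01-22", country: "Belgium" },
  { date: "2020-01-23", country: "France", state: "Mayotte" },
  { date: "2020-01-23", country: "Belgium" },
];

// A field is only usable as a partition field if every document has it.
function existsInAllDocs(field) {
  return docs.every((doc) => doc[field] !== undefined);
}

// Group documents into "folders", with the first filter (date) at the top.
function partition(documents, fields) {
  const folders = new Map();
  for (const doc of documents) {
    const path = fields.map((f) => `${f}=${doc[f]}`).join("/");
    if (!folders.has(path)) folders.set(path, []);
    folders.get(path).push(doc);
  }
  return folders;
}

// Keep only the folders under the requested date; the others are never opened.
function foldersToScan(folders, date) {
  return [...folders.keys()].filter((path) => path.startsWith(`date=${date}/`));
}

const folders = partition(docs, ["date", "country"]);
console.log(existsInAllDocs("state"));             // false
console.log(folders.size);                         // 4
console.log(foldersToScan(folders, "2020-01-23")); // the two 2020-01-23 folders
```

With the date at the top level, a query on a single day only has to look inside that day's folders, which is the behaviour described above.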

Finally, before starting the archiving process, there is a final step: making sure Online Archive can efficiently find and remove the documents that need to be archived.

I already have a few indexes on this collection, let's see if this is really needed. Here are the current indexes:

As we can see, I don't have the recommended index. I have its opposite: {country: 1, date: 1} but they are not equivalent. Let's see how this query behaves in MongoDB Compass.

We can note several things in here:

We are using the date index, which is good news. At least it's not a collection scan!
The final sort operation is { date: 1, country: 1}
Our index {date:1} doesn't contain the information about country so an in-memory sort is required.
Wait a minute... Why do I have 0 documents returned?!

I have 170 days of data. I'm filtering all the documents older than 60 days so I should match 3528 places * 111 days = 391608 documents.

111 days (not 170-60=110) because it is July 10th as I write this and I don't have today's data yet.
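That count can be double-checked with a few lines of plain JavaScript. This sketch assumes the query runs at noon UTC on July 10th, 2020 (the post only gives the day) and that "older than 60 days" means a `date` strictly before "now minus 60 days".

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// One measurement date per day, from 2020-01-22 to 2020-07-09 (midnight UTC).
const firstDay = Date.UTC(2020, 0, 22);
const lastDay = Date.UTC(2020, 6, 9);
const days = [];
for (let t = firstDay; t <= lastDay; t += DAY_MS) {
  days.push(new Date(t));
}

// Assumption: "now" is noon UTC on July 10th, 2020; the archiving age is 60 days.
const now = new Date(Date.UTC(2020, 6, 10, 12));
const cutoff = new Date(now.getTime() - 60 * DAY_MS);

// Every day strictly older than the cutoff qualifies for archiving.
const toArchive = days.filter((d) => d < cutoff);

console.log(days.length);             // 170
console.log(cutoff.toISOString());    // 2020-05-11T12:00:00.000Z
console.log(toArchive.length);        // 111
console.log(toArchive.length * 3528); // 391608
```

The May 11th measurement is stamped at midnight, so it is already a little more than 60 days old on July 10th, which is where the extra day comes from.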

When I check the raw JSON output in Compass, I actually see that an error has occurred.

Sadly, it's trimmed. Let's run this again in the new mongosh to see the complete error:

errorMessage: 'Exec error resulting in state FAILURE :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.'

I ran out of RAM...oops! I have a few other collections in my cluster and the 2GB of RAM of my M10 cluster are almost maxed out.

In-memory sorts actually use a lot of RAM and if you can avoid these, I would definitely recommend that you get rid of them. They are forcing some data from your working set out of your cache and that will result in cache pressure and more IOPS.

Let's create the recommended index and see how the situation improves:

db.global_and_us.createIndex({ date: 1, country: 1})

Let's run our query again in the Compass explain plan:

This time, in-memory sort is no longer used, as we can return documents in the same order they appear in our index. 391608 documents are returned and we are using the correct index. This query is MUCH more memory efficient than the previous one.
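To see why the compound index removes the sort, here is a toy model in plain JavaScript. It treats an index as a list of documents kept in key order, which is a simplification of how MongoDB indexes work, and the sample documents are made up.

```javascript
// Made-up sample documents, in arbitrary insertion order.
const docs = [
  { date: "2020-01-23", country: "France" },
  { date: "2020-01-22", country: "Spain" },
  { date: "2020-01-22", country: "Belgium" },
  { date: "2020-01-23", country: "Belgium" },
];

const cmp = (x, y) => (x < y ? -1 : x > y ? 1 : 0);

// Ordering given by a { date: 1 } index: countries are in no particular order
// within a day (here they simply keep their insertion order).
const byDate = (a, b) => cmp(a.date, b.date);

// Ordering given by a { date: 1, country: 1 } index.
const byDateThenCountry = (a, b) =>
  cmp(a.date, b.date) || cmp(a.country, b.country);

// True when the list already satisfies the requested sort order.
function isSorted(list, compare) {
  return list.every((doc, i) => i === 0 || compare(list[i - 1], doc) <= 0);
}

const dateIndexOrder = [...docs].sort(byDate);
const compoundIndexOrder = [...docs].sort(byDateThenCountry);

// Walking the { date: 1 } index does not give the { date: 1, country: 1 } order,
// so the results still have to be sorted in memory.
console.log(isSorted(dateIndexOrder, byDateThenCountry));     // false
// Walking the compound index already yields the requested order.
console.log(isSorted(compoundIndexOrder, byDateThenCountry)); // true
```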

Now that our index is created, we can finally start the archiving process.

Just before we start our archiving process, let's run an aggregation pipeline in MongoDB Compass to check the content of our collection.

[
  {
    '$sort': {
      'date': 1
    }
  }, {
    '$group': {
      '_id': {
        'country': '$country',
        'state': '$state',
        'county': '$county'
      },
      'count': {
        '$sum': 1
      },
      'first_date': {
        '$first': '$date'
      },
      'last_date': {
        '$last': '$date'
      }
    }
  }, {
    '$count': 'number_places'
  }
]

Grouping the documents by country, state and county shows:

how many days are reported: 170,
the first date: 2020-01-22T00:00:00.000+00:00,
the last date: 2020-07-09T00:00:00.000+00:00,
the number of places being monitored: 3528.

Once started, your Online Archive will look like this:

When the initialisation is done, it will look like this:

After some time, all your documents will be migrated to the underlying cloud object storage.

In my case, as I had 599760 documents in my collection and 111 days have been moved to my cloud object storage, I have 599760 - 111 * 3528 = 208152 documents left in my collection in MongoDB Atlas.

PRIMARY> db.global_and_us.count()
208152

Good. Our data is now archived and we don't need to upgrade our cluster to a higher cluster tier!

How to access my archived data?

Usually, archiving data rhymes with "bye bye data". The minute you decide to archive it, it's gone forever and you just take it out of the old dusty storage system when the actual production system just burnt to the ground.

Let me show you how you can keep access to the ENTIRE dataset we just archived on my cloud object storage using MongoDB Atlas Data Federation.

First, let's click on the CONNECT button. Either directly in the Online Archive tab:

Or head to the Data Federation menu on the left to find your automatically configured environment.

Retrieve the connection command line for the Mongo Shell:

Make sure you replace the database and the password in the command. Once you are connected, you can run the following aggregation pipeline:

[
  {
    '$match': {
      'country': 'France'
    }
  }, {
    '$sort': {
      'date': 1
    }
  }, {
    '$group': {
      '_id': {
        'country': '$country',
        'state': '$state',
        'county': '$county'
      },
      'first_date': {
        '$first': '$date'
      },
      'last_date': {
        '$last': '$date'
      },
      'count': {
        '$sum': 1
      }
    }
  }
]

And here is the same query on a single line, easier for a quick copy and paste.

db.global_and_us.aggregate([ { '$match': { 'country': 'France' } }, { '$sort': { 'date': 1 } }, { '$group': { '_id': { 'country': '$country', 'state': '$state', 'county': '$county' }, 'first_date': { '$first': '$date' }, 'last_date': { '$last': '$date' }, 'count': { '$sum': 1 } } } ])

Here is the result I get:

{ "_id" : { "country" : "France", "state" : "Reunion" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "Saint Barthelemy" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "Martinique" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "Mayotte" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "French Guiana" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "Guadeloupe" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "New Caledonia" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "St Martin" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "Saint Pierre and Miquelon" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }
{ "_id" : { "country" : "France", "state" : "French Polynesia" }, "first_date" : ISODate("2020-01-22T00:00:00Z"), "last_date" : ISODate("2020-07-09T00:00:00Z"), "count" : 170 }

As you can see, even though our cold data is archived, we can still access our ENTIRE dataset. The first date is still January 22nd and the last date is still July 9th for a total of 170 days.
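Conceptually, the federated query behaves as if the documents still in the cluster and the documents in the archive formed a single collection. The sketch below mimics that for one place in plain JavaScript. It is only a mental model, not how Atlas Data Federation is implemented, and the `dailyDocs` and `summarize` helpers are hypothetical.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Build one document per day for a single place, starting at a given UTC date.
function dailyDocs(start, count) {
  return Array.from({ length: count }, (_, i) => ({
    country: "France",
    state: "Mayotte",
    date: new Date(start + i * DAY_MS),
  }));
}

// 111 days were archived; the 59 most recent days are still in the cluster.
const archived = dailyDocs(Date.UTC(2020, 0, 22), 111);
const inCluster = dailyDocs(Date.UTC(2020, 4, 12), 59);

// Plain-JavaScript stand-in for the $sort and $group stages of the pipeline above.
function summarize(docs) {
  const sorted = [...docs].sort((a, b) => a.date - b.date);
  return {
    first_date: sorted[0].date,
    last_date: sorted[sorted.length - 1].date,
    count: sorted.length,
  };
}

const all = summarize([...inCluster, ...archived]);
console.log(all.first_date.toISOString()); // 2020-01-22T00:00:00.000Z
console.log(all.last_date.toISOString());  // 2020-07-09T00:00:00.000Z
console.log(all.count);                    // 170
```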

Wrap Up

MongoDB Atlas Online Archive is your new best friend to retire and store your cold data safely in cloud object storage with just a few clicks.

In this tutorial, I showed you how to set up an Online Archive to automatically archive your data to fully-managed cloud object storage while retaining easy access to query the entirety of the dataset in-place, across sources, using Atlas Data Federation.

Just in case this blog post didn't make it clear, Online Archive is NOT a replacement for backups or a backup strategy. These are two completely different topics and they should not be confused.

If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.

To learn more about MongoDB Atlas Data Federation, read the other blog posts in this series, or check out the documentation.

