How to Get MongoDB Data into Parquet in 10 Seconds or Less
For those of you not familiar with Parquet, it’s an amazing file format that does a lot of the heavy lifting to ensure blazing fast query performance on data stored in files. This is a popular file format in the Data Warehouse and Data Lake space as well as for a variety of machine learning tasks.
One thing we frequently see users struggle with is getting NoSQL data into Parquet, since Parquet is a columnar format. Historically, you would have to write some custom code to get the data out of the database, transform it into an appropriate structure, and then probably use a third-party library to write it to Parquet. Fortunately, with MongoDB Atlas Data Federation's $out to cloud object storage (Amazon S3 or Microsoft Azure Blob Storage), you can now convert MongoDB data into Parquet with little effort.
In this blog post, I’m going to walk you through the steps necessary to write data from your Atlas Cluster directly to cloud object storage in the Parquet format and then finish up by reviewing some things to keep in mind when using Parquet with NoSQL data. I’m going to use a sample data set that contains taxi ride data from New York City.
In order to follow along with this tutorial yourself, you will need the following:
- An Atlas cluster with some data in it. (It can be the sample data.)
- An AWS account with privileges to create IAM roles and cloud object storage buckets (to give us access to write data to your cloud object storage bucket).
The first thing you'll need to do is navigate to the “Data Federation” tab on the left-hand side of your Atlas Dashboard and then click “set up manually” in the “create new federated database” dropdown in the top right corner of the UI.
Then, you need to connect your cloud object storage bucket to your Federated Database Instance. This is where we will write the Parquet files. The setup wizard should guide you through this pretty quickly, but you will need access to your AWS credentials. (Be sure to give Atlas Data Federation “Read and Write” access to the bucket so it can write the Parquet files there.)
Once you’ve connected your cloud object storage bucket, we’re going to create a simple data source to query the data in it, so we can verify at the end of this tutorial that we’ve written the data to cloud object storage. Our new setup tool makes it easier than ever to configure your Federated Database Instance to take advantage of the partitioning of data in cloud object storage. Partitioning allows us to only select the relevant data to process in order to satisfy your query. (I’ve put a sample file in there for this test that matches how we’re going to partition the data, by _cab_type.)
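To make the partitioning concrete, here is a rough sketch of what that partition-aware S3 data source could look like in the Federated Database Instance’s storage configuration, which you can also manage from the shell with the storageSetConfig command. This is only an illustration, not output from the wizard; the store, database, and collection names below (taxiS3Store, taxiData, trips) are placeholders I’ve made up, and the setup wizard will generate the real configuration for you.

// Rough sketch only; the setup wizard builds this for you. The names
// "taxiS3Store", "taxiData", and "trips" are illustrative placeholders.
// Run against the Federated Database Instance connection.
db.runCommand({
  "storageSetConfig": {
    "stores": [
      {
        "name": "taxiS3Store",
        "provider": "s3",
        "bucket": "ben.flast",
        "region": "us-east-1",
        "delimiter": "/"
      }
    ],
    "databases": [
      {
        "name": "taxiData",
        "collections": [
          {
            "name": "trips",
            "dataSources": [
              {
                "storeName": "taxiS3Store",
                // {_cab_type string} marks this folder level as a partition
                // on _cab_type, so queries that filter on _cab_type only
                // need to scan the matching files.
                "path": "/taxi-trips/{_cab_type string}"
              }
            ]
          }
        ]
      }
    ]
  }
})

Because each _cab_type value lands in its own prefix, a query that filters on _cab_type only has to touch that partition’s files.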
If your Atlas cluster doesn’t already have data to work with, you can load a dataset into it with mongoimport:

mongoimport --uri mongodb+srv://<USERNAME>:<PASSWORD>@<MONGODB_URI>/<DATABASE> --collection <COLLECTION> --type json --file <FILENAME>
Now we’re going to connect our Atlas cluster, so we can write data from it into the Parquet files. This involves picking the cluster from a list of clusters in your Atlas project and then selecting the databases and collections you’d like to create Data Sources from and dragging them into your Federated Database Instance.
Now we’re going to connect to our Federated Database Instance using the mongo shell and execute the following command. This is going to do quite a few things, so I’m going to explain the important ones.
- First, you can use the ‘filename’ field of the $out stage to have your Federated Database Instance partition files by “_cab_type”, so all the green cabs will go in one set of files and all the yellow cabs will go in another.
- Then, in the format, we’re going to specify parquet and set a maxFileSize and maxRowGroupSize.
  - maxFileSize determines the maximum size each partition will be.
  - maxRowGroupSize determines how records are grouped inside the Parquet file in “row groups,” which will impact the performance of queries on your Parquet files, similar to file size.
- Lastly, we’re using a special Atlas Data Federation aggregation option, “background: true”, which simply tells the Federated Database Instance to keep executing the query even if the client disconnects. (This is handy for long-running queries or environments where your network connection is not stable.)
db.getSiblingDB("clusterData").getCollection("trips").aggregate([
  {
    "$out" : {
      "s3" : {
        "bucket" : "ben.flast",
        "region" : "us-east-1",
        "filename" : {
          "$concat" : [
            "taxi-trips/",
            "$_cab_type",
            "/"
          ]
        },
        "format" : {
          "name" : "parquet",
          "maxFileSize" : "10GB",
          "maxRowGroupSize" : "100MB"
        }
      }
    }
  }
], {
  background: true
})
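Once the $out finishes, you can query the data source we configured against the bucket earlier to confirm the Parquet files landed and are queryable. The virtual database and collection names below (taxiData and trips) are just assumptions based on that earlier setup step; substitute whatever you named yours.

// Still connected to the Federated Database Instance: these reads hit the
// virtual collection backed by the Parquet files in S3, not the Atlas cluster.
// "taxiData" and "trips" are assumed names; use your own data source names.
db.getSiblingDB("taxiData").getCollection("trips").findOne({ "_cab_type": "green" })
db.getSiblingDB("taxiData").getCollection("trips").countDocuments({ "_cab_type": "yellow" })

Because the files are partitioned by _cab_type, a query that filters on _cab_type like the count above only needs to read the matching partition.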
Now, to give you some idea of the performance improvements you can see with object store data, I’ve written three sets of data, each with 10 million documents: one in Parquet, one in uncompressed JSON, and one in compressed JSON. I then ran a count command on each of them, with the following results.
db.trips.count()
10,000,000
| Type | Data Size (GB) | Count Command Latency (Seconds) |
| --- | --- | --- |
| JSON (Uncompressed) | ~16.1 | 297.182 |
| JSON (Compressed) | ~1.1 | 78.070 |
| Parquet | ~1.02 | 1.596 |
So, what have we done and what have we learned?
- We saw how quickly and easily you can create a Federated Database Instance in MongoDB Atlas.
- We connected an Atlas cluster to our Federated Database Instance.
- We used our Federated Database Instance to write Atlas cluster data to cloud object storage in Parquet format.
- We demonstrated how fast and space-efficient Parquet is when compared to JSON.
- Parquet is a super fast columnar format that can be read and written with Atlas Data Federation.
- Atlas Data Federation takes advantage of various pieces of metadata contained in Parquet files, not just the maxRowGroupSize. For instance, if the first stage in your aggregation pipeline was $project: {fieldA: 1, fieldB: 1}, we would only read those two columns from the Parquet file, which results in faster performance and lower costs since we are scanning less data.
- Atlas Data Federation writes Parquet files flexibly: if you have polymorphic data, it will create union columns, so you can have ‘Column A - String’ and ‘Column A - Int’. Atlas Data Federation will read union columns back in as one field, but other tools may not handle union types. So if you’re going to be using these Parquet files with other tools, you should transform your data before the $out stage to ensure there are no union columns.
- Atlas Data Federation will also write files with different schemas if it encounters data with varying schemas throughout the aggregation. It can handle different schemas across files in one collection, but other tools may require a consistent schema across files. So if you’re going to be using these Parquet files with other tools, you should do a $project with $convert before the $out stage to ensure a consistent schema across the generated files (see the sketch after this list).
- Parquet is a great format for your MongoDB data when you need to use column-oriented tools like Tableau for visualizations or machine learning frameworks that use data frames. Parquet can be quickly and easily converted into pandas data frames in Python.
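To make those last recommendations concrete, here’s a minimal sketch of a $project with $convert ahead of the same $out we ran earlier. The field names fare_amount and pickup_datetime are assumptions about the taxi dataset made purely for illustration; the point is simply to pin each output column to a single type so the generated Parquet files have a consistent schema and no union columns.

// Minimal sketch, assuming the trips collection has fare_amount and
// pickup_datetime fields; adjust the $project to your own schema. Each
// $convert pins a field to one type before writing, so the Parquet files
// come out with a consistent schema and no union columns.
db.getSiblingDB("clusterData").getCollection("trips").aggregate([
  {
    "$project": {
      "_cab_type": 1,
      "fare_amount": {
        "$convert": { "input": "$fare_amount", "to": "double", "onError": null, "onNull": null }
      },
      "pickup_datetime": {
        "$convert": { "input": "$pickup_datetime", "to": "date", "onError": null, "onNull": null }
      }
    }
  },
  {
    "$out": {
      "s3": {
        "bucket": "ben.flast",
        "region": "us-east-1",
        "filename": { "$concat": [ "taxi-trips/", "$_cab_type", "/" ] },
        "format": { "name": "parquet", "maxFileSize": "10GB", "maxRowGroupSize": "100MB" }
      }
    }
  }
], {
  background: true
})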