Orchestrating MongoDB & BigQuery for ML Excellence with PyMongoArrow and BigQuery Pandas Libraries
Venkatesh Shanbhag, Maruti C • 4 min read • Published Feb 07, 2024 • Updated Feb 08, 2024
In today's data-driven world, the ability to analyze and efficiently move data across different platforms is crucial. MongoDB Atlas and Google BigQuery are two powerful platforms frequently used for managing and analyzing data. While they excel in their respective domains, connecting and transferring data between them seamlessly can pose challenges. However, with the right tools and techniques, this integration becomes not only possible but also streamlined.
One effective way to establish a smooth pipeline between MongoDB Atlas and BigQuery is by leveraging PyMongoArrow and pandas-gbq, two powerful Python libraries that facilitate data transfer and manipulation. PyMongoArrow acts as a bridge between MongoDB and Arrow, a columnar in-memory analytics layer, enabling efficient data conversion. On the other hand, pandas-gbq is a Python client library for Google BigQuery, allowing easy interaction with BigQuery datasets.
- Easily move a wide range of data types between MongoDB and BigQuery.
- Easily join multiple data sources, such as Cloud Storage, Google Cloud databases, and MongoDB Atlas, and transform the data using pandas DataFrames.
- Use your favorite notebook to build the solution, including the new preview notebook available in BigQuery Studio.
- Perform exploratory data analysis on data read from both Google BigQuery and MongoDB Atlas without physically moving the data between the platforms. This simplifies the work required from data engineers to move data and gives data scientists a faster path to building machine learning (ML) models.
Let's discuss each of the implementation advantages with examples.
Use the commands below to install the required libraries in the notebook environment of your choice. For an easy and scalable setup, use BigQuery Jupyter notebooks or managed Vertex AI Workbench notebooks.
!pip install --upgrade pip
!pip install pymongoarrow
!pip install pandas-gbq
First, establish a connection to your MongoDB Atlas cluster using PyMongoArrow. This involves configuring authentication and selecting the database and collection from which you want to transfer data. Follow the MongoDB Atlas documentation for setting up your cluster, network access, and authentication. Load a sample dataset into your Atlas cluster, get the Atlas connection string, and replace the URI placeholder below with your connection string. The script below is also available in the GitHub repository, along with setup steps.
# Read data from MongoDB
import certifi
import pprint
import pymongo
import pymongoarrow
from pymongo import MongoClient

client = MongoClient("<your Atlas connection string>", tlsCAFile=certifi.where())

# Initialize the database and collection
db = client.get_database("sample_geospatial")
col = db.get_collection("shipwrecks")

for doc in col.find({}):
    pprint.pprint(doc)

from pymongoarrow.monkey import patch_all
patch_all()

# Create a DataFrame from the data read from MongoDB
import pandas as pd
df = col.find_pandas_all({})
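As a side note, patch_all() also adds Arrow- and NumPy-returning variants to the collection, and find_pandas_all accepts a regular MongoDB query filter, so filtering can be pushed down to Atlas instead of loading the whole collection. A minimal sketch follows; the depth field and threshold come from the shipwrecks sample data and are only illustrative.

# Push a filter down to MongoDB instead of reading the whole collection.
# "depth" is a field in the sample shipwrecks data; adjust for your own schema.
shallow_df = col.find_pandas_all({"depth": {"$lt": 10}})

# The same pattern can return an Apache Arrow table instead of a pandas DataFrame.
arrow_table = col.find_arrow_all({"depth": {"$lt": 10}})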
Transform the data into the required format. For example, remove unsupported data types such as the MongoDB ObjectId, or convert MongoDB objects to JSON, before writing the data to BigQuery. Please refer to the documentation to learn more about the data types supported by pandas-gbq and PyMongoArrow.
# Transform the schema into the required format.
# e.g., the ObjectId is not supported by pandas-gbq and can be removed or converted to a string.

del(df["_id"])
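A quick way to spot anything else that still needs conversion is to inspect the DataFrame's dtypes. The sketch below also shows a hedged catch-all that casts remaining nested or extension columns to strings; the coordinates column is from the sample data, so adjust the list to your own schema.

# Inspect the inferred schema to spot columns that may still need conversion.
print(df.dtypes)

# Hedged catch-all: cast columns that pandas-gbq cannot map (e.g., nested
# documents or arrays) to strings. Adjust the column list to your own data.
for column in ["coordinates"]:
    if column in df.columns:
        df[column] = df[column].astype(str)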
Once you have retrieved data from MongoDB Atlas and converted it into a suitable format using PyMongoArrow, you can transfer it to BigQuery using either the pandas-gbq or the google-cloud-bigquery library. In this article, we are using pandas-gbq. Refer to the documentation for more details on the differences between the pandas-gbq and google-cloud-bigquery libraries. Ensure you have a dataset in BigQuery to which you want to load the MongoDB data; you can create a new dataset or use an existing one.
# Write the transformed data to BigQuery.

import pandas_gbq

pandas_gbq.to_gbq(df[0:100], "gcp-project-name.bigquery-dataset-name.bigquery-table-name", project_id="gcp-project-name")
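To sanity-check the load, and to see how the MongoDB data can now be queried alongside anything else in BigQuery, you can read the table straight back into pandas with pandas-gbq. This is a minimal sketch that reuses the placeholder project, dataset, and table names from above.

# Read the newly loaded table back from BigQuery for a quick check.
check_df = pandas_gbq.read_gbq(
    "SELECT * FROM `gcp-project-name.bigquery-dataset-name.bigquery-table-name` LIMIT 10",
    project_id="gcp-project-name",
)
print(check_df.head())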
As you embark on building your pipeline, optimizing the data transfer process between MongoDB Atlas and BigQuery is essential for performance. A few points to consider:
- Batch DataFrames into chunks, especially when dealing with large datasets, to prevent memory issues (see the sketch after this list).
- Handle schema mapping and data type conversions properly to ensure compatibility between the source and destination databases.
- With the right tools, such as Google Colab or Vertex AI Workbench, this pipeline can become a cornerstone of your data ecosystem, facilitating smooth and reliable data movement between MongoDB Atlas and Google BigQuery.
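As a rough sketch of the first point, pandas-gbq can split the upload itself via its chunksize parameter, or you can append slices in a loop. The names below reuse the placeholders from earlier, and the batch size is only an assumption.

# Option 1: let pandas-gbq split the upload into chunks of rows.
pandas_gbq.to_gbq(
    df,
    "gcp-project-name.bigquery-dataset-name.bigquery-table-name",
    project_id="gcp-project-name",
    chunksize=10000,
    if_exists="replace",
)

# Option 2: append manageable slices yourself, e.g., when the DataFrame is
# built incrementally from several MongoDB queries.
batch_size = 10000
for start in range(0, len(df), batch_size):
    pandas_gbq.to_gbq(
        df[start:start + batch_size],
        "gcp-project-name.bigquery-dataset-name.bigquery-table-name",
        project_id="gcp-project-name",
        if_exists="append",
    )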
BigQuery DataFrames (bigframes) is a Python API that provides a pandas-compatible DataFrame and machine learning capabilities powered by the BigQuery engine, with a familiar pandas interface for data manipulation and analysis. Once the data from MongoDB has been written into BigQuery, BigQuery DataFrames offer a user-friendly way to analyze petabytes of data with ease. A pandas DataFrame can be read directly into a BigQuery DataFrame using the bigframes.pandas library. Install the bigframes library to use BigQuery DataFrames.
!pip install bigframes
Before reading the pandas DataFrame into a BigQuery DataFrame, rename the columns as per Google's schema guidelines. (Please note that at the time of publication, this feature may not be generally available.)
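BigQuery column names traditionally allow only letters, numbers, and underscores and must not start with a number, so a simple pandas rename along the lines of the hedged sketch below is usually enough; the helper name is illustrative.

import re

# Replace characters BigQuery does not accept in column names with
# underscores, and prefix names that start with a digit.
def to_bigquery_name(name: str) -> str:
    cleaned = re.sub(r"[^0-9a-zA-Z_]", "_", str(name))
    return f"_{cleaned}" if cleaned[:1].isdigit() else cleaned

df = df.rename(columns=to_bigquery_name)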
import bigframes.pandas as bpd

bpd.options.bigquery.project = "GCP project ID"

# df is the pandas DataFrame created earlier from the MongoDB data
bdf = bpd.read_pandas(df)
For more information on using BigQuery DataFrames, visit the Google Cloud documentation.
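Once the data is in a BigQuery DataFrame, everyday pandas-style operations are executed by the BigQuery engine rather than in your notebook's memory. A minimal sketch follows; the depth column comes from the shipwrecks sample and the destination table name is a placeholder.

# pandas-style operations on the BigQuery DataFrame run in BigQuery itself.
print(bdf.shape)
print(bdf.head())

# Simple aggregations are also pushed down ("depth" is from the sample data).
print(bdf["depth"].mean())

# Persist the result as a managed BigQuery table if needed.
bdf.to_gbq("bigquery-dataset-name.shipwrecks_bigframes", if_exists="replace")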
Creating a robust pipeline between MongoDB Atlas and BigQuery using PyMongoArrow and pandas-gbq opens up a world of possibilities for efficient data movement and analysis. This integration allows for the seamless transfer of data, enabling organizations to leverage the strengths of both platforms for comprehensive data analytics and decision-making.