Orchestrating MongoDB & BigQuery for ML Excellence with PyMongoArrow and BigQuery Pandas Libraries

Venkatesh Shanbhag, Maruti C • 4 min read • Published Feb 07, 2024 • Updated Feb 08, 2024
Pandas • Google Cloud • AI • Python • MongoDB
In today's data-driven world, the ability to analyze and efficiently move data across different platforms is crucial. MongoDB Atlas and Google BigQuery are two powerful platforms frequently used for managing and analyzing data. While they excel in their respective domains, connecting and transferring data between them seamlessly can pose challenges. However, with the right tools and techniques, this integration becomes not only possible but also streamlined.
One effective way to establish a smooth pipeline between MongoDB Atlas and BigQuery is by leveraging PyMongoArrow and pandas-gbq, two powerful Python libraries that facilitate data transfer and manipulation. PyMongoArrow acts as a bridge between MongoDB and Arrow, a columnar in-memory analytics layer, enabling efficient data conversion. On the other hand, pandas-gbq is a Python client library for Google BigQuery, allowing easy interaction with BigQuery datasets.
Image 1: Architecture diagram for MongoDB interaction with BigQuery and Vertex AI using Python libraries
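As a quick illustration of the Arrow bridge, PyMongoArrow can materialize a query result directly as an Apache Arrow Table. The snippet below is a minimal sketch; the connection string is a placeholder, and the database and collection names match the sample dataset used later in this tutorial.

from pymongo import MongoClient
from pymongoarrow.api import find_arrow_all

# Placeholder connection string - replace with your Atlas URI
client = MongoClient("<your-atlas-connection-string>")
coll = client["sample_geospatial"]["shipwrecks"]

# Returns an Apache Arrow Table, a columnar format that converts
# cheaply to pandas, NumPy, or Parquet
arrow_table = find_arrow_all(coll, {})
print(arrow_table.schema)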

Advantages of a Python-based solution

  1. Easily move a wide range of data types between MongoDB and BigQuery.
  2. Easily join multiple data sources, such as cloud storage, Google databases, and MongoDB Atlas, and transform the data using pandas DataFrames.
  3. Use your favorite notebook to build the solution, including the notebooks available in preview in BigQuery Studio.
  4. Perform exploratory data analysis on data read from both Google BigQuery and MongoDB Atlas without physically moving the data between the platforms (a quick sketch follows below). This reduces the data-movement effort for data engineers and gives data scientists a faster way to build machine learning (ML) models.
Let's discuss each of the implementation advantages with examples.
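As a quick illustration of points 2 and 4, data already in BigQuery and data in MongoDB Atlas can be combined in a single pandas DataFrame without a separate ETL job. This is only a sketch; the BigQuery table, MongoDB namespace, and join key are hypothetical placeholders.

import pandas_gbq
from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()

# Read a table from BigQuery into pandas (hypothetical table name)
bq_df = pandas_gbq.read_gbq(
    "SELECT * FROM `my-project.my_dataset.orders`",
    project_id="my-project",
)

# Read a collection from MongoDB Atlas into pandas (placeholder URI)
client = MongoClient("<your-atlas-connection-string>")
mongo_df = client["my_db"]["customers"].find_pandas_all({})

# Join and explore the combined data in memory (hypothetical join key)
combined = bq_df.merge(mongo_df, on="customer_id", how="inner")
print(combined.describe())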

ETL data from MongoDB to BigQuery

Let’s consider a sample shipwreck dataset available on MongoDB Atlas for this use case.
Use the commands below to install the required libraries in the notebook environment of your choice. For an easy and scalable setup, use BigQuery Jupyter notebooks or managed Vertex AI Workbench notebooks.
Image 2: Screenshot of a Jupyter notebook for the BigQuery DataFrames implementation
!pip install --upgrade pip
!pip install pymongoarrow
!pip install pandas-gbq
First, establish a connection to your MongoDB Atlas cluster using PyMongoArrow. This involves configuring authentication and selecting the database and collection from which you want to transfer data. Follow the MongoDB Atlas documentation to set up your cluster, network access, and authentication, and load a sample dataset to your Atlas cluster. Get the Atlas connection string and replace the placeholder URI string below with your connection string. The script below is also available in the GitHub repository, along with setup steps.
# Read data from MongoDB
import certifi
import pprint
from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

# Replace the placeholder with your Atlas connection string
client = MongoClient("<your-atlas-connection-string>", tlsCAFile=certifi.where())

# Initialize database and collection
db = client.get_database("sample_geospatial")
col = db.get_collection("shipwrecks")

for doc in col.find({}):
    pprint.pprint(doc)

# Add PyMongoArrow's find_pandas_all() and related methods to Collection
patch_all()

# Create a DataFrame from the data read from MongoDB
import pandas as pd

df = col.find_pandas_all({})
Transform the data into the required format. For example, remove data types that are not supported, like the MongoDB ObjectId, or convert MongoDB objects to JSON before writing them to BigQuery. Please refer to the documentation to learn more about the data types supported by pandas-gbq and PyMongoArrow.
# Transform the schema into the required format.
# For example, the ObjectId type is not supported in DataFrames, so it can be removed or converted to a string.
del df["_id"]
Once you have retrieved data from MongoDB Atlas and converted it into a suitable format using PyMongoArrow, you can transfer it to BigQuery using either the pandas-gbq or the google-cloud-bigquery library. In this article, we are using pandas-gbq. Refer to the documentation for more details on the differences between the pandas-gbq and google-cloud-bigquery libraries. Ensure you have a dataset in BigQuery to which you want to load the MongoDB data. You can create a new dataset or use an existing one.
# Write the transformed data to BigQuery.
import pandas_gbq

pandas_gbq.to_gbq(df[0:100], "gcp-project-name.bigquery-dataset-name.bigquery-table-name", project_id="gcp-project-name")
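For reference, the same write can be done with the google-cloud-bigquery client instead of pandas-gbq. The following is a minimal sketch; the project, dataset, and table names are placeholders.

from google.cloud import bigquery

# Placeholder fully qualified table ID
table_id = "gcp-project-name.bigquery-dataset-name.bigquery-table-name"

bq_client = bigquery.Client(project="gcp-project-name")

# Load the pandas DataFrame into BigQuery and wait for the job to finish
job = bq_client.load_table_from_dataframe(df[0:100], table_id)
job.result()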
As you embark on building your pipeline, optimizing the data transfer process between MongoDB Atlas and BigQuery is essential for performance. A few points to consider:
  1. Batch DataFrames into chunks, especially when dealing with large datasets, to prevent memory issues (see the sketch after this list).
  2. Handle schema mapping and data type conversions properly to ensure compatibility between the source and destination databases.
  3. With the right tools, like Google Colab and Vertex AI Workbench, this pipeline can become a cornerstone of your data ecosystem, facilitating smooth and reliable data movement between MongoDB Atlas and Google BigQuery.
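For the batching point above, one simple approach (a sketch, not the only option) is to append fixed-size slices of the DataFrame with pandas-gbq instead of writing everything in a single call:

import pandas_gbq

destination = "gcp-project-name.bigquery-dataset-name.bigquery-table-name"
chunk_size = 10_000  # tune to your memory budget

# Write the DataFrame in slices, appending each slice to the same table
for start in range(0, len(df), chunk_size):
    pandas_gbq.to_gbq(
        df[start:start + chunk_size],
        destination,
        project_id="gcp-project-name",
        if_exists="append",
    )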

Introduction to Google BigQuery DataFrames (bigframes)

Google bigframes is a Python API that provides a pandas-compatible DataFrame and machine learning capabilities powered by the BigQuery engine, with a familiar pandas interface for data manipulation and analysis. Once the data from MongoDB is written into BigQuery, BigQuery DataFrames unlock a user-friendly way to analyze petabytes of data with ease. A pandas DataFrame can be read directly into a BigQuery DataFrame using the Python bigframes.pandas library. Install the bigframes library to use BigQuery DataFrames.
!pip install bigframes
Before reading the pandas DataFrames into BigQuery DataFrames, rename the columns as per Google's schema guidelines. (Please note that at the time of publication, the feature may not be GA).
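For example, BigQuery traditionally allows only letters, digits, and underscores in column names, so a simple sanitization pass over the pandas columns (a sketch, assuming the DataFrame df from the earlier steps) could look like this:

import re

# Replace any character BigQuery does not accept in a column name with "_"
df.columns = [re.sub(r"[^0-9a-zA-Z_]", "_", str(c)) for c in df.columns]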
import bigframes.pandas as bpd

# Set the Google Cloud project that will run the BigQuery jobs
bpd.options.bigquery.project = "GCP project ID"

# df is the pandas DataFrame created from the MongoDB data above
bdf = bpd.read_pandas(df)
For more information on using Google Cloud BigQuery DataFrames, visit the Google Cloud documentation.
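Once the data is in a BigQuery DataFrame, familiar pandas-style operations are pushed down to the BigQuery engine rather than executed in the notebook's memory. A minimal sketch, assuming the bdf created above:

# Peek at the first rows; the computation runs in BigQuery
print(bdf.head())

# Pandas-style summary statistics computed by the BigQuery engine
print(bdf.describe())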

Conclusion

Creating a robust pipeline between MongoDB Atlas and BigQuery using PyMongoArrow and pandas-gbq opens up a world of possibilities for efficient data movement and analysis. This integration allows for the seamless transfer of data, enabling organizations to leverage the strengths of both platforms for comprehensive data analytics and decision-making.

