Docs Menu
Docs Home
/ / /
PyMongoArrow

Quick Start

On this page

  • Prerequisites
  • Extending PyMongo
  • Test Data
  • Defining the Schema
  • Find Operations
  • Aggregate Operations
  • Writing to MongoDB
  • Ignore Empty Values
  • Writing to Other Formats

This tutorial is intended as an introduction to working with PyMongoArrow. The tutorial assumes the reader is familiar with basic PyMongo and MongoDB concepts.

Ensure that you have the PyMongoArrow distribution installed. In the Python shell, the following should run without raising an exception:

>>> import pymongoarrow as pma

This tutorial also assumes that a MongoDB instance is running on the default host and port. After you have downloaded and installed MongoDB, you can start it as shown in the following code example:

$ mongod

The pymongoarrow.monkey module provides an interface to patch PyMongo in place, and add PyMongoArrow functionality directly to Collection instances:

from pymongoarrow.monkey import patch_all
patch_all()

After you run the monkey.patch_all() method, new instances of the Collection class will contain the PyMongoArrow APIs-- for example, the pymongoarrow.api.find_pandas_all() method.

Note

You can also use any of the PyMongoArrow APIs by importing them from the pymongoarrow.api module. If you do, you must pass the instance of the Collection on which the operation is to be run as the first argument when calling the API method.

The following code uses PyMongo to add sample data to your cluster:

from datetime import datetime
from pymongo import MongoClient
client = MongoClient()
client.db.data.insert_many([
{'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
{'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
{'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
{'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}])

PyMongoArrow relies on a data schema to marshall query result sets into tabular form. If you don't provide this schema, PyMongoArrow infers one from the data. You can define the schema by creating a Schema object and mapping the field names to type-specifiers, as shown in the following example:

from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})

MongoDB uses embedded documents to represent nested data. PyMongoArrow offers first-class support for these documents:

schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}})

PyMongoArrow also supports lists and nested lists:

from pyarrow import list_, string
schema = Schema({'txns': list_(string())})
polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)

Tip

PyMongoArrow includes multiple permissible type-identifiers for each supported BSON type. For a full list of these data types and their associated type-identifiers, see Data Types.

The following code example shows how to load all records that have a non-zero value for the amount field as a pandas.DataFrame object:

df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)

You can also load the same result set as a pyarrow.Table instance:

arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)

Or as a polars.DataFrame instance:

df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)

Or as a NumPy arrays object:

ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)

When using NumPy, the return value is a dictionary where the keys are field names and the values are the corresponding numpy.ndarray instances.

Note

In all of the preceding examples, you can omit the schema as shown in the following example:

arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})

If you omit the schema, PyMongoArrow tries to automatically apply a schema based on the data contained in the first batch.

Running an aggregate operation is similar to running a find operation, but it takes a sequence of operations to perform.

The following is a simple example of the aggregate_pandas_all() method that outputs a new dataframe in which all _id values are grouped together and their amount values summed:

df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': { '$sum': '$amount' }}}])

You can also run aggregate operations on embedded documents. The following example unwinds values in the nested txn field, counts the number of each value, then returns the results as a list of NumPy ndarray objects, sorted in descending order:

pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {"count": -1}}]
ndarrays = client.db.data.aggregate_numpy_all(pipeline)

Tip

For more information about aggregation pipelines, see the MongoDB Server documentation.

You can use the write() method to write objects of the following types to MongoDB:

  • Arrow Table

  • Pandas DataFrame

  • NumPy ndarray

  • Polars DataFrame

from pymongoarrow.api import write
from pymongo import MongoClient
coll = MongoClient().db.my_collection
write(coll, df)
write(coll, arrow_table)
write(coll, ndarrays)

Note

NumPy arrays are specified as dict[str, ndarray].

The write() method optionally accepts a boolean exclude_none parameter. If you set this parameter to True, the driver doesn't write empty values to the database. If you set this parameter to False or leave it blank, the driver writes None for each empty field.

The code in the following example writes an Arrow Table to MongoDB twice. One of the values in the 'b' field is set to None.

The first call to the write() method omits the exclude_none parameter, so it defaults to False. All values in the Table, including None, are written to the database. The second call to the write() method sets exclude_none to True, so the empty value in the 'b' field is ignored.

data_a = [1, 2, 3]
data_b = [1, None, 3]
data = Table.from_pydict(
{
"a": data_a,
"b": data_b,
},
)
coll.drop()
write(coll, data)
col_data = list(coll.find({}))
coll.drop()
write(coll, data, exclude_none=True)
col_data_exclude_none = list(coll.find({}))
print(col_data)
print(col_data_exclude_none)
{'_id': ObjectId('...'), 'a': 1, 'b': 1}
{'_id': ObjectId('...'), 'a': 2, 'b': None}
{'_id': ObjectId('...'), 'a': 3, 'b': 3}
{'_id': ObjectId('...'), 'a': 1, 'b': 1}
{'_id': ObjectId('...'), 'a': 2}
{'_id': ObjectId('...'), 'a': 3, 'b': 3}

Once result sets have been loaded, you can then write them to any format that the package supports.

For example, to write the table referenced by the variable arrow_table to a Parquet file named example.parquet, run the following code:

import pyarrow.parquet as pq
pq.write_table(arrow_table, 'example.parquet')

Pandas also supports writing DataFrame instances to a variety of formats, including CSV and HDF. To write the data frame referenced by the variable df to a CSV file named out.csv, run the following code:

df.to_csv('out.csv', index=False)

The Polars API is a mix of the two preceding examples:

import polars as pl
df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
df.write_parquet('example.parquet')

Note

Nested data is supported for parquet read and write operations, but is not well supported by Arrow or Pandas for CSV read and write operations.

Back

Installing and Upgrading