Quick Start
This tutorial is intended as an introduction to working with PyMongoArrow. The tutorial assumes the reader is familiar with basic PyMongo and MongoDB concepts.
Prerequisites
Ensure that you have the PyMongoArrow distribution installed. In the Python shell, the following should run without raising an exception:
import pymongoarrow as pma
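If the import raises an ImportError, you can typically install PyMongoArrow from PyPI; the exact steps may vary with your environment:
$ python -m pip install pymongoarrow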
This tutorial also assumes that a MongoDB instance is running on the default host and port. After you have downloaded and installed MongoDB, you can start it as shown in the following code example:
$ mongod
Extending PyMongo
The pymongoarrow.monkey module provides an interface to patch PyMongo in place, and add PyMongoArrow functionality directly to Collection instances:
from pymongoarrow.monkey import patch_all

patch_all()
After you run the monkey.patch_all() method, new instances of the Collection class will contain the PyMongoArrow APIs, for example, the pymongoarrow.api.find_pandas_all() method.
Note
You can also use any of the PyMongoArrow APIs by importing them from the pymongoarrow.api module. If you do, you must pass the instance of the Collection on which the operation is to be run as the first argument when calling the API method.
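For example, the following sketch calls find_pandas_all() directly from the pymongoarrow.api module, passing the collection as the first argument. It assumes the client and the client.db.data collection created in the Test Data section below:
from pymongo import MongoClient
from pymongoarrow.api import find_pandas_all

client = MongoClient()

# Pass the Collection instance as the first argument instead of
# calling the method on the collection itself.
df = find_pandas_all(client.db.data, {'amount': {'$gt': 0}})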
Test Data
The following code uses PyMongo to add sample data to your cluster:
from datetime import datetime
from pymongo import MongoClient

client = MongoClient()
client.db.data.insert_many([
    {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1),
     'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
    {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11),
     'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
    {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9),
     'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
    {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31),
     'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}])
Defining the Schema
PyMongoArrow relies on a data schema to marshal query result sets into tabular form. If you don't provide this schema, PyMongoArrow infers one from the data. You can define the schema by creating a Schema object and mapping the field names to type-specifiers, as shown in the following example:
from pymongoarrow.api import Schema

schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
MongoDB uses embedded documents to represent nested data. PyMongoArrow offers first-class support for these documents:
schema = Schema({'_id': int, 'amount': float,
                 'account': {'name': str, 'account_number': int}})
PyMongoArrow also supports lists and nested lists:
from pyarrow import list_, string

schema = Schema({'txns': list_(string())})
polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)
Tip
PyMongoArrow includes multiple permissible type-identifiers for each supported BSON type. For a full list of these data types and their associated type-identifiers, see Data Types.
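For example, PyArrow type objects can also serve as type-specifiers. The following is an illustrative sketch only, assuming that int64(), float64(), and timestamp('ms') map to the same columns as the Python int, float, and datetime specifiers above; see Data Types for the authoritative list:
from pyarrow import float64, int64, timestamp

# Equivalent to the Python-type version of the schema, expressed with PyArrow types.
schema = Schema({'_id': int64(), 'amount': float64(), 'last_updated': timestamp('ms')})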
Find Operations
The following code example shows how to load all records that have a non-zero value for the amount field as a pandas.DataFrame object:
df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)
You can also load the same result set as a pyarrow.Table instance:
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
Or as a polars.DataFrame instance:
df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)
Or as a dictionary of NumPy ndarray objects:
ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
When using NumPy, the return value is a dictionary where the keys are field names and the values are the corresponding numpy.ndarray instances.
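For example, you could access a single field's array from that result. This is a minimal sketch assuming the ndarrays variable from the previous call:
# 'amount' is one of the keys returned by find_numpy_all().
amounts = ndarrays['amount']
print(amounts.sum())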
Note
In all of the preceding examples, you can omit the schema as shown in the following example:
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
If you omit the schema, PyMongoArrow tries to automatically apply a schema based on the data contained in the first batch.
Aggregate Operations
Running an aggregate operation is similar to running a find operation, but instead of a query filter, it takes an aggregation pipeline: a sequence of stages to perform.
The following is a simple example of the aggregate_pandas_all() method that outputs a new dataframe in which all _id values are grouped together and their amount values summed:
df = client.db.data.aggregate_pandas_all(
    [{'$group': {'_id': None, 'total_amount': {'$sum': '$amount'}}}])
You can also run aggregate operations on embedded documents.
The following example unwinds the values in the txns list field, counts the number of occurrences of each value, then returns the results as a dictionary of NumPy ndarray objects, sorted by count in descending order:
pipeline = [
    {'$unwind': '$txns'},
    {'$group': {'_id': '$txns', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
]
ndarrays = client.db.data.aggregate_numpy_all(pipeline)
Tip
For more information about aggregation pipelines, see the MongoDB Server documentation.
Writing to MongoDB
You can use the write() method to write objects of the following types to MongoDB:
Arrow Table
Pandas DataFrame
NumPy ndarray
Polars DataFrame
from pymongoarrow.api import write
from pymongo import MongoClient

coll = MongoClient().db.my_collection

write(coll, df)
write(coll, arrow_table)
write(coll, ndarrays)
Note
NumPy arrays are specified as dict[str, ndarray].
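For example, here is a minimal sketch of writing such a dictionary directly; the field names 'x' and 'y' are illustrative:
import numpy as np

# Keys become field names; each array supplies that field's values.
write(coll, {'x': np.array([1, 2, 3]), 'y': np.array([4.0, 5.0, 6.0])})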
Ignore Empty Values
The write() method optionally accepts a boolean exclude_none parameter. If you set this parameter to True, the driver doesn't write empty values to the database. If you set this parameter to False or leave it blank, the driver writes None for each empty field.
The code in the following example writes an Arrow Table to MongoDB twice. One of the values in the 'b' field is set to None. The first call to the write() method omits the exclude_none parameter, so it defaults to False. All values in the Table, including None, are written to the database. The second call to the write() method sets exclude_none to True, so the empty value in the 'b' field is ignored.
from pyarrow import Table

data_a = [1, 2, 3]
data_b = [1, None, 3]

data = Table.from_pydict(
    {
        "a": data_a,
        "b": data_b,
    },
)

coll.drop()
write(coll, data)
col_data = list(coll.find({}))

coll.drop()
write(coll, data, exclude_none=True)
col_data_exclude_none = list(coll.find({}))

print(col_data)
print(col_data_exclude_none)
{'_id': ObjectId('...'), 'a': 1, 'b': 1}
{'_id': ObjectId('...'), 'a': 2, 'b': None}
{'_id': ObjectId('...'), 'a': 3, 'b': 3}
{'_id': ObjectId('...'), 'a': 1, 'b': 1}
{'_id': ObjectId('...'), 'a': 2}
{'_id': ObjectId('...'), 'a': 3, 'b': 3}
Writing to Other Formats
After you load a result set, you can write it to any format that the package supports.
For example, to write the table referenced by the variable arrow_table to a Parquet file named example.parquet, run the following code:
import pyarrow.parquet as pq

pq.write_table(arrow_table, 'example.parquet')
Pandas also supports writing DataFrame instances to a variety of formats, including CSV and HDF. To write the data frame referenced by the variable df to a CSV file named out.csv, run the following code:
df.to_csv('out.csv', index=False)
The Polars API is a mix of the two preceding examples:
import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
df.write_parquet('example.parquet')
Note
Nested data is supported for Parquet read and write operations, but is not well supported by Arrow or Pandas for CSV read and write operations.