Analyze Time-Series Data with Python and MongoDB Using PyMongoArrow and Pandas
In today’s data-centric world, time-series data has become indispensable for driving key organizational decisions, trend analyses, and forecasts. This kind of data is everywhere — from stock markets and IoT sensors to user behavior analytics. But as these datasets grow in volume and complexity, so does the challenge of efficiently storing and analyzing them. Whether you’re an IoT developer or a data analyst dealing with time-sensitive information, MongoDB offers a robust ecosystem tailored to meet both your storage and analytics needs for complex time-series data.
MongoDB has built-in support for storing time-series data in a special type of collection called a time-series collection. Time-series collections differ from normal collections: they use an underlying columnar storage format and store data in time order with an automatically created clustered index. The columnar storage format provides the following benefits:
- Reduced complexity: The columnar format is tailored for time-series data, making it easier to manage and query.
- Query efficiency: MongoDB automatically creates an internal clustered index on the time field which improves query performance.
- Disk usage: This storage approach uses disk space more efficiently compared to traditional collections.
- I/O optimization: The read operations require fewer input/output operations, improving the overall system performance.
- Cache usage: The design allows for better utilization of the WiredTiger cache, further enhancing query performance.
In this tutorial, we will create a time-series collection and store some time-series data in it. We will see how you can query it in MongoDB, as well as how you can read that data into a pandas DataFrame, run some analytics on it, and write the modified data back to MongoDB. This tutorial is meant to be a complete deep dive into working with time-series data in MongoDB.
We will be using the following tools/frameworks:
- MongoDB Atlas database, to store our time-series data. If you don’t already have an Atlas cluster created, go ahead and create one, set up a user, and add your connection IP address to your IP access list.
Note: Before running any code or installing any Python packages, we strongly recommend setting up a separate Python environment. This helps to isolate dependencies, manage packages, and avoid conflicts that may arise from different package versions. Creating an environment is an optional but highly recommended step.
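If you’d like to set one up, here is a minimal sketch using Python’s built-in venv module (the environment name is arbitrary, and the activate command shown is for macOS/Linux):

```bash
# Create and activate a virtual environment
python3 -m venv timeseries-env
source timeseries-env/bin/activate

# Install the packages used in this tutorial
pip install pymongo jupyter
```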
At this point, we are assuming that you have an Atlas cluster created and ready to be used, and PyMongo and Jupyter Notebook installed. Let’s go ahead and launch Jupyter Notebook by running the following command in the terminal:
```bash
jupyter notebook
```
Once you have the Jupyter Notebook up and running, let’s go ahead and fetch the connection string of your MongoDB Atlas cluster and store that as an environment variable, which we will use later to connect to our database. After you have done that, let’s go ahead and connect to our Atlas cluster by running the following commands:
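For example, on macOS or Linux you can set the environment variable in the same terminal session from which you launch Jupyter (the URI below is a placeholder; substitute your own cluster’s connection string):

```bash
export MONGODB_CONNECTION_STRING="mongodb+srv://<username>:<password>@<cluster-url>/"
```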
```python
import os

from pymongo import MongoClient

MONGO_CONN_STRING = os.environ.get("MONGODB_CONNECTION_STRING")

client = MongoClient(MONGO_CONN_STRING)
```
Next, we are going to create a new database and a collection in our cluster to store the time-series data. We will call this database “stock_data” and the collection “stocks”.
```python
# Let's create a new database called "stock_data"
db = client.stock_data

# Let's create a new time-series collection in the "stock_data" database called "stocks"
collection = db.create_collection("stocks", timeseries={
    "timeField": "timestamp",
    "metaField": "metadata",
    "granularity": "hours"
})
```
Here, we used the db.create_collection() method to create a time-series collection called “stocks”. In the example above, “timeField”, “metaField”, and “granularity” are reserved fields (for more information on what these are, visit our documentation). The “timeField” option specifies the name of the field in your collection that will contain the date in each time-series document.
The “metaField” option specifies the name of the field in your collection that will contain the metadata in each time-series document.
Finally, the “granularity” option specifies how frequently data will be ingested in your time-series collection.
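If you’d like to confirm how the collection was configured, one quick sketch (using PyMongo’s standard list_collections() method) is to inspect the collection’s stored options:

```python
# Look up the metadata for the "stocks" collection and print its options,
# which should include the time-series configuration we passed in above
for coll_info in db.list_collections(filter={"name": "stocks"}):
    print(coll_info["options"])
```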
Now, let’s insert some stock-related information into our collection. We are interested in storing and analyzing the stock of a specific company called “ABC”, which trades its stock on “NASDAQ”.
We are storing some price metrics of this stock at an hourly interval and for each time interval, we are storing the following information:
- open: the opening price at which the stock traded when the market opened
- close: the final price at which the stock traded when the trading period ended
- high: the highest price at which the stock traded during the trading period
- low: the lowest price at which the stock traded during the trading period
- volume: the total number of shares traded during the trading period
Now that we have become experts on stock trading and terminology (sarcasm), let’s insert some documents into our time-series collection. Here we have four sample documents, with data points captured at an interval of one hour.
```python
from datetime import datetime

# Create some sample data
data = [
    {
        "metadata": {
            "stockSymbol": "ABC",
            "exchange": "NASDAQ"
        },
        "timestamp": datetime(2023, 9, 12, 15, 19, 48),
        "open": 54.80,
        "high": 59.20,
        "low": 52.60,
        "close": 53.50,
        "volume": 18000
    },
    {
        "metadata": {
            "stockSymbol": "ABC",
            "exchange": "NASDAQ"
        },
        "timestamp": datetime(2023, 9, 12, 16, 19, 48),
        "open": 51.00,
        "high": 54.30,
        "low": 50.50,
        "close": 51.80,
        "volume": 12000
    },
    {
        "metadata": {
            "stockSymbol": "ABC",
            "exchange": "NASDAQ"
        },
        "timestamp": datetime(2023, 9, 12, 17, 19, 48),
        "open": 52.00,
        "high": 53.10,
        "low": 50.50,
        "close": 52.90,
        "volume": 10000
    },
    {
        "metadata": {
            "stockSymbol": "ABC",
            "exchange": "NASDAQ"
        },
        "timestamp": datetime(2023, 9, 12, 18, 19, 48),
        "open": 52.80,
        "high": 60.20,
        "low": 52.60,
        "close": 55.50,
        "volume": 30000
    }
]

# Insert the data into our collection
collection.insert_many(data)
```
Now, let’s run a find query on our collection to retrieve data at a specific timestamp. Run this query in the Jupyter Notebook after the previous script.
```python
collection.find_one({"timestamp": datetime(2023, 9, 12, 15, 19, 48)})
```
As you can see from the output, we were able to query our time-series collection and retrieve data points at a specific timestamp.
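You can also filter on the metadata field or on a time range. For example, here is a minimal sketch (using the same field names as above) that retrieves all of the “ABC” data points within our one-day window and prints their closing prices:

```python
from datetime import datetime

cursor = collection.find({
    "metadata.stockSymbol": "ABC",
    "timestamp": {
        "$gte": datetime(2023, 9, 12, 15, 0, 0),
        "$lte": datetime(2023, 9, 12, 19, 0, 0),
    }
})

for doc in cursor:
    print(doc["timestamp"], doc["close"])
```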
Similarly, you can run more powerful queries on your time-series collection by using the aggregation pipeline. Covering it fully is beyond the scope of this tutorial, but if you want to learn more, check out the MongoDB aggregation pipeline documentation.
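Just to give a flavor of what that looks like, here is a minimal sketch of a single-stage pipeline that computes the average closing price and total volume per stock symbol (this is a standard $group stage, not something specific to time-series collections):

```python
pipeline = [
    {"$group": {
        "_id": "$metadata.stockSymbol",
        "avgClose": {"$avg": "$close"},
        "totalVolume": {"$sum": "$volume"},
    }}
]

for result in collection.aggregate(pipeline):
    print(result)
```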
Now, let’s see how you can move your time-series data into a pandas DataFrame to run some analytics operations on it.
MongoDB has built a tool just for this purpose, called PyMongoArrow. PyMongoArrow is a Python library that lets you move data in and out of MongoDB and into other data formats, such as pandas DataFrames, NumPy arrays, and Arrow Tables.
Let’s quickly install PyMongoArrow using the pip command in your terminal. We are assuming that you already have pandas installed on your system. If not, you can use the pip command to install it too.
```bash
pip install pymongoarrow
```
Now, let’s import all the necessary libraries. We are going to be using the same Jupyter Notebook to run the code below.
```python
import pandas as pd

# The pymongoarrow.monkey module provides an interface to patch PyMongo in place,
# adding PyMongoArrow's functionality directly to Collection instances.
from pymongoarrow.monkey import patch_all
patch_all()

# Let's use PyMongoArrow's find_pandas_all() function to read the MongoDB
# query result set into a pandas DataFrame
df = collection.find_pandas_all({})
```
Now, we have read all of our stock data stored in the “stocks” collection into a pandas DataFrame, ‘df’.
Let’s quickly print the value stored in the ‘df’ variable to verify it.
```python
print(df)

print(type(df))
```
Hurray… congratulations! As you can see, we have successfully read our MongoDB data into a pandas DataFrame.
Now, if you are a stock market trader, you would be interested in doing a lot of analysis on this data to get meaningful insights. But for this tutorial, we are just going to calculate the hourly percentage change in the closing price of the stock. This will help us understand the hour-to-hour price movements in terms of percentage gains or losses.
We will add a new column in our ‘df’ DataFrame called “daily_pct_change”.
```python
df = df.sort_values("timestamp")

df["daily_pct_change"] = df["close"].pct_change() * 100

# Print the DataFrame to see the modified data
print(df)
```
As you can see, we have successfully added a new column to our DataFrame.
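To sanity-check the math: the closing price falls from 53.50 in the first hour to 51.80 in the second, so the second row’s value is (51.80 − 53.50) / 53.50 × 100 ≈ −3.18%. Note that the first row will be NaN, since there is no earlier closing price to compare against.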
Now, we would like to persist the modified DataFrame data into a database so that we can run more analytics on it later. So, let’s write this data back to MongoDB using PyMongoArrow’s write function.
We will just create a new collection called “my_new_collection” in our database to write the modified DataFrame back into MongoDB, ensuring data persistence.
```python
from pymongoarrow.api import write

coll = db.my_new_collection

# Write the data from the pandas DataFrame into the "my_new_collection" collection
write(coll, df)

# Now, let's verify that the modified data has been written into our collection
print(coll.find_one({}))
```
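As mentioned earlier, PyMongoArrow is not limited to pandas. As a quick sketch (using the find_arrow_all() and find_numpy_all() methods that our earlier patch_all() call added to collection instances), you can read the same data into an Arrow Table or NumPy arrays instead:

```python
# Read the collection into an Apache Arrow Table
arrow_table = coll.find_arrow_all({})
print(arrow_table.schema)

# Read the collection into a dictionary of NumPy arrays (one array per field)
numpy_arrays = coll.find_numpy_all({})
print(numpy_arrays.keys())
```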
Congratulations on successfully completing this tutorial.
In this tutorial, we covered how to work with time-series data using MongoDB and Python. We learned how to store stock market data in a MongoDB time-series collection, and then how to perform simple analytics using a pandas DataFrame. We also explored how PyMongoArrow makes it easy to move data between MongoDB and pandas. Finally, we saved our analyzed data back into MongoDB. This guide provides a straightforward way to manage, analyze, and store time-series data. Great job if you’ve followed along — you’re now ready to handle time-series data in your own projects.
If you want to learn more about PyMongoArrow, check out some of these additional resources: