Analyze Time-Series Data with Python and MongoDB Using PyMongoArrow and Pandas
In today’s data-centric world, time-series data has become indispensable for driving key organizational decisions, trend analyses, and forecasts. This kind of data is everywhere — from stock markets and IoT sensors to user behavior analytics. But as these datasets grow in volume and complexity, so does the challenge of efficiently storing and analyzing them. Whether you’re an IoT developer or a data analyst dealing with time-sensitive information, MongoDB offers a robust ecosystem tailored to meet both your storage and analytics needs for complex time-series data.
MongoDB has built-in support for storing time-series data in a special type of collection called a time-series collection. Time-series collections differ from normal collections: they use an underlying columnar storage format and store data in time order with an automatically created clustered index. The columnar storage format provides the following benefits:
- Reduced complexity: The columnar format is tailored for time-series data, making it easier to manage and query.
- Query efficiency: MongoDB automatically creates an internal clustered index on the time field which improves query performance.
- Disk usage: This storage approach uses disk space more efficiently compared to traditional collections.
- I/O optimization: The read operations require fewer input/output operations, improving the overall system performance.
- Cache usage: The design allows for better utilization of the WiredTiger cache, further enhancing query performance.
In this tutorial, we will create a time-series collection and store some time-series data in it. We will see how you can query it in MongoDB, as well as how you can read that data into a pandas DataFrame, run some analytics on it, and write the modified data back to MongoDB. This tutorial is meant to be a complete deep dive into working with time-series data in MongoDB.
We will be using the following tools/frameworks:
- MongoDB Atlas database, to store our time-series data. If you don’t already have an Atlas cluster created, go ahead and create one, set up a user, and add your connection IP address to your IP access list.
Note: Before running any code or installing any Python packages, we strongly recommend setting up a separate Python environment. This helps to isolate dependencies, manage packages, and avoid conflicts that may arise from different package versions. Creating an environment is an optional but highly recommended step.
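If you’d like to set one up, here is a minimal sketch using Python’s built-in venv module (the environment name is arbitrary, and the activate command shown is for macOS/Linux):

```bash
# Create and activate a virtual environment
python3 -m venv timeseries-env
source timeseries-env/bin/activate

# Install the packages used in this tutorial
pip install pymongo jupyter
```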
At this point, we are assuming that you have an Atlas cluster created and ready to be used, and PyMongo and Jupyter Notebook installed. Let’s go ahead and launch Jupyter Notebook by running the following command in the terminal:
```bash
jupyter notebook
```
Once you have the Jupyter Notebook up and running, let’s go ahead and fetch the connection string of your MongoDB Atlas cluster and store that as an environment variable, which we will use later to connect to our database. After you have done that, let’s go ahead and connect to our Atlas cluster by running the following commands:
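For example, on macOS or Linux you can set the environment variable in the same terminal session from which you launch Jupyter (the URI below is a placeholder; substitute your own cluster’s connection string):

```bash
export MONGODB_CONNECTION_STRING="mongodb+srv://<username>:<password>@<cluster-url>/"
```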
```python
import os

from pymongo import MongoClient

MONGO_CONN_STRING = os.environ.get("MONGODB_CONNECTION_STRING")

client = MongoClient(MONGO_CONN_STRING)
```
Next, we are going to create a new database and a collection in our cluster to store the time-series data. We will call this database “stock_data” and the collection “stocks”.
```python
# Let's create a new database called "stock_data"
db = client.stock_data

# Let's create a new time-series collection in the "stock_data" database called "stocks"
collection = db.create_collection("stocks", timeseries={
    "timeField": "timestamp",
    "metaField": "metadata",
    "granularity": "hours"
})
```
Here, we used the db.create_collection() method to create a time-series collection called “stocks”. In the example above, “timeField”, “metaField”, and “granularity” are reserved fields (for more information on what these are, visit our documentation). The “timeField” option specifies the name of the field in your collection that will contain the date in each time-series document.
The “metaField” option specifies the name of the field in your collection that will contain the metadata in each time-series document.
Finally, the “granularity” option specifies how frequently data will be ingested in your time-series collection.
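If you’d like to confirm how the collection was configured, one quick sketch (using PyMongo’s standard list_collections() method) is to inspect the collection’s stored options:

```python
# Look up the metadata for the "stocks" collection and print its options,
# which should include the time-series configuration we passed in above
for coll_info in db.list_collections(filter={"name": "stocks"}):
    print(coll_info["options"])
```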
Now, let’s insert some stock-related information into our collection. We are interested in storing and analyzing the stock of a specific company called “ABC”, which trades its stock on “NASDAQ”.
We are storing some price metrics of this stock at an hourly interval and for each time interval, we are storing the following information:
- open: the opening price at which the stock traded when the market opened
- close: the final price at which the stock traded when the trading period ended
- high: the highest price at which the stock traded during the trading period
- low: the lowest price at which the stock traded during the trading period
- volume: the total number of shares traded during the trading period
Now that we have become experts on stock trading and terminology (sarcasm), let’s insert some documents into our time-series collection. Here we have four sample documents, with data points captured at an interval of one hour.
```python
from datetime import datetime

# Create some sample data
data = [
    {
        "metadata": {
            "stockSymbol": "ABC",
            "exchange": "NASDAQ"
        },
        "timestamp": datetime(2023, 9, 12, 15, 19, 48),
        "open": 54.80,
        "high": 59.20,
        "low": 52.60,
        "close": 53.50,
        "volume": 18000
    },
    {
        "metadata": {
            "stockSymbol": "ABC",
            "exchange": "NASDAQ"
        },
        "timestamp": datetime(2023, 9, 12, 16, 19, 48),
        "open": 51.00,
        "high": 54.30,
        "low": 50.50,
        "close": 51.80,
        "volume": 12000
    },
    {
        "metadata": {
            "stockSymbol": "ABC",
            "exchange": "NASDAQ"
        },
        "timestamp": datetime(2023, 9, 12, 17, 19, 48),
        "open": 52.00,
        "high": 53.10,
        "low": 50.50,
        "close": 52.90,
        "volume": 10000
    },
    {
        "metadata": {
            "stockSymbol": "ABC",
            "exchange": "NASDAQ"
        },
        "timestamp": datetime(2023, 9, 12, 18, 19, 48),
        "open": 52.80,
        "high": 60.20,
        "low": 52.60,
        "close": 55.50,
        "volume": 30000
    }
]

# Insert the data into our collection
collection.insert_many(data)
```
Now, let’s run a find query on our collection to retrieve data at a specific timestamp. Run this query in the Jupyter Notebook after the previous script.
```python
collection.find_one({"timestamp": datetime(2023, 9, 12, 15, 19, 48)})
```
As you can see from the output, we were able to query our time-series collection and retrieve data points at a specific timestamp.
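You can also filter on the metadata field or on a time range. For example, here is a minimal sketch (using the same field names as above) that retrieves all of the “ABC” data points within our one-day window and prints their closing prices:

```python
from datetime import datetime

cursor = collection.find({
    "metadata.stockSymbol": "ABC",
    "timestamp": {
        "$gte": datetime(2023, 9, 12, 15, 0, 0),
        "$lte": datetime(2023, 9, 12, 19, 0, 0),
    }
})

for doc in cursor:
    print(doc["timestamp"], doc["close"])
```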
Similarly, you can run more powerful queries on your time-series collection by using the aggregation pipeline. Covering it fully is beyond the scope of this tutorial, but if you want to learn more, check out the MongoDB aggregation pipeline documentation.
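Just to give a flavor of what that looks like, here is a minimal sketch of a single-stage pipeline that computes the average closing price and total volume per stock symbol (this is a standard $group stage, not something specific to time-series collections):

```python
pipeline = [
    {"$group": {
        "_id": "$metadata.stockSymbol",
        "avgClose": {"$avg": "$close"},
        "totalVolume": {"$sum": "$volume"},
    }}
]

for result in collection.aggregate(pipeline):
    print(result)
```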
Now, let’s see how you can move your time-series data into a pandas DataFrame to run some analytics operations on it.
MongoDB has built a tool just for this purpose, called PyMongoArrow. PyMongoArrow is a Python library that lets you move data in and out of MongoDB and into other data formats, such as pandas DataFrames, NumPy arrays, and Arrow Tables.
Let’s quickly install PyMongoArrow using the pip command in your terminal. We are assuming that you already have pandas installed on your system. If not, you can use the pip command to install it too.
```bash
pip install pymongoarrow
```
Now, let’s import all the necessary libraries. We are going to be using the same Jupyter Notebook to run the code below.
```python
import pandas as pd

# The pymongoarrow.monkey module provides an interface to patch PyMongo in place,
# adding PyMongoArrow's functionality directly to Collection instances.
from pymongoarrow.monkey import patch_all
patch_all()

# Let's use PyMongoArrow's find_pandas_all() function to read the MongoDB
# query result set into a pandas DataFrame
df = collection.find_pandas_all({})
```
Now, we have read all of our stock data stored in the “stocks” collection into a pandas DataFrame, ‘df’.
Let’s quickly print the value stored in the ‘df’ variable to verify it.
```python
print(df)

print(type(df))
```
Hurray… congratulations! As you can see, we have successfully read our MongoDB data into a pandas DataFrame.
Now, if you are a stock market trader, you would be interested in doing a lot of analysis on this data to get meaningful insights. But for this tutorial, we are just going to calculate the hourly percentage change in the closing price of the stock. This will help us understand the hour-to-hour price movements in terms of percentage gains or losses.
We will add a new column in our ‘df’ DataFrame called “daily_pct_change”.
```python
df = df.sort_values("timestamp")

df["daily_pct_change"] = df["close"].pct_change() * 100

# Print the DataFrame to see the modified data
print(df)
```
As you can see, we have successfully added a new column to our DataFrame.
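To sanity-check the math: the closing price falls from 53.50 in the first hour to 51.80 in the second, so the second row’s value is (51.80 − 53.50) / 53.50 × 100 ≈ −3.18%. Note that the first row will be NaN, since there is no earlier closing price to compare against.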
Now, we would like to persist the modified DataFrame data into a database so that we can run more analytics on it later. So, let’s write this data back to MongoDB using PyMongoArrow’s write function.
We will just create a new collection called “my_new_collection” in our database to write the modified DataFrame back into MongoDB, ensuring data persistence.
```python
from pymongoarrow.api import write

coll = db.my_new_collection

# Write the data from the pandas DataFrame into the "my_new_collection" collection
write(coll, df)

# Now, let's verify that the modified data has been written into our collection
print(coll.find_one({}))
```
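As mentioned earlier, PyMongoArrow is not limited to pandas. As a quick sketch (using the find_arrow_all() and find_numpy_all() methods that our earlier patch_all() call added to collection instances), you can read the same data into an Arrow Table or NumPy arrays instead:

```python
# Read the collection into an Apache Arrow Table
arrow_table = coll.find_arrow_all({})
print(arrow_table.schema)

# Read the collection into a dictionary of NumPy arrays (one array per field)
numpy_arrays = coll.find_numpy_all({})
print(numpy_arrays.keys())
```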
Congratulations on successfully completing this tutorial.
In this tutorial, we covered how to work with time-series data using MongoDB and Python. We learned how to store stock market data in a MongoDB time-series collection, and then how to perform simple analytics using a pandas DataFrame. We also explored how PyMongoArrow makes it easy to move data between MongoDB and pandas. Finally, we saved our analyzed data back into MongoDB. This guide provides a straightforward way to manage, analyze, and store time-series data. Great job if you’ve followed along — you’re now ready to handle time-series data in your own projects.
If you want to learn more about PyMongoArrow, check out some of these additional resources: