Time Series


Transforming Predictive Maintenance with AI: Real-Time Audio-Based Diagnostics with Atlas Vector Search

Wind turbines are a critical component in the shift away from fossil fuels toward more sustainable, green sources of energy. According to the International Energy Agency (IEA), the global capacity of wind energy has been growing rapidly, reaching over 743 gigawatts. Wind energy, in particular, has one of the greatest potentials to increase countries' renewable capacity growth: solar PV and wind additions are forecast to more than double by 2028 compared with 2022, continuously breaking records over the forecast period. This growth highlights the increasing reliance on wind power and, consequently, the need for effective maintenance strategies.

Keeping wind turbines operating at maximum capacity is essential to ensuring their continued contribution to the energy grid. Like any mechanical device, wind turbines must undergo periodic maintenance to keep them operating at optimal levels. In recent years, advancements in technology—particularly in AI and machine learning—have played a significant role by introducing predictive maintenance breakthroughs to industrial processes like periodic maintenance. By integrating AI into renewable energy systems, organizations of all sizes can reduce costs and gain efficiencies.

In this post, we will dig into an AI application use case for real-time anomaly detection through sound input, showcasing the impact of AI and MongoDB Atlas Vector Search for predictive maintenance of wind turbines. Check out our AI resource page to learn more about building AI-powered apps with MongoDB.

Predictive Maintenance in Modern Industries

Companies increasingly invest in predictive maintenance to optimize their operations and drive efficiency. Research from Deloitte indicates that predictive maintenance can reduce equipment downtime by 5–15 percent, increase labor productivity by 5–20 percent, and reduce overall new equipment costs by 3–5 percent. This helps organizations maximize their investment in equipment and infrastructure. By implementing predictive maintenance strategies, companies can anticipate equipment failures before they occur, ultimately resulting in longer equipment lifetimes, tighter budget control, and higher overall throughput. More concretely, businesses aim for reduced mean time to repair, optimal ordering of replacement parts, efficient people management, and lower overall maintenance costs. Leveraging data interoperability, real-time analysis, modeling and simulation, and machine learning techniques, predictive maintenance enables companies to thrive in today's competitive landscape.

However, despite its immense potential, predictive maintenance also presents significant challenges. One major hurdle is the consolidation of heterogeneous data, as predictive maintenance systems often need to bring together data from formats and sources that can be difficult to integrate. Scalability also becomes a concern when dealing with the high volumes of IoT signals generated by numerous devices and sensors. And lastly, managing and analyzing this vast amount of data in real time poses challenges that must be overcome to realize the full benefits of predictive maintenance initiatives. At its core, predictive maintenance begins with real-time diagnostics, enabling proactive identification and mitigation of potential equipment failures.
Figure 1: Predictive Maintenance starts with real-time diagnostics

However, while AI has been employed for real-time diagnostics for some time, the main challenge has been acquiring and utilizing the necessary data for training AI models. Traditional methods have struggled with incorporating unstructured data into these models effectively. Enter gen AI and vector search technologies, positioned to revolutionize this landscape. Flexible data platforms working together with AI algorithms can help generate insights from diverse data types, including images, video, audio, geospatial data, and more, paving the way for more robust and efficient maintenance strategies. In this context, MongoDB Atlas Vector Search stands out as a foundational element for effective and efficient gen AI-powered predictive maintenance models.

Why MongoDB and Atlas Vector Search?

For several reasons, MongoDB stands out as the preferred database solution for modern applications.

Figure 2: MongoDB Atlas Developer Data Platform

Document data model

One of the reasons why the document model is well-suited to the needs of modern applications is its ability to store diverse data types in BSON (Binary JSON) format, ranging from structured to unstructured. This flexibility essentially eliminates the middle layer necessary to convert to a SQL-like format, resulting in easier-to-maintain applications, lower development times, and faster response to changes.

Time series collections

MongoDB excels in handling time-series data generated by edge devices, IoT sensors, PLCs, SCADA systems, and more. With dedicated time-series collections, MongoDB provides efficient storage and retrieval of time-stamped data, enabling real-time monitoring and analysis.

Real-time data processing and aggregation

MongoDB's adeptness in real-time data processing is crucial for immediate diagnostics and responses, ensuring timely interventions to prevent costly repairs and downtime. Its powerful aggregation capabilities facilitate the synthesis of data from multiple sources, providing comprehensive insights into fleet-wide performance trends.

Developer data platform

Beyond just storing data, MongoDB Atlas is a multi-cloud developer data platform, providing the flexibility required to build a diverse range of applications. Atlas includes features like transactional processing, text-based search, vector search, in-app analytics, and more through an elegant and integrated suite of data services. It offers developers a top-tier experience through a unified query interface, all while meeting the most demanding requirements for resilience, scalability, and cybersecurity.

Atlas Vector Search

Among the out-of-the-box features offered by MongoDB Atlas, Atlas Vector Search stands out, enabling the search of unstructured data effortlessly. You can generate vector embeddings with machine learning models like the ones found in OpenAI or Hugging Face, and store and index them in Atlas. This feature facilitates the indexing of vector representations of objects and retrieves those that are semantically most similar to your query. Explore the capabilities of Atlas Vector Search. This functionality is especially interesting for unstructured data that was previously hard to leverage, such as text, images, and audio, allowing searches that combine audio, video, metadata, production equipment data, or sensor measurements to provide an answer to a query.
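To make the building blocks above concrete, here is a minimal mongosh sketch (not taken from the demo itself) that creates a time series collection for raw turbine telemetry and defines an Atlas Vector Search index over a hypothetical embedding field holding audio embeddings. The collection names, field names, and the 2048-dimension figure are illustrative assumptions, and the index could equally be defined through the Atlas UI.

  // Time series collection for raw turbine telemetry (names are illustrative)
  db.createCollection("turbine_telemetry", {
    timeseries: {
      timeField: "ts",         // timestamp of each measurement
      metaField: "turbine",    // per-turbine metadata used for grouping
      granularity: "seconds"
    }
  })

  // Separate collection holding reference audio clips and their embeddings
  db.audio_embeddings.insertOne({
    label: "normal_operation",
    embedding: [ /* vector produced by an embedding model, e.g. 2048 floats */ ],
    recordedAt: new Date()
  })

  // Vector Search index over the embedding field (Atlas, recent mongosh)
  db.audio_embeddings.createSearchIndex(
    "audio_vector_index",
    "vectorSearch",
    { fields: [ { type: "vector", path: "embedding", numDimensions: 2048, similarity: "cosine" } ] }
  )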
Let's delve into how simple it is to leverage AI to significantly enhance the sophistication of predictive maintenance models with MongoDB Atlas.

Real-time audio-based diagnostics with Atlas Vector Search

In our demonstration, we'll showcase real-time audio-based diagnostics applied to a wind turbine. It's important to note that while we focus on wind turbines here, the concept can be extrapolated to any machine, vehicle, or device emitting sound. To illustrate this concept, we'll utilize a handheld fan as our makeshift wind turbine. Wind turbines emit different sounds depending on their operational status. By continuously monitoring the turbine’s audio, our system can accurately specify the current operational status of the equipment and reduce the risk of unexpected breakdowns. Early detection of potential issues allows for enhanced operational efficiency, minimizing the time and resources required for manual inspections. Additionally, timely identification can prevent costly repairs and reduce overall turbine downtime, thus enhancing cost-effectiveness. Now, let’s have a look at how this demo works!

Figure 3: Application Architecture

Audio preparation

We begin by capturing the audio from the equipment in different situations (normal operation, high vs. low load, equipment obstructed, not operating, etc.). Once each sound is collected, we use an embedding model to process the audio data and convert it to a vector. This step is crucial: by generating embeddings for each audio track, which are high-dimensional vector representations, we are essentially capturing the unique characteristics of the sound. We then upload these vector embeddings to MongoDB Atlas. By adding just a few examples of sounds to our database, they are ready to be searched (and essentially compared) with the sound emitted by our equipment during its operation in real time.

Audio-based diagnosis

Now, we put our equipment into normal operation and start capturing the sound it is making in real time. In this demonstration, we capture one-second clips of audio. Then, with the same embedding model used before, we take our audio clips and convert them to vector embeddings in real time. This process happens in milliseconds, allowing us to have real-time monitoring of the audio. The one-second audio clips, now converted to vector embeddings, are then sent to MongoDB Atlas Vector Search, which can search for and find the most similar vectors from the ones we previously recorded in our audio preparation phase. The result is returned with a percentage of similarity, enabling a very accurate prediction of the current status of the operation of the wind turbine. These steps are performed repeatedly every second, leveraging fast embedding of vectors and quick searches, allowing for real-time monitoring based on sound.
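As an illustration of the diagnosis step, the query below is a minimal sketch (not from the demo itself) of how a freshly embedded one-second clip might be compared against the reference sounds with Atlas Vector Search. The index and field names follow the hypothetical example shown earlier, and the similarity score is surfaced via the vectorSearchScore metadata.

  // clipEmbedding would come from the same embedding model used during audio preparation
  const clipEmbedding = [ /* embedding of the latest one-second clip */ ];

  db.audio_embeddings.aggregate([
    {
      $vectorSearch: {
        index: "audio_vector_index",   // vector index defined earlier (assumed name)
        path: "embedding",
        queryVector: clipEmbedding,
        numCandidates: 100,            // candidates to consider before ranking
        limit: 3                       // return the three closest reference sounds
      }
    },
    {
      $project: {
        _id: 0,
        label: 1,                              // e.g. "normal_operation", "obstructed"
        score: { $meta: "vectorSearchScore" }  // similarity used to pick the best match
      }
    }
  ])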
Check out the video below to see it in action!

Transforming Predictive Maintenance with AI and MongoDB

Predictive maintenance offers substantial benefits but poses challenges like data integration and scalability. MongoDB stands out as a preferred database solution, offering scalability, flexibility, and real-time data processing. As technology advances, AI integration promises to further revolutionize the industry. Thank you to Ralph Johnson and Han Heloir for their valuable contributions to this demo! Head over to our quick-start guide to get started with Atlas Vector Search today.

Explore how MongoDB empowers manufacturing operations by visiting these resources:

Generative AI in Predictive Maintenance Applications
Transforming Industries with MongoDB and AI: Manufacturing and Motion
MongoDB for Automotive: Driving Innovation from Factory to Finish Line

May 28, 2024

Atlas Stream Processing is Now Generally Available!

We're thrilled to announce that Atlas Stream Processing—the MongoDB-native way to process streaming data—is now generally available, empowering developers to quickly build responsive, event-driven applications!

Our team spent the last two years defining a vision and building a product that leans into MongoDB’s strengths to overcome the hard challenges in stream processing. After a decade of building stream processing products outside of MongoDB, we are using everything that makes MongoDB unique and differentiated—the Query API and powerful aggregation framework, as well as the document model and its schema flexibility—to create an awesome developer experience. It’s a new approach to stream processing, and based on the feedback of so many of you in our community, it’s the best way for most developers using MongoDB to do it. Let’s get into what’s new.

This post is also available in: Deutsch, Français, Español, Português, Italiano, 한국어, 简体中文.

What's new in general availability?

Production Readiness: Ready to support your production workloads, ensuring reliable and scalable stream processing for your mission-critical applications.

Time Series Collection Support: Emit processor results into time series collections. Pre-process data continuously while saving it for historical access later in a collection type available in MongoDB Atlas built to efficiently store and query time series data. (A short sketch of this appears after this list.)

Development and Production Tiers: Besides the SP30 cluster tier available during the public preview, we’re introducing an SP10 tier to provide flexibility and a cost-effective option for exploratory use cases and low-traffic stream processing workloads.

Improved Kafka Support: Added support for Kafka headers allows applications to provide additional metadata alongside event data. Headers are helpful for various stream processing use cases (e.g., routing messages, conditional processing, and more).

Least Privilege Access: Atlas Database Users can be granted access to Stream Processing Instances, enabling access only for those who need it. Read our tutorial for more information.

Stream Processor Alerting: Gain insight and visibility into the health of your stream processors by creating alerts for when a failure occurs. Supported methods for alerting include email, SMS, monitoring platforms like Datadog, and more.
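Because the time series collection support above is about wiring stream processors to collections, here is a minimal, hypothetical sketch of what that can look like from mongosh connected to a Stream Processing Instance. The connection, topic, database, and collection names are assumptions for illustration, and the target collection is presumed to have been created as a time series collection in Atlas.

  // Read sensor events from a Kafka topic and continuously merge them into Atlas
  const pipeline = [
    {
      $source: {
        connectionName: "kafkaProd",        // Kafka connection in the instance's connection registry (assumed name)
        topic: "sensor-events"
      }
    },
    {
      $merge: {
        into: {
          connectionName: "atlasCluster",   // Atlas connection in the connection registry (assumed name)
          db: "factory",
          coll: "sensor_readings"           // assumed to be a time series collection
        }
      }
    }
  ];

  // Create a named stream processor and start it
  sp.createStreamProcessor("sensorsToTimeSeries", pipeline);
  sp.sensorsToTimeSeries.start();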
Why Atlas Stream Processing?

Atlas Stream Processing brings the power and flexibility of MongoDB's document model and Query API to the challenging stream processing space. With Atlas Stream Processing, developers can:

Effortlessly handle complex and rapidly changing data structures
Use the familiar MongoDB Query API for processing streaming data
Seamlessly integrate with MongoDB Atlas
Benefit from a fully managed service that eliminates operational overhead

Customer highlights

Read what developers are saying about Atlas Stream Processing:

"At Acoustic, our key focus is to empower brands with behavioral insights that enable them to create engaging, personalized customer experiences. To do so, our Acoustic Connect platform must be able to efficiently process and manage millions of marketing, behavioral, and customer signals as they occur. With Atlas Stream Processing, our engineers can leverage the skills they already have from working with data in Atlas to process new data continuously, ensuring our customers have access to real-time customer insights."

John Riewerts, EVP, Engineering at Acoustic

"Atlas Stream Processing enables us to process, validate, and transform data before sending it to our messaging architecture in AWS powering event-driven updates throughout our platform. The reliability and performance of Atlas Stream Processing has increased our productivity, improved developer experience, and reduced infrastructure cost."

Cody Perry, Software Engineer, Meltwater

What's ahead for Atlas Stream Processing?

We’re rapidly introducing new features and functionality to ensure MongoDB delivers a world-class stream processing experience for all development teams. Over the next few months, you can expect to see:

Advanced Networking Support: Support for VPC Peering to Kafka Clusters for teams requiring additional networking capabilities

Expanded Cloud Region Support: Support for all cloud regions available in Atlas Data Federation

Expanded Cloud Provider Support: Support for Microsoft Azure

Expanded Data Source and Sink Support: We have plans to expand beyond Kafka and Atlas databases in the coming months. Let us know which sources and sinks you need, and we will factor that into our planning

Richer Metrics & Observability: Support for expanded visibility into your stream processors to help simplify monitoring and troubleshooting

Expanded Deployment Flexibility: Support for deploying stream processors with Terraform. This integration will help to enable a seamless CI/CD pipeline, enhancing operational efficiency with infrastructure as code. Look out for a dedicated blog in the near future on how to get started with Atlas Stream Processing and Terraform.

So whether you're looking to process high-velocity sensor data, continuously analyze customer data to deliver personalized experiences, or perform predictive maintenance to increase yields and reduce costs, Atlas Stream Processing has you covered. Join the hundreds of development teams already building with Atlas Stream Processing. Stay tuned to hear more from us soon, and good luck building! Log in today or check out our introductory tutorial to get started.

May 2, 2024

Transforming Industries with MongoDB and AI: Manufacturing and Motion

This is the first in a six-part series focusing on critical AI use cases across several industries. The series covers the manufacturing and motion, financial services, retail, telecommunications and media, insurance, and healthcare industries.

The integration of artificial intelligence (AI) within the manufacturing and automotive industries has transformed the conventional value chain, presenting a spectrum of opportunities. Leveraging Industrial IoT, companies now collect extensive data from assets, paving the way for analytical insights and unlocking novel AI use cases, including enhanced inventory management and predictive maintenance.

Inventory management

Efficient supply chains can control operational costs and ensure on-time delivery to their customers. Inventory optimization and management is a key component in achieving these goals. Managing and optimizing inventory levels, planning for fluctuations in demand, and, of course, cutting costs are all imperative goals. However, efficient inventory management for manufacturers presents complex data challenges too, primarily in forecasting demand accurately and optimizing stock levels. This is where AI can help.

Figure 1: Gen AI-enabled demand forecasting with MongoDB Atlas

AI algorithms can be used to analyze complex datasets to predict future demand for products or parts. Improvement in demand forecasting accuracy is crucial for maintaining optimal inventory levels. AI-based time series forecasting can assist in adapting to rapid changes in customer demand. Once the demand is known, AI can play a pivotal role in stock optimization. By analyzing historical sales data and market trends, manufacturers can determine the most efficient stock levels and even reduce human error. On top of all this existing potential, generative AI can help with generating synthetic inventory data and seasonally adjusted demand patterns. It can also help with creating scenarios to simulate supply chain disruptions.

MongoDB Atlas makes this process simple. At the warehouse, the inventory can be scanned using a mobile device. This data is persisted in Atlas Device SDK and synced with Atlas using Device Sync, which is used by MongoDB customers like Grainger. Atlas Device Sync provides an offline-first seamless mobile experience for inventory tracking, making sure that inventory data is always accurate in Atlas. Once data is in Atlas, it can serve as the central repository for all inventory-related data. This repository becomes the source of data for inventory management AI applications, eliminating data silos and improving visibility into overall inventory levels and movements. Using Atlas Vector Search and generative AI, manufacturers can easily categorize products based on their seasonal attributes, cluster products with similar seasonal demand patterns, and provide context to the foundation model to improve the accuracy of synthetic inventory data generation.
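As a sketch of how that product-clustering step might look, the query below (not from the article; the collection, field names, and 256-dimension embedding are assumptions) uses Atlas Vector Search to find products whose seasonal demand embedding is closest to that of a given item, which could then be grouped together or supplied as context to a foundation model.

  // Hypothetical "products" collection with a "seasonalityEmbedding" vector per product,
  // indexed by an Atlas Vector Search index named "seasonality_index"
  const reference = db.products.findOne({ sku: "FAN-1234" });  // item to cluster around

  db.products.aggregate([
    {
      $vectorSearch: {
        index: "seasonality_index",
        path: "seasonalityEmbedding",
        queryVector: reference.seasonalityEmbedding,
        numCandidates: 200,
        limit: 10                      // ten products with the most similar seasonal pattern
      }
    },
    { $project: { _id: 0, sku: 1, name: 1, score: { $meta: "vectorSearchScore" } } }
  ])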
Predictive maintenance

The most basic approach to maintenance today is reactive — assets are deliberately allowed to operate until failures actually occur. The assets are maintained as needed, making it challenging to anticipate repairs. Preventive maintenance, by contrast, allows systems or components to be replaced based on a conservative schedule to prevent commonly occurring failures — although preventive maintenance is expensive to implement due to the frequent replacement of parts before end-of-life.

Figure 2: Audio-based anomaly detection with MongoDB Atlas. Scan the QR code to try it out yourself.

AI offers a chance to implement predictive maintenance efficiently, using data collected from IoT sensors on machinery to train models to detect anomalies. ML/AI algorithms like regression models or decision trees are trained on the preprocessed data, deployed on-site for inference, and used to continuously analyze sensor data. When anomalies are detected, alerts are generated to notify maintenance personnel, enabling proactive planning and execution of maintenance actions to minimize downtime and optimize equipment reliability and performance. A retrieval-augmented generation (RAG) architecture can be deployed to generate or curate the data preprocessor, removing the need for specialized data science knowledge; the domain expert can provide the right prompts for the large language model. Once a maintenance alert is generated by an AI model, generative AI can come in again to suggest a repair strategy, taking spare parts inventory data, the maintenance budget, and personnel availability into consideration. Finally, the repair manuals can be vectorized and used to power a chatbot application that guides the technician in performing the actual repair.

MongoDB documents are inherently flexible while allowing data governance when required. Since machine health prediction models require not just sensor data but also maintenance history and inventory data, the document model is a perfect fit for modeling such disparate data sources. During the maintenance and support process of a physical product, information such as product details and replacement parts documentation must be available and easily accessible to support staff. Full-text search capabilities provided by Atlas Search can be integrated with the support portal to help staff retrieve information from Atlas clusters with ease. Atlas Vector Search is a foundational element for effective and efficient predictive maintenance models. Manufacturers can use MongoDB Atlas to explore ways of simplifying machine diagnostics: audio files can be recorded from machines, then vectorized and searched to retrieve similar cases. Once the cause is identified, they can use RAG to implement a chatbot interface that the technician can interact with to get context-aware, step-by-step guidance on how to perform the repair.
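To illustrate why the document model fits these disparate sources, here is a small, hypothetical document (field names invented for illustration) that keeps a machine's latest sensor snapshot, its maintenance history, and linked spare-part inventory together in one place:

  db.machine_health.insertOne({
    machineId: "press-17",
    site: "plant-A",
    latestReadings: {                       // most recent sensor snapshot
      ts: new Date(),
      vibrationMmS: 4.2,
      bearingTempC: 78.5,
      acousticAnomalyScore: 0.91            // output of the audio-based model
    },
    maintenanceHistory: [                   // embedded history, no joins needed
      { date: new Date("2024-01-12"), action: "bearing replacement", technician: "R. Ortiz" },
      { date: new Date("2023-07-03"), action: "lubrication", technician: "S. Lee" }
    ],
    spareParts: [                           // references into an inventory collection
      { partId: "BRG-6204", inStock: 3 },
      { partId: "BELT-A42", inStock: 0 }
    ]
  })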
Autonomous driving

With the rise of connected vehicles, automotive manufacturers have been compelled to transform their business models into software-first organizations. The data generated by connected vehicles is used to create better driver assistance systems, paving the way for autonomous driving applications. However, it is challenging to create fully autonomous vehicles that can drive more safely than humans. Some experts estimate that the technology to achieve level 5 autonomy is about 80% developed — but the remaining 20% will be extremely hard to achieve and will take a lot of time to perfect.

Figure 3: MongoDB Atlas’s Role in Autonomous Driving

AI-based image and object recognition in automotive applications faces uncertainties, so manufacturers must utilize data from radar, LiDAR, cameras, and vehicle telemetry to improve AI model training. Modern vehicles act as data powerhouses, constantly gathering and processing information from onboard sensors and cameras, generating significant big data. Robust storage and analysis capabilities are essential to manage this data, while real-time analysis is crucial for making instantaneous decisions to ensure safe navigation. MongoDB can play a significant role in addressing these challenges.

The document model is an excellent way to accommodate diverse data types such as sensor readings, telematics, maps, and model results. New fields can be added to documents at run time, enabling developers to easily add context to the raw telemetry data. MongoDB’s ability to handle large volumes of unstructured data makes it suitable for the constant influx of vehicle-generated information. Atlas Search provides a performant search engine that allows data scientists to iterate on their perception AI models. Finally, Atlas Device Sync can be used to send configuration updates to the vehicle's advanced driver assistance system.

Other notable use cases

AI plays a critical role in fulfilling the promise of Industry 4.0. Numerous other AI use cases can be enabled by MongoDB Atlas, some of which include:

Logistics Optimization: AI can help optimize routes, resulting in reduced delays and enhanced efficiency in day-to-day delivery operations.

Quality Control and Defect Detection: Computer or machine vision can be used to identify irregularities in products as they are manufactured. This ensures that product standards are met with precision.

Production Optimization: By analyzing time series data from sensors installed on production lines, waste can be identified and reduced, thereby improving throughput and efficiency.

Smart After-Sales Support: Manufacturers can utilize AI-driven chatbots and predictive analytics to offer proactive maintenance, troubleshooting, and personalized assistance to customers.

Personalized Product Recommendations: AI can be used to analyze user behavior and preferences to deliver personalized product recommendations via a mobile or web app, enhancing customer satisfaction and driving sales.

The integration of AI in the manufacturing and automotive industries has revolutionized traditional processes, offering a plethora of opportunities for efficiency and innovation. With industrial IoT and advanced analytics, companies can now harness vast amounts of data to enhance inventory management and predictive maintenance. AI-driven demand forecasting ensures optimal stock levels, while predictive maintenance techniques minimize downtime and optimize equipment performance. Moreover, as automotive manufacturers work toward autonomous driving, AI-powered image recognition and real-time data analysis become paramount. MongoDB Atlas emerges as a pivotal solution, providing flexible document modeling and robust storage capabilities to handle the complexities of Industry 4.0. Beyond the manufacturing and automotive sectors, the potential of AI enabled by MongoDB Atlas extends to logistics optimization, quality control, production efficiency, smart after-sales support, and personalized customer experiences, shaping the future of Industry 4.0 and beyond.

Learn more about AI use cases for top industries in our new white paper, “How Leading Industries are Transforming with AI and MongoDB Atlas.” Head over to our quick-start guide to get started with Atlas Vector Search today.

March 19, 2024

3 Reasons (and 2 Ways) to Use MongoDB’s Improved Time Series Collections

Time series data, which reflects measurements taken at regular time intervals, plays a critical role in a wide variety of use cases for a diverse range of industries. For example, park management agencies can use time series data to examine attendance at public parks to better understand peak times and schedule services accordingly. Retail companies, such as Walmart, depend on it to analyze consumer spending patterns down to the minute, to better predict demand and improve shift scheduling, hiring, warehousing, and other logistics. As more sensors and devices are added to networks, time series data and its associated tools have become more important. In this article, we’ll look at three reasons (and two ways) to use MongoDB time series collections in your stack. This in-depth introduction to time series data features MongoDB Product Manager Michael Gargiulo.

Reason 1: Purpose-built for the challenges of time series data

At first glance, time series collections resemble other collections within MongoDB, with similar functionalities and usage. Beneath the surface, however, they are specifically designed for storing, sorting, and working with time series data. For developers, query speed and data accessibility continue to be challenges associated with time series data. Because of how quickly time series data can accumulate, it must be organized and sorted in a logical way to ensure that queries and their associated operations can run smoothly and quickly.

To address this issue, time series collections implement a key tenet of the MongoDB developer data platform: Data that is stored together is accessed together. Documents (the basic building block of MongoDB data) are grouped into buckets, which are organized by time. Each bucket contains time series data from a variety of sources — all of which were gathered from the same time period and all of which are likely to show up on the same queries. For example, if you are using time series collections to analyze the rise in summer temperatures of Valencia, Spain from 1980 to 2020, then one bucket will contain temperatures for August 1991. Relevant, but distinct buckets (such as temperatures for the months of June and July 1991) would also be stored on the same page for faster, easier access.

MongoDB also lets you create compound indexes on any measurement field in the bucket (whether it’s timeField or metaField) for faster, more flexible queries. Because of the wide variety of indexing options, operations on time series data can be executed much more quickly than with competing products. For example, scan times are reduced by indexing buckets of documents (each of which has a unique identifier) rather than individual documents. In terms of the previous example, you could create an index on the minimum and maximum average summer temperatures in Valencia, Spain from 1980 to 2020 to more quickly surface necessary data. That way, MongoDB does not have to scan the entire dataset to find min and max values over a period of nearly four decades.

Another concern for developers is finding the last metadata value, which in other solutions requires users to scan the entire data set — a time-consuming process. Instead, time series collections use last point queries, where MongoDB simply retrieves the last measurement for each metadata value. As with other fields, users can also create indexes for last points in their data. In our example, you could create an index to identify the end-of-summer temperatures in Valencia from 1980 to 2020. By indexing the last values, time series collections can drastically reduce query times.
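As a rough sketch of these ideas (collection and field names are illustrative, not from the article), the snippet below creates a time series collection for temperature readings, adds a compound secondary index on the metadata and time fields, and runs a typical last point query that returns the most recent reading per city:

  // Create the time series collection
  db.createCollection("weather", {
    timeseries: { timeField: "ts", metaField: "meta", granularity: "hours" }
  })

  // Compound secondary index on metadata + time, which also helps last point queries
  db.weather.createIndex({ "meta.city": 1, ts: -1 })

  // Insert a couple of sample measurements
  db.weather.insertMany([
    { ts: ISODate("1991-08-01T12:00:00Z"), meta: { city: "Valencia" }, tempC: 31.4 },
    { ts: ISODate("1991-08-01T13:00:00Z"), meta: { city: "Valencia" }, tempC: 32.1 }
  ])

  // Last point query: latest reading for each city
  db.weather.aggregate([
    { $sort: { "meta.city": 1, ts: -1 } },
    { $group: { _id: "$meta.city", lastTs: { $first: "$ts" }, lastTempC: { $first: "$tempC" } } }
  ])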
Another recurring challenge for time series applications is data loss from Internet of Things (IoT) applications for industries such as manufacturing, meteorology, and more. As sensors go offline and gaps in your data appear, it becomes much more difficult to run analytics, which require a continuous, uninterrupted flow of data. As a solution, the MongoDB team created densification and gap filling. Densification, executed by the $densify aggregation stage, creates blank, placeholder documents to fill in any missing timestamps. Users can then sort data by time and run the $fill stage for gap filling. This process will estimate and add in any null or missing values in documents based on existing data. By using these two capabilities in tandem, you will get a steady flow of data to input into aggregation pipelines for insights.

Reason 2: Keep everything in house, in one data platform

Juggling different data tools and platforms can be exhausting. Cramming a bunch of separate products and technologies into a single infrastructure can create complex architectures and require significant operational overhead. Additionally, a third-party time series solution may not be compatible with your existing workflows and may necessitate more workarounds just to keep things running smoothly. The MongoDB developer data platform brings together several products and features into a single, intuitive ecosystem, so developers can use MongoDB to address many common needs — from time series data to change streams — while reducing time-consuming maintenance and overhead. As a result, users can take advantage of the full range of MongoDB features to collect, analyze, and transform time series data. You can query time series collections through the MongoDB Compass GUI or the MongoDB Shell, utilize familiar MongoDB capabilities like nesting data within documents, secondary indexes, and operators like $lookup or $merge, and process time series data through aggregation pipelines to extract insights and inform decision making.

Reason 3: Logical ways to organize and access time series data

Time series collections are designed to be efficient, effective, and easy to use. For example, these collections utilize a columnar storage format that is optimized for time series data. This approach ensures efficiency in all database operations, including queries, input/output, WiredTiger cache usage, and storage footprints for both data and secondary indexes. Let’s look, for example, at how querying time series data collections works. When a query is executed, two things happen behind the scenes: bucket unpacking and query rewrites. To begin with, time series collections will automatically unpack buckets — similar to the $unwind command. MongoDB will unscroll compressed data, sort it, and return it to the format in which it was inserted, so that it is easier for users to read and parse. Query rewrites work alongside bucket unpacking to ensure efficiency. To avoid unpacking too many documents (which exacts a toll in time and resource usage), query rewrites use indexes on fields such as timestamps to automatically eliminate buckets that fall outside the desired range. For example, if you are searching for average winter temperatures in Valencia, Spain from 1980 to 2020, you can exclude all temperatures from the spring, summer, and fall months.
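Tying back to the densification and gap-filling capability described under Reason 1, here is a minimal sketch (reusing the hypothetical weather collection from the earlier snippet) that densifies hourly timestamps and linearly interpolates missing temperature values; the step size and field names are illustrative:

  db.weather.aggregate([
    {
      $densify: {
        field: "ts",
        partitionByFields: ["meta.city"],
        range: { step: 1, unit: "hour", bounds: "full" }   // add placeholder docs for missing hours
      }
    },
    {
      $fill: {
        partitionByFields: ["meta.city"],
        sortBy: { ts: 1 },
        output: { tempC: { method: "linear" } }            // interpolate missing temperatures
      }
    }
  ])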
Now that we’ve examined several reasons to consider MongoDB time series collections, we’ll look at two specific use cases.

Use case 1: Algorithmic trading

Algorithmic trading is a major use case for time series data, and this market is predicted to grow to $15 billion by 2028. The strength of algorithms lies in their speed and automation; they reduce the possibility of mistakes stemming from human emotions or reaction time and allow for trading frequency beyond what a human can manage. Trading algorithms also generate vast volumes of time series data, which cannot necessarily be deleted, due to compliance and forecasting needs. MongoDB, however, lets you set archival parameters, to automatically move that data into cheaper cloud object storage after a preset interval of time. This approach preserves your valuable storage space for more recent data.

Using MongoDB products such as Atlas, materialized views, time series collections, and triggers, it is also possible to build a basic trading algorithm. Basically, time series data will be fed into this algorithm, and when the conditions are ideal, the algorithm can buy or sell as needed, thus executing a series of individual trades with cumulative profits and losses (P&L). Although you’ll need a Java app to actually execute the trades, MongoDB can provide a strong foundation on which to build. The structure of such an algorithm is simple. Time series data is loaded from a live feed into MongoDB Atlas, which will then input it into a materialized view to calculate the averages that will serve as the basis of your trades. You can also add a scheduled trigger to execute when new data arrives, thereby refreshing your materialized views, keeping your algorithm up to date, and not losing out on any buying/selling opportunities. To learn more, watch Wojciech Witoszynski’s MongoDB World 2022 presentation on building a simple trading algorithm using MongoDB Atlas, “Algorithmic Trading Made Easy.”
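As a sketch of the materialized-view idea (collection and field names are invented for illustration, not taken from the presentation), the pipeline below computes per-minute average prices from a time series collection of ticks and upserts them into a minute_averages collection, which a scheduled trigger could re-run as new data arrives:

  db.ticks.aggregate([
    // Bucket raw ticks into one-minute windows per symbol
    {
      $group: {
        _id: {
          symbol: "$meta.symbol",
          minute: { $dateTrunc: { date: "$ts", unit: "minute" } }
        },
        avgPrice: { $avg: "$price" },
        tickCount: { $sum: 1 }
      }
    },
    // Upsert the results into the materialized view collection
    {
      $merge: {
        into: "minute_averages",
        on: "_id",
        whenMatched: "replace",
        whenNotMatched: "insert"
      }
    }
  ])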
Use case 2: IoT

Due to the nature of IoT data, such as frequent sensor readings at fixed times throughout a day, IoT applications are ideally suited for time series collections. For example, Confluent, a leading streaming data provider, uses its platform alongside MongoDB Atlas Device Sync, mobile development services, time series collections, and triggers to gather, organize, and analyze IoT data from edge devices. IoT apps often feature high volumes of data taken over time from a wide range of physical sensors, which makes it easy to fill in meta fields and take advantage of densification and gap filling features as described above.

MongoDB’s developer data platform also addresses many of the challenges associated with IoT use cases. To begin with, MongoDB is highly scalable, which is an important advantage, given the huge volumes of data generated by IoT devices. Furthermore, MongoDB includes key features to enable you to make the most of your IoT data in real time. These include change streams for identifying database events as they occur, or functions, which can be pre-scheduled or configured to execute instantaneously to respond to database changes and other events. For users dealing with time-based data, real-time or otherwise, MongoDB’s time series collections offer a seamless, highly optimized way to accelerate operations, remove friction, and use tools, such as triggers, to further analyze and extract value from their data.

Additionally, users no longer have to manually bucket, query, or otherwise troubleshoot time series data; instead, MongoDB does all that work for them. Try MongoDB time series collections for free in MongoDB Atlas.

August 30, 2022

MongoDB Announces New Time Series Capabilities Coming for MongoDB 6.0

MongoDB 6.0 will come with a wealth of exciting new features for customers, including new capabilities for managing time series data. Notable improvements include support for secondary indexes on time series measurements, which makes it easier for users to index data that varies over time. We’ve also made read performance improvements for sort operations. These enhancements come as part of a steady drumbeat of new time series capabilities MongoDB has added since the release of MongoDB 5.0.

Over the past year, we’ve released significant new capabilities that make building and running applications that rely on time series data easier for developers. These features mean developers can manage time series data with greater ease, faster speed, and lower cost than ever before. MongoDB supports time series data for the full data lifecycle, including ingestion, storage, querying, analysis, visualization, and online archiving or automated expiration as data ages. In MongoDB 6.0, extensive support for time series data will be available to all MongoDB customers, not just MongoDB Atlas customers who have opted to participate in the Rapid Release program. Keep reading to learn more about what time series data is, discover what we’ve done with time series collections since MongoDB 5.0, and preview what’s coming next.

The Time Series Landscape

Time series data is a sequence of measurements over a period of time with common metadata. Managing time series workloads is required across a variety of industries. Examples include sensor readings for manufacturing; vehicle-tracking device logs for transportation and logistics; data from consumer IoT devices, such as smart watches; customer interaction data in e-commerce; and financial transactions data for the securities and cryptocurrency industries. Time series data is everywhere, and companies need to collect and analyze this data to understand what is happening right now in their businesses, and to assess future needs quickly. Time series workloads typically have the following attributes:

Inserts arrive in batches that are sequentially ordered by time
Data sets are typically append-only
Older data can be deleted or archived
Queries and aggregations are typically applied to data within a specified time range

With these unique qualities come a variety of challenges for developers looking to build applications that leverage time series data. One is data volume: time series workloads can generate data many times per second, making storage capacity (and associated costs) a significant concern. Another is continuity of data. Gaps in time series data — for example, when sensors go offline — can make analyzing the data significantly more difficult. Time series workloads across numerous industries have rapidly increased. As a result, MongoDB has significantly invested in advancing our capabilities in this space and empowering developers to build best-in-class applications using time series data on MongoDB.

Time Series in MongoDB 5.0

In MongoDB 5.0, we added:

Time series collections: Since MongoDB’s earliest days, developers have used our platform to store time series data. But doing so efficiently required careful, expert data modeling that wasn’t achievable for every organization. Starting with MongoDB 5.0, customers gained access to a new type of collection in MongoDB specifically for time series data that can sit side by side with other collections in a MongoDB database.
Within a time series collection, writes are organized so that data from the same source is stored in the same block alongside other data points from a similar period of time. Data in time series collections is stored in an optimized columnar format ordered by time, which powers performance at scale — drastically reducing storage and I/O for read operations.

Visualization of time series data in MongoDB Charts: Many organizations working with time series data want to analyze it to diagnose issues and predict trends that affect their business. With time series data in MongoDB Charts, customers got instant visibility into trends in their time series data.

Window functions and temporal expressions: The MongoDB Query API expanded with new queries for time series data — specifically, window functions for query operations on related data and temporal expressions to help users uncover hidden patterns quickly.

Secondary indexing on metadata: MongoDB added support for creating secondary indexes on time series metadata. While the default time series indexes already supported queries on time and metadata as a whole, this new capability allowed users to create secondary indexes on specific metadata subfields required for more efficient query execution.

Time Series in MongoDB 5.1

In MongoDB 5.1, we added:

Sharding for time series collections: With the release of MongoDB 5.1, time series collections could take advantage of MongoDB’s native sharding to horizontally distribute massive data sets. This improves throughput and cardinality-handling. Nodes are co-located with data producers to support local write operations and to enforce data sovereignty controls.

Multi-delete operations: Support was added for multi-delete operations against a time series collection’s metadata field. Although most time series collections are append-only, organizations need to be able to delete data to accommodate when customers invoke their right to erasure. The option to execute multi-delete operations gives developers and administrators an easy way to comply with these types of modern data privacy regulations.

Atlas Online Archive support in preview: One common challenge with time series is that the rapid proliferation of data can lead to rising storage costs. With Atlas Online Archive support for time series collections, users can define archiving policies to automatically move aged data out of the database and into lower-cost, fully managed cloud object storage. Users simply define their own archiving policy and Atlas handles all the data movement for them. This allows users to retain all of their time series data for analysis or compliance purposes while also lowering costs.

Broader platform support for time series: MongoDB released broader platform support for time series data, including the ability to create time series collections directly from Atlas Data Explorer, MongoDB Compass, or the MongoDB for VS Code Extension.

Time Series in MongoDB 5.2

In MongoDB 5.2, we added:

Columnar compression: With the addition of columnar compression for time series, organizations are able to dramatically reduce their storage footprint. This new capability means that time series data in BSON is significantly compressed in time series buckets before undergoing even further compression for storage on disk. Columnar compression leads to a 10 to 30x improvement in WiredTiger cache usage; it also significantly improves query performance by reducing disk I/O since more data can fit in memory.
New aggregation operators: The release of MongoDB 5.2 included new operators and enhancements to improve aggregation performance and reduce the number of aggregation stages needed to unlock insights. These new operators are valuable for analyzing a variety of types of data, including time series:

The $top and $bottom operators, which allow users to compute the top and bottom N elements of a dataset and return related fields within the same query without requiring complex logic.
$maxN, $minN, and accumulators such as $firstN and $lastN, which return elements while taking into account the current order of documents in a dataset.

Time Series in MongoDB 5.3

In MongoDB 5.3, we added:

Densification and gap-filling: It is common for time series data to be uneven (for example, when a sensor goes offline and several readings are missed). In order to perform analytics and ensure the correctness of results, however, the data being analyzed needs to be continuous. Densification and gap-filling allow users to better handle missing time series data by creating additional data points to compensate for missing values.

What to Expect With Time Series in MongoDB 6.0

In MongoDB 6.0, we have added the following features for time series:

Secondary indexes on measurements: MongoDB customers will be able to create a secondary or compound index on any field in a time series collection. This enables geo-indexing (for example, tracking changes over time on a fleet of vehicles or equipment). These new index types also provide improved read performance.

Read performance improvements for sort operations: MongoDB 6.0 comes with optimizations to last point queries, which let the user find the most recent entry in a time series data collection. The query executes a distinct scan looking only for the last point rather than executing a scan of the full collection. Also included in this release will be a feature that enables users to leverage a clustered index or secondary index to perform sort operations on time and metadata fields in most scenarios in a more optimized way.

What's Next

Our work on time series will not stop with MongoDB 6.0. We will continue to empower developers to build best-in-class applications using time series data on MongoDB. In future releases, expect to hear about cluster-to-cluster replication for time series data, features to enhance scalability for time series data on MongoDB, and much more. Sign up for the MongoDB Atlas Dedicated Tier (M10+) and choose the MongoDB 6.0.0 Release Candidate to get all the latest innovations for time series workloads on MongoDB.
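To ground a couple of the capabilities described above, here is a small hypothetical sketch: a 2dsphere index on a location measurement in a time series collection (the kind of geo-indexing enabled by secondary indexes on measurements) and a window function that computes a three-reading moving average per vehicle. Collection and field names are assumptions for illustration.

  // Time series collection tracking a vehicle fleet
  db.createCollection("fleet_positions", {
    timeseries: { timeField: "ts", metaField: "vehicleId", granularity: "seconds" }
  })

  // Secondary index on a measurement field (geo-indexing on the position)
  db.fleet_positions.createIndex({ location: "2dsphere" })

  // Window function: three-reading moving average of speed per vehicle
  db.fleet_positions.aggregate([
    {
      $setWindowFields: {
        partitionBy: "$vehicleId",
        sortBy: { ts: 1 },
        output: {
          movingAvgSpeed: {
            $avg: "$speedKph",
            window: { documents: [-2, 0] }   // current reading plus the two before it
          }
        }
      }
    }
  ])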

June 7, 2022

Time Series Data and MongoDB: Part 2 – Schema Design Best Practices

For information about our new time series features launched with MongoDB 5.0, please check out this blog.

In the previous blog post, “Time Series Data and MongoDB: Part 1 – An Introduction,” we introduced the concept of time-series data followed by some discovery questions you can use to help gather requirements for your time-series application. Answers to these questions help guide the schema and MongoDB database configuration needed to support a high-volume production application deployment. In this blog post we will focus on how two different schema designs can impact memory and disk utilization under read, write, update, and delete operations. At the end of the analysis you may find that the best schema design for your application leverages a combination of schema designs. By following the recommendations we lay out below, you will have a good starting point to develop the optimal schema design for your app, and appropriately size your environment.

Designing a time-series schema

Let’s start by saying that there is no one canonical schema design that fits all application scenarios. There will always be trade-offs to consider regardless of the schema you develop. Ideally you want the best balance of memory and disk utilization to yield the best read and write performance that satisfies your application requirements, and that enables you to support both data ingest and analysis of time-series data streams.

In this blog post we will look at various schema design configurations: first, storing one document per data sample, and then bucketing the data using one document per time-series time range and one document per fixed size. Storing more than one data sample per document is known as bucketing. This is implemented at the application level and requires nothing to be configured specifically in MongoDB. With MongoDB’s flexible data model you can optimally bucket your data to yield the best performance and granularity for your application’s requirements.

This flexibility also allows your data model to adapt to new requirements over time – such as capturing data from new hardware sensors that were not part of the original application design. These new sensors provide different metadata and properties than the sensors you used in the original design. With all this flexibility you may think that MongoDB databases are the wild west, where anything goes and you can quickly end up with a database full of disorganized data. MongoDB provides as much control as you need via schema validation, which gives you full control to enforce things like the presence of mandatory fields and ranges of acceptable values, to name a few.

To help illustrate how schema design and bucketing affect performance, consider the scenario where we want to store and analyze historical stock price data. Our sample stock price generator application creates sample data every second for a given number of stocks that it tracks. One second is the smallest time interval of data collected for each stock ticker in this example. If you would like to generate sample data in your own environment, the StockGen tool is available on GitHub. It is important to note that although the sample data in this document uses stock ticks as an example, you can apply these same design concepts to any time-series scenario, like temperature and humidity readings from IoT sensors.
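Since schema validation is mentioned above, here is a brief, hypothetical example of what such a guardrail could look like for the stock tick scenario (the exact rules and field names are illustrative, not taken from the StockGen tool):

  db.createCollection("StockDocPerSecond", {
    validator: {
      $jsonSchema: {
        bsonType: "object",
        required: ["symbol", "p", "d"],          // mandatory fields
        properties: {
          symbol: { bsonType: "string", description: "ticker symbol, required" },
          p: { bsonType: "double", minimum: 0, description: "price, must be non-negative" },
          d: { bsonType: "date", description: "timestamp of the tick" }
        }
      }
    }
  })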
The StockGen tool used to generate sample data will generate the same data and store it in two different collections, StockDocPerSecond and StockDocPerMinute, which contain the following schemas:

Scenario 1: One document per data point

  {
    "_id" : ObjectId("5b4690e047f49a04be523cbd"),
    "p" : 56.56,
    "symbol" : "MDB",
    "d" : ISODate("2018-06-30T00:00:01Z")
  },
  {
    "_id" : ObjectId("5b4690e047f49a04be523cbe"),
    "p" : 56.58,
    "symbol" : "MDB",
    "d" : ISODate("2018-06-30T00:00:02Z")
  },
  ...

Figure 1: Sample documents representing one document per second granularity

Scenario 2: Time-based bucketing of one document per minute

  {
    "_id" : ObjectId("5b5279d1e303d394db6ea0f8"),
    "p" : { "0" : 56.56, "1" : 56.56, "2" : 56.58, ... "59" : 57.02 },
    "symbol" : "MDB",
    "d" : ISODate("2018-06-30T00:00:00Z")
  },
  {
    "_id" : ObjectId("5b5279d1e303d394db6ea134"),
    "p" : { "0" : 69.47, "1" : 69.47, "2" : 68.46, ... "59" : 69.45 },
    "symbol" : "TSLA",
    "d" : ISODate("2018-06-30T00:01:00Z")
  },
  ...

Figure 2: Sample documents representing one minute granularity

Note that the field “p” contains a subdocument with the values for each second of the minute.

Schema design comparisons

Let’s compare and contrast the database metrics of storage size and memory impact based on 4 weeks of data generated by the StockGen tool. Measuring these metrics is useful when assessing database performance.

Effects on data storage

In our application the smallest level of time granularity is a second. Storing one document per second as described in Scenario 1 is the most comfortable model concept for those coming from a relational database background. That is because we are using one document per data point, which is similar to a row per data point in a tabular schema. This design will produce the largest number of documents and collection size per unit of time, as seen in Figures 3 and 4.

Figure 3: Document count over time, comparing per-second vs. per-minute schema design

Figure 4: Comparison between data size and storage size for each scenario

Figure 4 shows two sizes per collection. The first value in the series is the size of the collection that is stored on disk, while the second value is the size of the data in the database. These numbers are different because MongoDB’s WiredTiger storage engine supports compression of data at rest. Logically the PerSecond collection is 605MB, but on disk it is consuming around 190MB of storage space.

Effects on memory utilization

A large number of documents will not only increase data storage consumption but increase index size as well. An index was created on each collection and covered the symbol and date fields. Unlike some key-value databases that position themselves as time-series databases, MongoDB provides secondary indexes giving you flexible access to your data and allowing you to optimize query performance of your application.

Figure 5: Index size (MB) comparison between PerSecond and PerMinute

The size of the index defined in each of the two collections is seen in Figure 5. Optimal performance of MongoDB happens when indexes and the most recently used documents fit into the memory allocated by the WiredTiger cache (we call this the “working set”). In our example we generated data for just 5 stocks over the course of 4 weeks. Given this small test case, our data already generated an index that is 103MB in size for the PerSecond scenario. Keep in mind that there are some optimizations, such as index prefix compression, that help reduce the memory footprint of an index.
However, even with these kinds of optimizations, proper schema design is important to prevent runaway index sizes. Given the trajectory of growth, any changes to the application requirements, like tracking more than just 5 stocks or more than 4 weeks of prices in our sample scenario, will put much more pressure on memory and eventually require indexes to page out to disk. When this happens your performance will be degraded. To mitigate this situation, consider scaling horizontally.

Scale horizontally

As your data grows in size, you may end up scaling horizontally when you reach the physical limits of the server hosting the primary mongod in your MongoDB replica set. By horizontally scaling via MongoDB Sharding, performance can be improved since the indexes and data will be spread over multiple MongoDB nodes. Queries are no longer directed at a specific primary node. Rather, they are processed by an intermediate service called a query router (mongos), which sends the query to the specific nodes that contain the data that satisfies the query. Note that this is completely transparent to the application – MongoDB handles all of the routing for you.

Scenario 3: Size-based bucketing

The key takeaway when comparing the previous scenarios is that bucketing data has significant advantages. Time-based bucketing as described in Scenario 2 buckets an entire minute's worth of data into a single document. In time-based applications such as IoT, sensor data may be generated at irregular intervals and some sensors may provide more data than others. In these scenarios, time-based bucketing may not be the optimal approach to schema design. An alternative strategy is size-based bucketing. With size-based bucketing we design our schema around one document per a certain number of emitted sensor events, or for the entire day, whichever comes first.

To see size-based bucketing in action, consider the scenario where you are storing sensor data and limiting the bucket size to 200 events per document, or a single day (whichever comes first). Note: The 200 limit is an arbitrary number and can be changed as needed, without application changes or schema migrations.

  {
    _id: ObjectId(),
    deviceid: 1234,
    sensorid: 3,
    nsamples: 5,
    day: ISODate("2018-08-29"),
    first: 1535530412,
    last: 1535530432,
    samples: [
      { val: 50, time: 1535530412 },
      { val: 55, time: 1535530415 },
      { val: 56, time: 1535530420 },
      { val: 55, time: 1535530430 },
      { val: 56, time: 1535530432 }
    ]
  }

Figure 6: Size-based bucketing for sparse data

An example size-based bucket is shown in Figure 6. In this design, trying to limit inserts per document to an arbitrary number or a specific time period may seem difficult; however, it is easy to do using an upsert, as shown in the following code example:

  sample = { val: 59, time: 1535530450 }
  day = ISODate("2018-08-29")
  db.iot.updateOne(
    { deviceid: 1234, sensorid: 3, nsamples: { $lt: 200 }, day: day },
    {
      $push: { samples: sample },
      $min: { first: sample.time },
      $max: { last: sample.time },
      $inc: { nsamples: 1 }
    },
    { upsert: true }
  )

Figure 7: Sample code to add to the size-based bucket

As new sensor data comes in, it is simply appended to the document until the number of samples hits 200; then a new document is created because of our upsert:true clause. The optimal index in this scenario would be on {deviceid:1, sensorid:1, day:1, nsamples:1}. When we are updating data, the day is an exact match, and this is super efficient.
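For completeness, creating that index would look like the following (a one-line sketch; the collection name follows the example above):

  // Compound index supporting the upsert's query predicate in the size-based bucketing pattern
  db.iot.createIndex({ deviceid: 1, sensorid: 1, day: 1, nsamples: 1 })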
When querying, we can specify a single date or a date range, which is equally efficient, and we can also filter by first and last using the UNIX timestamps. Note that we are using integer values for these times: they are stored as UNIX timestamps, which take 32 bits of storage versus 64 bits for an ISODate. While this makes little difference to query performance, storing times as UNIX timestamps can be significant if you plan on ingesting terabytes of data and do not need sub-second granularity. Bucketing data at a fixed size yields database storage and index improvements very similar to those seen when bucketing per time in Scenario 2, and it is one of the most efficient ways to store sparse IoT data in MongoDB.

What to do with old data

Should we store all data in perpetuity? Is data older than a certain age useful to your organization? How accessible should older data be? Can it simply be restored from a backup when you need it, or does it need to be online and accessible to users in real time as an active archive for historical analysis? As we covered in part 1 of this blog series, these are some of the questions that should be asked prior to going live. There are multiple approaches to handling old data, and depending on your specific requirements some may be more applicable than others. Choose the one that best fits your requirements.

Pre-aggregation

Does your application really need a single data point for every event generated years ago? In most cases the resource cost of keeping that granularity of data around outweighs the benefit of being able to query down to that level at any time. Instead, data can usually be pre-aggregated and stored for fast querying. In our stock example, we may want to store only the closing price for each day. In most architectures, pre-aggregated values are stored in a separate collection, since queries for historical data are typically different from real-time queries: historical queries usually look for trends over time rather than individual real-time events. By storing this data in a different collection you can increase performance by creating more efficient indexes, as opposed to piling more indexes on top of real-time data (see the aggregation sketch at the end of this section).

Offline archival strategies

When data is archived, what is the SLA associated with retrieving it? Is restoring a backup acceptable, or does the data need to be online and ready to be queried at any given time? Answers to these questions will help drive your archive design. If you do not need real-time access to archival data, you may want to consider backing up the data and removing it from the live database. Production databases can be backed up using MongoDB Ops Manager or, if you are using the MongoDB Atlas service, with its fully managed backup solution.

Removing documents using the remove statement

Once data is copied to an archival repository via a database backup or an ETL process, it can be removed from a MongoDB collection with the remove statement:

    db.StockDocPerSecond.remove({ "d": { $lt: ISODate("2018-03-01") } })

In this example, all documents whose “d” field holds a date before March 1st, 2018 are removed from the StockDocPerSecond collection. You may need to set up an automation script to run periodically to clean out these records. Alternatively, you can avoid automation scripts in this scenario by defining a time-to-live (TTL) index.
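Returning to the pre-aggregation idea above, here is a minimal sketch of how daily closing prices could be derived from the per-second collection. The target collection name, StockDailyClose, is hypothetical and not part of the original example.

    // Illustrative sketch: compute one closing price per symbol per day from the
    // raw per-second data and write it to a separate, hypothetical collection.
    db.StockDocPerSecond.aggregate([
        { $sort: { symbol: 1, d: 1 } },            // establish time order per symbol
        { $group: {
            _id: {
                symbol: "$symbol",
                day: { $dateToString: { format: "%Y-%m-%d", date: "$d" } }
            },
            close: { $last: "$p" }                 // last price seen for that day
        } },
        { $out: "StockDailyClose" }                // hypothetical pre-aggregated collection
    ], { allowDiskUse: true })                     // the sort may exceed the in-memory limit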
Removing documents using a TTL index

A TTL index is similar to a regular index except that you define a time interval after which documents are automatically removed from the collection. In our example, we could create a TTL index that automatically deletes data older than one week:

    db.StockDocPerSecond.createIndex({ "d": 1 }, { expireAfterSeconds: 604800 })

Although TTL indexes are convenient, keep in mind that the background deletion task runs roughly once a minute, so you cannot control exactly when documents are removed. If you need more control, for example so that deletions do not happen during specific times of the day, you may want to schedule a batch job that performs the deletion instead of using a TTL index.

Removing documents by dropping the collection

It is important to note that using the remove command or TTL indexes causes high disk I/O, which may not be desirable on a database already under heavy load. The most efficient and fastest way to remove records from the live database is to drop the collection. If you can design your application so that each collection represents a block of time, then when you need to archive or remove data, all you have to do is drop the collection. This may require some smarts in your application code to know which collections should be queried, but the benefit may outweigh this change. When you issue a remove, MongoDB also has to remove the corresponding entries from all affected indexes, which can take a while depending on the size of the data and indexes.

Online archival strategies

If archival data still needs to be accessed in real time, consider how frequently those queries occur and whether storing only pre-aggregated results would be sufficient.

Sharding archival data

One strategy for archiving data while keeping it accessible in real time is to use zoned sharding to partition the data. Sharding not only helps scale data horizontally across multiple nodes; you can also tag shard ranges so that partitions of data are pinned to specific shards. A cost-saving measure could be to keep the archival data on shards running lower-cost disks and to periodically adjust the time ranges defined on the zones. These ranges cause the balancer to automatically move data between the storage layers, providing tiered, multi-temperature storage (a small configuration sketch appears after the key tips below). Review our tutorial for creating tiered storage patterns with zoned sharding for more information.

Accessing archived data via queryable backups

If your archive data is not accessed frequently and query performance does not need to meet strict latency SLAs, consider backing the data up and using the Queryable Backups feature of MongoDB Atlas or MongoDB Ops Manager. Queryable Backups allow you to connect to your backup and issue read-only commands against it, without having to restore the backup first.

Querying data from the data lake

MongoDB is an inexpensive solution not only for long-term archival but for your data lake as well. Companies that have invested in technologies like Apache Spark can leverage the MongoDB Spark Connector. This connector materializes MongoDB data as DataFrames and Datasets for use with Spark and its machine learning, graph, streaming, and SQL APIs.

Key Takeaways

Once an application is live in production and holds multiple terabytes of data, any major change can be very expensive from a resource standpoint. Consider the scenario where you have 6 TB of IoT sensor data and are accumulating new data at a rate of 50,000 inserts per second.
Read performance is starting to become an issue, and you realize that you have not properly scaled out the database. Unless you are willing to take application downtime, a schema change in this configuration – e.g., moving from raw data storage to bucketed storage – may require building shims, temporary staging areas, and all sorts of transient solutions to move the application to the new schema. The moral of the story is to plan for growth and design up front the time-series schema that best fits your application’s SLAs and requirements.

This article analyzed two different schema designs for storing time-series stock price data. Is the schema that won out for this stock price database the one that will be best in your scenario? Maybe. Due to the nature of time-series data and its typically rapid ingestion, the answer may in fact be a combination of collections, each targeting a read-heavy or write-heavy use case. The good news is that with MongoDB’s flexible schema, it is easy to make changes; in fact, you can run two different versions of the app writing two different schemas to the same collection. However, don’t wait until your query performance starts suffering to figure out an optimal design, as migrating terabytes of existing documents into a new schema can take time and resources and delay future releases of your application. You should undertake real-world testing before committing to a final design. As the famous proverb goes, “Measure twice and cut once.”

In the next blog post, “Querying, Analyzing, and Presenting Time-Series Data with MongoDB,” we will look at how to effectively get value from the time-series data stored in MongoDB.

Key Tips:

- The MMAPV1 storage engine is deprecated, so use the default WiredTiger storage engine. Note that older schema design best practices from a few years ago were often built on the older MMAPV1 technology.
- Understand what the data access requirements are for your time-series application.
- Schema design impacts resources. “Measure twice and cut once” with respect to schema design and indexes.
- Test schema patterns with real data and a real application if possible.
- Bucketing data reduces index size and thus massively reduces hardware requirements.
- Time-series applications traditionally capture very large amounts of data, so only create indexes where they will be useful to the app’s query patterns.
- Consider more than one collection: one focused on write-heavy inserts and recent-data queries, and another with bucketed data focused on historical queries over pre-aggregated data.
- When the size of your indexes exceeds the amount of memory on the server hosting MongoDB, consider horizontally scaling out to spread the indexes and load over multiple servers.
- Determine at what point data expires, and what action to take, such as archival or deletion.
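As promised in the tiered storage discussion above, here is a minimal sketch of pinning date ranges to zones with zoned sharding. The shard names, zone names, the "stocks" database, and the choice of the date field “d” as part of the shard key are all hypothetical.

    // Illustrative sketch of zoned sharding for tiered storage; shard names,
    // zone names, and the date-based shard key are hypothetical, and the
    // collection must already be sharded on that key.
    sh.addShardToZone("shardFastSSD", "recent")
    sh.addShardToZone("shardSlowHDD", "archive")

    // Pin older data to the archive zone and newer data to the recent zone.
    // Periodically moving this boundary lets the balancer migrate aging data.
    sh.updateZoneKeyRange("stocks.StockDocPerSecond",
        { d: MinKey }, { d: ISODate("2018-06-01") }, "archive")
    sh.updateZoneKeyRange("stocks.StockDocPerSecond",
        { d: ISODate("2018-06-01") }, { d: MaxKey }, "recent")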

September 13, 2018

Time Series Data and MongoDB: Part 1 – An Introduction

For information about our new time series features launched with MongoDB 5.0, please check out this blog.

Time-series data is increasingly at the heart of modern applications – think IoT, stock trading, clickstreams, social media, and more. With the move from batch to real-time systems, the efficient capture and analysis of time-series data enables organizations to detect and respond to events ahead of their competitors, and to improve operational efficiency to reduce cost and risk. Working with time-series data is often different from working with regular application data, and there are best practices you should observe. This blog series seeks to provide these best practices as you build out your time-series application on MongoDB:

- Introduce the concept of time-series data, and describe some of the challenges associated with this type of data
- Show how to query, analyze and present time-series data
- Provide discovery questions that will help you gather the technical requirements needed for successfully delivering a time-series application

What is time-series data?

While not all data is time-series in nature, a growing percentage of it can be classified as time-series – fueled by technologies that allow us to exploit streams of data in real time rather than in batches. In every industry and in every company there exists the need to query, analyze and report on time-series data. Consider a stock day trader constantly looking at feeds of stock prices over time and running algorithms to analyze trends and identify opportunities. They are looking at data over a time interval, e.g. hourly or daily ranges. A connected-car company might obtain telemetry such as engine performance and energy consumption to improve component design, and monitor wear rates so they can schedule vehicle servicing before problems occur. They, too, are looking at data over a time interval.

Why is time-series data challenging?

Time-series data can be captured at constant time intervals – like a device measurement per second – or at irregular intervals, as with alerting and auditing use cases. It is also often tagged with attributes such as the device type and the location of the event, and each device may provide a variable amount of additional metadata. The data model flexibility needed to meet such diverse and rapidly changing ingestion and storage requirements is difficult for traditional relational (tabular) database systems with a rigid schema to provide.

There is also the issue of scalability. With a high frequency of readings generated by many sensors or events, time-series applications can generate vast streams of data that need to be ingested and analyzed. Platforms that allow data to be scaled out and distributed across many nodes are therefore much better suited to this type of use case than scale-up, monolithic tabular databases.

Time-series data can come from different sources, each generating different attributes that need to be stored and analyzed. And each stage of the data lifecycle places different demands on a database – from ingestion through to consumption and archival. During data ingestion, the database is primarily performing write-intensive operations, comprising mainly inserts with occasional updates. Consumers of the data may want to be alerted in real time when an anomaly is detected in the data stream during ingestion, such as a value exceeding a certain threshold.
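One way to surface such anomalies as they arrive is a change stream. The sketch below is illustrative only – the sensors collection, val field, and threshold of 100 are hypothetical – and it assumes a replica set, which change streams require.

    // Illustrative sketch: watch for newly inserted readings that exceed a threshold.
    var alertCursor = db.sensors.watch([
        { $match: { operationType: "insert", "fullDocument.val": { $gt: 100 } } }
    ])

    // hasNext() blocks until the next matching event arrives; an alerting process
    // would loop over the cursor and notify operators for each out-of-range reading.
    while (alertCursor.hasNext()) {
        printjson(alertCursor.next())
    }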
As more data is ingested, consumers may want to query it for specific insights and to uncover trends. At this stage of the data lifecycle, the workload is read-heavy rather than write-heavy, but the database still needs to maintain high write rates, as data is concurrently ingested and queried. Consumers may also want to query historical data and perform predictive analytics, leveraging machine learning algorithms to anticipate future behavior or identify trends – imposing additional read load on the database. Finally, depending on the application’s requirements, the captured data may have a shelf life and need to be archived or deleted after a certain period of time. As you can see, working with time-series data is not simply a matter of storing it; it requires a wide range of data platform capabilities, including handling simultaneous read and write demands, advanced querying, and archival, to name a few.

Who is using MongoDB for time-series data?

MongoDB provides all the capabilities needed to meet the demands of highly performant time-series applications. One company that took advantage of MongoDB’s time-series capabilities is the quantitative investment manager Man AHL. Man AHL’s Arctic application leverages MongoDB to store high-frequency financial services market data (about 250M ticks per second). The hedge fund manager’s quantitative researchers (“quants”) use Arctic and MongoDB to research, construct and deploy new trading models in order to understand how markets behave. With MongoDB, Man AHL realized a 40x cost saving compared to an existing proprietary database, and increased processing performance by 25x over the previous solution. Man AHL open sourced their Arctic project on GitHub.

Bosch Group is a multinational engineering conglomerate with nearly 300,000 employees and the world’s largest automotive components manufacturer. IoT is a strategic initiative at Bosch, and the company selected MongoDB as the data platform layer in its IoT suite. The suite powers IoT applications both within the Bosch group and for many of its customers in industrial internet applications, such as automotive, manufacturing, smart city, precision agriculture, and more. If you want to learn more about the key challenges of managing diverse, rapidly changing and high-volume time-series data sets generated by IoT platforms, download the Bosch and MongoDB whitepaper.

Siemens is a global company focusing on the areas of electrification, automation and digitalization. Siemens developed “Monet,” a platform backed by MongoDB that provides advanced energy management services. Monet uses MongoDB for real-time raw data storage, querying and analytics.

Focus on application requirements

When working with time-series data, it is imperative that you invest enough time to understand how data is going to be created, queried, and expired. With this information you can optimize your schema design and deployment architecture to best meet the application’s requirements. You should not agree to performance metrics or SLAs without first capturing the application’s requirements. As you begin your time-series project with MongoDB, you should get answers to the following questions:

Write workload

What will the ingestion rate be? How many inserts and updates per second?
As the rate of inserts increases, your design may benefit from horizontal scaling via MongoDB auto-sharding, allowing you to partition and scale your data across many nodes.

How many simultaneous client connections will there be? While a single MongoDB node can handle many simultaneous connections from tens of thousands of IoT devices, you may need to scale out with sharding to meet the expected client load.

Do you need to store all raw data points, or can data be pre-aggregated? If pre-aggregated, what summary granularity or interval is acceptable to store? Per minute? Every 15 minutes? MongoDB can store all your raw data if your application requirements justify it. However, keep in mind that reducing the data size via pre-aggregation yields smaller dataset and index storage and better query performance.

What is the size of the data stored in each event? MongoDB has an individual document size limit of 16 MB. If your application requires storing larger data within a single document, such as binary files, you may want to leverage MongoDB GridFS. Ideally, when storing high-volume time-series data, it is a best practice to keep the document size small, around one disk block in size.

Read workload

How many read queries per second? A higher read query load may benefit from additional indexes or horizontal scaling via MongoDB auto-sharding. As with write volumes, reads can be scaled with auto-sharding, and you can also distribute read load across secondary replicas in your replica set.

Will clients be geographically dispersed or located in the same region? You can reduce network read latency by deploying read-only secondary replicas that are geographically closer to the consumers of the data.

What are the common data access patterns you need to support? For example, will you retrieve data by a single value such as time, or do you need more complex queries that look for data by a combination of attributes, such as event class, region, and time? Query performance is optimal when proper indexes are created; knowing how data is queried and defining the proper indexes is critical to database performance. Being able to modify indexing strategies in real time, without disruption to the system, is also an important attribute of a time-series platform.

What analytical libraries or tools will your consumers use? If your data consumers are using tools like Hadoop or Spark, MongoDB has a MongoDB Spark Connector that integrates with these technologies. MongoDB also has drivers for Python, R, Matlab and other platforms used for analytics and data science.

Does your organization use BI visualization tools to create reports or analyze the data? MongoDB integrates with most of the major BI reporting tools, including Tableau, QlikView, MicroStrategy, TIBCO, and others, via the MongoDB BI Connector. MongoDB also has a native BI reporting tool, MongoDB Charts, which provides the fastest way to visualize your data in MongoDB without needing any third-party products.

Data retention and archival

What is the data retention policy? Can data be deleted or archived? If so, at what age? If archived, for how long, and how accessible should the archive be? Does archive data need to be live, or can it be restored from a backup? There are various strategies to remove and archive data in MongoDB.
Some of these strategies include using TTL indexes, Queryable Backups, zoned sharding (allowing you to create a tiered storage pattern), or simply designing an architecture where you drop a collection of data when it is no longer needed.

Security

What users and roles need to be defined, and what is the least-privileged permission needed for each of these entities? (A brief sketch of defining a least-privileged user follows at the end of this post.) What are the encryption requirements? Do you need to support both in-flight (network) and at-rest (storage) encryption of time-series data? Do all activities against the data need to be captured in an audit log? Does the application need to conform with GDPR, HIPAA, PCI, or any other regulatory framework? A regulatory framework may require enabling encryption, auditing, and other security measures. MongoDB supports the security configurations necessary for these compliance regimes, including encryption at rest and in flight, auditing, and granular role-based access control.

While this is not an exhaustive list of everything to consider, it should get you thinking about your application’s requirements and their impact on the design of the MongoDB schema and database configuration. In the next blog post, “Part 2: Schema design for time-series data in MongoDB,” we will explore a variety of ways to architect a schema for different sets of requirements, and their corresponding effects on the application’s performance and scale. In part 3, “Time Series Data and MongoDB: Part 3 – Querying, Analyzing, and Presenting Time-Series Data,” we will show how to query, analyze and present time-series data.
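Returning to the least-privilege question above, here is a minimal sketch of creating a read-only analytics user. The "iot" database, user name, and password are placeholders, not part of the original article.

    // Illustrative sketch: a least-privileged, read-only user for analysts.
    var iotdb = db.getSiblingDB("iot")
    iotdb.createUser({
        user: "analyst",
        pwd: "changeMe",                          // placeholder; use a strong secret
        roles: [ { role: "read", db: "iot" } ]
    })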

September 6, 2018