5 Key Questions for App-Driven Analytics

Mat Keep and Steve Jurczak

Note: This article originally appeared in The New Stack.

Data that powers applications and data that powers analytics typically live in separate domains of the data estate. They are kept apart mainly because they serve different strategic purposes: applications engage customers, while analytics deliver insight. The two classes of workloads also have different requirements, such as read and write access patterns, concurrency, and latency, so organizations typically deploy purpose-built databases and duplicate data between them to satisfy each use case.

As distinct as these systems are, they're also highly interdependent in today's digital economy. Application data is fed into analytics platforms where it's combined and enriched with other operational and historical data, supplemented with business intelligence (BI), machine learning (ML) and predictive analytics, and sometimes fed back to applications to deliver richer experiences. Picture, for example, an ecommerce system that segments users by demographic data and past purchases and then serves relevant recommendations when they next visit the website.

The process of moving data between the two types of systems is here to stay. But today, that's not enough. The current digital economy, and the seamless user experiences customers have come to expect, require that applications become smarter, autonomously taking intelligent actions in real time on our behalf. Along with smarter apps, businesses want insights faster so they know what is happening “in the moment.”

To meet these demands, we can no longer rely only on copying data out of our operational systems into centralized analytics stores. Moving data takes time and creates too much separation between application events and analytical actions. Instead, analytics processing must be “shifted left” to the source of the data—to the applications themselves. We call this shift application-driven analytics. And it’s a shift that both developers and analytics teams need to be ready to embrace.

Defining required capabilities

Embracing the shift is one thing; having the capabilities to implement it is another. In this article, we break down the capabilities required to implement application-driven analytics into the following five critical questions for developers:

  1. How do developers access the tools they need to build sophisticated analytics queries directly into their application code?

  2. How do developers make sense of voluminous streams of time series data?

  3. How do developers create intelligent applications that automatically react to events in real time?

  4. How do developers combine live application data in hot database storage with aged data in cooler cloud storage to make predictions?

  5. How can developers bring analytics into applications without compromising performance?

To take a deeper dive into app-driven analytics—including specific requirements for developers compared with data analysts and real-world success stories—download our white paper: Application-Driven Analytics.

1. How do developers access the tools they need to build sophisticated analytics queries directly into their application code?


To unlock the latent power of application data that exists across the data estate, developers rely on the ability to perform CRUD operations, sophisticated aggregations, and data transformations. The primary tool for delivering these capabilities is an API that lets them query data any way they need, from simple lookups to building more sophisticated data processing pipelines. Developers need that API implemented as an idiomatic extension of their preferred programming language so they can stay in a flow state as they work through problems.
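To make this concrete, here is a minimal PyMongo sketch; the connection string and the "shop.orders" collection, along with its customer_id, status, total, and created_at fields, are hypothetical stand-ins chosen for illustration.

    from pymongo import MongoClient

    # Connect to a MongoDB deployment (URI is a placeholder).
    client = MongoClient("mongodb://localhost:27017")
    orders = client["shop"]["orders"]

    # A simple lookup: the ten most recent shipped orders.
    recent = orders.find({"status": "shipped"}).sort("created_at", -1).limit(10)

    # A more sophisticated aggregation pipeline, built directly in application code:
    # total spend per customer across completed orders, highest spenders first.
    pipeline = [
        {"$match": {"status": "complete"}},
        {"$group": {"_id": "$customer_id", "total_spend": {"$sum": "$total"}}},
        {"$match": {"total_spend": {"$gt": 1000}}},
        {"$sort": {"total_spend": -1}},
    ]
    for doc in orders.aggregate(pipeline):
        print(doc["_id"], doc["total_spend"])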

Alongside a powerful API, developers need a versatile query engine and indexes that return results as efficiently as possible. Without an index, the database engine has to scan every record to find a match; with one, it can find relevant results faster and with less overhead.
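Continuing the hypothetical sketch above, indexes can be created and inspected directly from application code; the exact fields indexed here are an assumption for illustration.

    from pymongo import ASCENDING, DESCENDING

    # A compound index lets the lookup above be served without scanning every record.
    orders.create_index([("status", ASCENDING), ("created_at", DESCENDING)])

    # Inspect the query plan: an IXSCAN stage confirms the index is used
    # instead of a full COLLSCAN.
    plan = orders.find({"status": "shipped"}).sort("created_at", -1).explain()
    print(plan["queryPlanner"]["winningPlan"])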

Once developers start interacting with the database systematically, they need tools that can give them visibility into query performance so they can tune and optimize. Powerful tools like MongoDB Compass let users monitor real-time server and database metrics as well as visualize performance issues. Additionally, column-oriented representation of data can be used to power in-app visualizations and analytics on top of transactional data. Other MongoDB Atlas tools can be used to make performance recommendations, such as index and schema suggestions to further streamline database queries.

2. How do developers make sense of voluminous streams of time series data?


Time series data is common in many modern applications. Internet of Things (IoT) sensor data, financial trades, clickstreams, and logs all enable businesses to surface valuable insights. To help, MongoDB developed the highly optimized time series collection type and clustered indexes. Built on a highly compressible columnar storage format, time series collections can reduce storage and I/O overhead by as much as 70%.
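As a rough sketch, a time series collection can be created straight from the driver (MongoDB 5.0 or later); the "telemetry.readings" collection and its sensor fields are hypothetical.

    from datetime import datetime, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["telemetry"]

    # Create a time series collection: "ts" is the time field, "sensor" holds
    # metadata that rarely changes (device ID, site).
    db.create_collection(
        "readings",
        timeseries={"timeField": "ts", "metaField": "sensor", "granularity": "minutes"},
    )

    db["readings"].insert_one({
        "ts": datetime.now(timezone.utc),
        "sensor": {"device_id": "pump-17", "site": "plant-a"},
        "temperature_c": 71.4,
    })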

Developers need the ability to query and analyze this data across rolling time windows while filling any gaps in incoming data. They also need a way to visualize this data in real time to understand complex trends.
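One way to express this, continuing the hypothetical readings collection above, is with the standard $densify, $fill, and $setWindowFields aggregation stages (MongoDB 5.3 or later); the one-hour window and linear interpolation are illustrative choices, not prescriptions.

    # Rolling one-hour average per device, with gaps in the raw readings filled first.
    pipeline = [
        # Insert any missing minute-by-minute time points per device.
        {"$densify": {
            "field": "ts",
            "partitionByFields": ["sensor.device_id"],
            "range": {"step": 1, "unit": "minute", "bounds": "full"},
        }},
        # Linearly interpolate temperature values for the inserted points.
        {"$fill": {
            "partitionBy": "$sensor.device_id",
            "sortBy": {"ts": 1},
            "output": {"temperature_c": {"method": "linear"}},
        }},
        # Compute a rolling average over the trailing 60 minutes.
        {"$setWindowFields": {
            "partitionBy": "$sensor.device_id",
            "sortBy": {"ts": 1},
            "output": {"rolling_avg_temp": {
                "$avg": "$temperature_c",
                "window": {"range": [-60, 0], "unit": "minute"},
            }},
        }},
    ]
    for doc in db["readings"].aggregate(pipeline):
        print(doc["ts"], doc.get("rolling_avg_temp"))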

Another key requirement is a mechanism that automates the management of the time series data lifecycle. As data ages, it should be moved out of hot storage to avoid congestion on live systems; however, there is still value in that data, especially in aggregated form to provide historical analysis. So, organizations need a systematic way of tiering that data into low-cost object storage in order to maintain their ability to access and query that data for the insights it can surface.

3. How do developers create intelligent applications that automatically react to events in real time?


Modern applications must be able to continuously analyze data in real time as they react to live events. Dynamic pricing in a ride-hailing service, recalculating delivery times in a logistics app due to changing traffic conditions, triggering a service call when a factory machine component starts to fail, or initiating a trade when stock markets move—these are just a few examples of in-app analytics that require continuous, real-time data analysis.

MongoDB Atlas has a host of capabilities to support these requirements. With change streams, for example, all database changes are published to an API, notifying subscribing applications when an event matches predefined criteria. Atlas triggers and functions can then automatically execute application code in response to the event, allowing you to build reactive, real-time, in-app analytics.
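Here is a minimal change stream sketch in PyMongo (change streams require a replica set, which Atlas clusters are); the collection, field names, and the 500 threshold are hypothetical. Atlas triggers and functions can react to the same events without the application running its own listener process.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # replica set or Atlas cluster
    orders = client["shop"]["orders"]

    # Server-side filter: only wake the application for inserts of high-value orders.
    pipeline = [
        {"$match": {"operationType": "insert", "fullDocument.total": {"$gte": 500}}}
    ]

    with orders.watch(pipeline) as stream:
        for change in stream:
            order = change["fullDocument"]
            # React in real time, e.g. recalculate pricing or send a notification.
            print("high-value order:", order["_id"], order["total"])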

4. How do developers combine live application data in hot database storage with aged data in cooler cloud storage to make predictions?


Data is increasingly distributed across different applications, microservices, and even cloud providers. Some of that data, such as newly ingested time series measurements or orders placed in your ecommerce store, resides in hot database storage. Other data sets consist of older data that may be archived in lower-cost cloud object storage.

Organizations must be able to query, blend, and analyze fresh data coming in from microservices and IoT devices alongside cooler data residing in object stores, as well as data from APIs and third-party sources, in ways not possible with conventional databases. The ability to bring all key data assets together is critical for understanding trends and making predictions, whether that analysis is performed by a human or as part of a machine learning process.
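As an illustrative sketch, the $unionWith aggregation stage can blend a live collection with an archived one in a single query; the "orders_archive" collection standing in for data tiered to object storage (for example, behind a federated query layer) is an assumption made for this example.

    from datetime import datetime, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["shop"]

    since = datetime(2024, 1, 1, tzinfo=timezone.utc)

    # Merge hot data ("orders") with cold data ("orders_archive") before aggregating,
    # so trends span the full history of each customer.
    pipeline = [
        {"$match": {"created_at": {"$gte": since}}},
        {"$unionWith": {
            "coll": "orders_archive",
            "pipeline": [{"$match": {"created_at": {"$gte": since}}}],
        }},
        {"$group": {"_id": "$customer_id", "orders": {"$sum": 1}, "spend": {"$sum": "$total"}}},
    ]
    trends = list(db["orders"].aggregate(pipeline))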

5. How can developers bring analytics into applications without compromising performance?


Live, customer-facing applications need to serve many concurrent users with low, predictable latency, and they need to do so consistently at scale. Any slowdown degrades the customer experience and drives customers toward competitors. In one frequently cited study, Amazon found that just 100 milliseconds of extra load time cost it 1% in sales. So it's critical that analytics queries on live data don't affect application performance.

A distributed architecture can help you enforce isolation between the transactional and analytical sides of an application within a single database cluster. You can also use sophisticated replication techniques to move data to systems that are totally isolated but look like a single system to the app.
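One concrete form of that isolation is read-preference routing: analytical reads are directed to dedicated secondary nodes so they never compete with the operational workload. The sketch below assumes an Atlas-style cluster whose analytics nodes carry a "nodeType: ANALYTICS" replica set tag; on a self-managed deployment you would substitute your own tags, and the URI and collection are placeholders.

    from pymongo import MongoClient
    from pymongo.read_preferences import Secondary

    client = MongoClient("mongodb+srv://cluster.example.mongodb.net")  # placeholder URI

    # A collection handle whose reads go only to tagged analytics nodes.
    analytics_orders = client["shop"].get_collection(
        "orders",
        read_preference=Secondary(tag_sets=[{"nodeType": "ANALYTICS"}]),
    )

    # Long-running analytical queries run here without touching the primary.
    report = list(analytics_orders.aggregate([
        {"$group": {"_id": "$status", "count": {"$sum": 1}}},
    ]))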

Next steps to app-driven analytics

As application-driven analytics becomes pervasive, the MongoDB Atlas developer data platform unifies the core data services needed to make smarter apps and improved business visibility a reality.

Atlas does this by seamlessly bridging the traditional divide between transactional and analytical workloads in an elegant and integrated data architecture. With MongoDB Atlas, you get a single platform managing a common data set for both developers and analysts.

With its flexible document data model and unified query interface, the Atlas platform minimizes data movement and duplication and eliminates data silos and architectural complexity while unlocking analytics faster and at lower cost on live operational data. It does all this while meeting the most demanding requirements for resilience, scale, and data privacy.

For more information about how to implement app-driven analytics and how the MongoDB developer data platform gives you the tools needed to succeed, download our white paper, Application-Driven Analytics.