PLAID, Inc. Optimizes Real-Time Data With MongoDB Atlas Stream Processing
July 17, 2025
A MongoDB customer since 2015, Tokyo, Japan-based PLAID, Inc. works to “maximize the value of people with the power of data,” according to the company’s mission statement. PLAID’s customer experience platform, KARTE, analyzes and visualizes website and application users’ data in real time, offering the company’s customers a one-stop solution that helps them better understand their customers and provide personalized experiences.
After running a self-hosted MongoDB instance for several years, PLAID adopted MongoDB Atlas, a fully managed suite of cloud database services, in 2021. Subsequently, however, the company ran into real-time data challenges.
Specifically, PLAID faced challenges when migrating an existing batch processing system that sent real-time data from MongoDB Atlas to Google BigQuery, which helps organizations “go from data to AI action faster.” While the initial cloud setup with Kafka connectors provided valuable streaming capabilities, capturing events from MongoDB and streaming them to BigQuery, both the cost and the operational complexity scaled with the number of pipelines. The staging environment, which required duplicate pipelines, further exacerbated the issue, and the rising costs threatened PLAID's ability to scale and expand its real-time data processing system efficiently.
Easy event data processing with Atlas Stream Processing
To address these challenges, PLAID turned to MongoDB Atlas Stream Processing, which enables development teams to process streams of complex data using the same query API used in their MongoDB Atlas databases.
Atlas Stream Processing provided PLAID with a cost-effective way of acquiring and processing event data in real time, all while being natively integrated within their existing MongoDB Atlas environment for a seamless developer experience. This allowed them to replace some of their costly Kafka source connectors while maintaining the overall data flow to BigQuery via their existing Confluent Cloud Kafka setup.
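To illustrate how a stream processor can replace a Kafka source connector, a minimal sketch follows: a `$source` stage reads the change stream of an Atlas collection, and a `$emit` stage forwards each event to a Kafka topic. The connection names, database, collection, and topic below are hypothetical placeholders, not details of PLAID's actual setup.

```javascript
// Sketch of a stream processor pipeline that mirrors change events
// from an Atlas collection into a Kafka topic. All names here
// (atlasConn, kafkaConn, karte/events, karte.events) are illustrative.
const pipeline = [
  {
    // Read the change stream of an Atlas collection through a
    // connection registered on the stream processing instance.
    $source: {
      connectionName: "atlasConn", // hypothetical Atlas connection
      db: "karte",
      coll: "events",
    },
  },
  {
    // Forward each change event to Kafka through a Kafka connection.
    $emit: {
      connectionName: "kafkaConn", // hypothetical Kafka connection
      topic: "karte.events",
    },
  },
];

// In mongosh connected to the stream processing instance, the processor
// would then be registered and started with:
//   sp.createStreamProcessor("eventsToKafka", pipeline);
//   sp.eventsToKafka.start();
```

Because the pipeline is an ordinary aggregation pipeline, transformation stages such as `$match` or `$project` can be added between `$source` and `$emit` without any connector-specific configuration.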
Key aspects of the solution included:
- Replacing Kafka source connectors: Atlas Stream Processing efficiently captures event data from MongoDB Atlas databases and writes it to Kafka, reducing the costs associated with the previous Kafka source connectors.
- MongoDB Atlas Stream Processing:
- Stream processing instance (SPI): PLAID used SPIs, where cost is determined by the instance tier and the number of workers, which in turn depends on the number of stream processors. This offered a more optimized cost structure compared to the previous connector-task-based pricing.
- Connection management: Atlas Stream Processing simplifies connection management. Connecting to Atlas databases is straightforward, and a single connection can be used for the Kafka cluster.
- Stream processors: These processing units perform data transformation and routing with the same aggregation pipelines used by MongoDB databases. Thus, the PLAID team leveraged their existing MongoDB knowledge to define pipeline logic, making the transition smoother.
- Custom backfill mechanism: To address the lack of a backfill feature in Stream Processing, PLAID developed a custom application to synchronize existing data.
- Custom metric collection: Since native monitoring integration with Datadog was unavailable, PLAID created a bot to collect Atlas Stream Processing metrics and send them to Datadog for monitoring and alerting.
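A minimal sketch of how such a metric-collection bot might shape its data, assuming each processor's statistics (e.g. input and output message counts, as reported by a processor's `stats()` command) are converted into a Datadog metrics payload for the `v1/series` API. The metric names, stats field names, and tagging scheme below are illustrative assumptions, not PLAID's implementation.

```javascript
// Convert one stream processor's stats into a Datadog v1/series payload.
// The stats shape (inputMessageCount, outputMessageCount) is assumed;
// verify the actual field names against the stats() output.
function toDatadogSeries(processorName, stats, timestampSec) {
  const tags = [`processor:${processorName}`]; // illustrative tag scheme
  return {
    series: [
      {
        metric: "atlas.stream_processing.input_messages",
        points: [[timestampSec, stats.inputMessageCount]],
        type: "count",
        tags,
      },
      {
        metric: "atlas.stream_processing.output_messages",
        points: [[timestampSec, stats.outputMessageCount]],
        type: "count",
        tags,
      },
    ],
  };
}

// The bot would poll each processor for stats on a schedule, then POST
// the resulting payload to Datadog's metrics API (v1/series endpoint,
// authenticated with a DD-API-KEY header) to drive dashboards and alerts.
```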
“Atlas Stream Processing provided us with a robust solution for real-time data processing, which has significantly reduced costs and improved scalability throughout our platform.”
— Hajime Shiozawa, senior software engineer, PLAID, Inc.
The outcome: Lower costs, improved efficiency
By implementing MongoDB Atlas Stream Processing, PLAID achieved significant improvements, from reduced costs to greater operational efficiency:
- Reduced costs: PLAID eliminated the cost structure that was proportional to the number of pipelines, resulting in substantial cost savings. The new cost model, based on Atlas Stream Processing workers, offered a more scalable and predictable pricing structure.
- Improved scalability: The optimized architecture allowed PLAID to scale their real-time data processing system efficiently, supporting the addition of new products and Atlas clusters without escalating costs.
- Simplified management: Because Stream Processing is a native MongoDB Atlas capability, it simplified connection management and pipeline configuration, reducing operational overhead.
- Stable operation: PLAID successfully deployed and operated more than 20 pipelines, processing over 3 million events per day to BigQuery.
- Enhanced real-time data capabilities: The improved system strengthened the real-time nature of their data, improving operational efficiency.
MongoDB Atlas Stream Processing provided PLAID with a robust and cost-effective solution for real-time data processing to BigQuery. By replacing costly Kafka source connectors and optimizing their architecture, PLAID significantly reduced costs and improved scalability. The seamless integration with MongoDB Atlas and the developer-friendly API further enhanced their operational efficiency. PLAID’s success with Atlas Stream Processing demonstrates that it is a valuable tool for organizations looking to streamline their data integration pipelines and leverage real-time data effectively.
To learn how Atlas Stream Processing helps organizations integrate MongoDB with Apache Kafka to build event-driven applications, see the MongoDB Atlas Stream Processing page.