MongoDB Atlas Best Practices: Part 1

Mat Keep
August 8, 2016 | Updated: February 15, 2017

Preparing for your MongoDB Deployment: Schema Design & Access Patterns

MongoDB Atlas radically simplifies the operation of MongoDB. As with any hosted database as a service there are still decisions you need to take to ensure the best performance and availability for your application. This blog series provides a number of recommendations that will serve as a solid foundation for getting the most out of the MongoDB Atlas service.

We’ll cover four main areas over this series of blog posts:

In this part 1 post, we’ll get started with preparing for your deployment, focusing specifically on schema design and application access patterns.
In part 2, we’ll discuss additional considerations as you prepare for your deployment, including indexing, data migration, and instance selection.
In part 3, we’ll dive into how you scale your MongoDB Atlas deployment, and achieve your required availability SLAs.
In the final part 4, we’ll wrap up with best practices for operational management and ensuring data security.

If you want to get a head start and learn about all of these topics now, just go ahead and download the MongoDB Atlas Best Practices guide.

So What is MongoDB Atlas?

MongoDB Atlas provides all of the features of MongoDB, without the operational heavy lifting required for any new application. MongoDB Atlas is available on-demand through a pay-as-you-go model and billed on an hourly basis, letting you focus on your code and your customers.

It’s easy to get started – use a simple GUI to select the appropriate instance size, geographic region, and features you need. MongoDB Atlas provides:

Security features to protect access to your data
Built in replication for always-on availability, tolerating complete data center failure
Backups and point in time recovery to protect against data corruption
Fine-grained monitoring to help you know when to scale. Additional instances can be provisioned with the push of a button
Automated patching and one-click upgrades for new major versions of the database, enabling you to take advantage of the latest and greatest MongoDB features
A choice of cloud providers, regions, and billing options

MongoDB Atlas is versatile. It’s great for everything from a quick Proof of Concept, to test/QA environments, to complete production clusters. If you decide you want to bring operations back under your control, it is easy to move your databases onto your own infrastructure and manage them using MongoDB Ops Manager or MongoDB Cloud Manager. The user experience across MongoDB Atlas, Cloud Manager, and Ops Manager is consistent, ensuring that disruption is minimal if you decide to migrate to your own infrastructure.

So now that you know what MongoDB Atlas is, let’s get started preparing for our deployment.

Schema Design

Developers and data architects should work together to develop the right data model, and they should invest time in this exercise early in the project. The requirements of the application should drive the data model, updates, and queries of your MongoDB system. Given MongoDB's dynamic schema, developers and data architects can continue to iterate on the data model throughout the development and deployment processes to optimize performance and storage efficiency, as well as support the addition of new application features. All of this can be done without expensive schema migrations.

Document Model

MongoDB stores data as documents in a binary representation called BSON. The BSON encoding extends the popular JSON representation to include additional types such as int, long, and date. BSON documents contain one or more fields, and each field contains a value of a specific data type, including arrays, sub-documents and binary data. It may be helpful to think of documents as roughly equivalent to rows in a relational database, and fields as roughly equivalent to columns. However, MongoDB documents tend to have all related data for a given record or object in a single document, whereas in a relational database that data is usually normalized across rows in many tables. For example, data that belongs to a parent-child relationship in two RDBMS tables can frequently be collapsed (embedded) into a single document in MongoDB. For operational applications, the document model makes JOINs redundant in many cases.

Where possible, store all data for a record in a single document. MongoDB provides ACID compliance at the document level. When data for a record is stored in a single document the entire record can be retrieved in a single seek operation, which is very efficient. In some cases it may not be practical to store all data in a single document, or it may negatively impact other operations. Make the trade-offs that are best for your application.

Rather than storing a large array of items in an indexed field, storing groups of values across multiple fields results in more efficient updates.

Collections

Collections are groupings of documents. Typically all documents in a collection have similar or related purposes for an application. It may be helpful to think of collections as being analogous to tables in a relational database.

Dynamic Schema & Document Validation

MongoDB documents can vary in structure. For example, documents that describe users might all contain the user id and the last date they logged into the system, but only some of these documents might contain the user's shipping address, and perhaps some of those contain multiple shipping addresses. MongoDB does not require that all documents conform to the same structure. Furthermore, there is no need to declare the structure of documents to the system – documents are self-describing.

DBAs and developers have the option to define Document Validation rules for a collection – enabling them to enforce checks on selected parts of a document's structure, data types, data ranges, and the presence of mandatory fields. As a result, DBAs can apply data governance standards, while developers maintain the benefits of a flexible document model. These are covered in the blog post Document Validation: Adding Just the Right Amount of Control Over Your Documents.

Indexes

MongoDB uses B-tree indexes to optimize queries. Indexes are defined on a collection’s document fields. MongoDB includes support for many indexes, including compound, geospatial, TTL, text search, sparse, partial, unique, and others. For more information see the section on indexing in the 2nd instalment of this blog series.

Transactions

Atomicity of updates may influence the schema for your application. MongoDB guarantees ACID compliant updates to data at the document level. It is not possible to update multiple documents in a single atomic operation, however the ability to embed related data into MongoDB documents eliminates this requirement in many cases. For use cases that do require multiple documents to be updated atomically, it is possible to implement Two Phase Commit logic in the application.

Visualizing your Schema: MongoDB Compass

The MongoDB Compass GUI allows users to understand the structure of existing data in the database and perform ad hoc queries against it – all with zero knowledge of MongoDB's query language. Typical users could include architects building a new MongoDB project or a DBA who has inherited a database from an engineering team, and who must now maintain it in production. You need to understand what kind of data is present, define what indexes might be appropriate, and identify if Document Validation rules should be added to enforce a consistent document structure.

*Figure 1: View schema & interactively build and execute database queries with MongoDB Compass*

Without MongoDB Compass, users wishing to understand the shape of their data would have to connect to the MongoDB shell and write queries to reverse engineer the document structure, field names, and data types. Similarly, anyone wanting to run custom queries on the data would need to understand MongoDB's query language.

MongoDB Compass can be used for free during development and it is also available for production use with MongoDB Professional or MongoDB Enterprise Advanced subscriptions.

Application Access Patterns

Schema design has a huge influence on database performance. How the application accesses the data can also have a major impact.

Searching on indexed attributes is typically the single most important pattern as it avoids collection scans. Taking it a step further, using covered queries avoids the need to access the collection data altogether. Covered queries return results from the indexes directly without accessing documents and are therefore very efficient. For a query to be covered, all the fields included in the query must be present in an index, and all the fields returned by the query must also be present in that index. To determine whether a query is a covered query, use the explain() method. If the explain() output displays true for the indexOnly field, the query is covered by an index, and MongoDB queries only that index to match the query and return the results.

Rather than retrieving the entire document in your application, updating fields, then saving the document back to the database, instead issue the update to specific fields. This has the advantage of less network usage and reduced database overhead.

Document Size

The maximum BSON document size in MongoDB is 16 MB. Users should avoid certain application patterns that would allow documents to grow unbounded. For example, in an e-commerce application it would be difficult to estimate how many reviews each product might receive from customers. Furthermore, it is typically the case that only a subset of reviews is displayed to a user, such as the most popular or the most recent reviews. Rather than modeling the product and customer reviews as a single document it would be better to model each review or groups of reviews as a separate document with a reference to the product document; while also storing the key reviews in the product document for fast access.

In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, instead make each record a document. For large media items, such as video or images, consider using GridFS, a convention implemented by all the drivers that automatically stores the binary data across many smaller documents

Field names are repeated across documents and consume space – RAM in particular. By using smaller field names your data will consume less space, which allows for a larger number of documents to fit in RAM.

Data Lifecycle Management

MongoDB provides features to facilitate the management of data lifecycles, including Time to Live indexes, and capped collections.

Time to Live (TTL)

If documents in a collection should only persist for a pre-defined period of time, the TTL feature can be used to automatically delete documents of a certain age rather than scheduling a process to check the age of all documents and run a series of deletes. For example, if user sessions should only exist for one hour, the TTL can be set to 3600 seconds for a date field called lastActivity that exists in documents used to track user sessions and their last interaction with the system. A background thread will automatically check all these documents and delete those that have been idle for more than 3600 seconds. Another example use case for TTL is a price quote that should automatically expire after a period of time.

Capped Collections

In some cases a rolling window of data should be maintained in the system based on data size. Capped collections are fixed-size collections that support high-throughput inserts and reads based on insertion order. A capped collection behaves like a circular buffer: data is inserted into the collection, that insertion order is preserved, and when the total size reaches the threshold of the capped collection, the oldest documents are deleted to make room for the newest documents. For example, store log information from a high-volume system in a capped collection to quickly retrieve the most recent log entries.

Dropping a Collection

It is very efficient to drop a collection in MongoDB. If your data lifecycle management requires periodically deleting large volumes of documents, it may be best to model those documents as a single collection. Dropping a collection is much more efficient than removing all documents or a large subset of a collection, just as dropping a table is more efficient than deleting all the rows in a table in a relational database.

Disk space is automatically reclaimed after a collection is dropped.

Next Steps

That’s a wrap for part 1 of the MongoDB Atlas best practices blog series. In Part 2, we’ll continue along the path of preparing for our first deployment by discussing indexing and data migration.

Download MongoDB Atlas Best Practice Guide

← Previous

Develop & Deploy a Node.js App to AWS Elastic Beanstalk & MongoDB Atlas

This post is part of our Road to re:Invent series . In the weeks leading up to AWS re:Invent in Las Vegas this November, we'll be posting about a number of topics related to running MongoDB in the public cloud. This post provides an introduction to Amazon Kinesis: its architecture, what it provides, and how it's typically used. It goes on to step through how to implement an application where data is ingested by Amazon Kinesis before being processed and then stored in MongoDB Atlas. This is part of a series of posts which examine how to use MongoDB Atlas with a number of complementary technologies and frameworks . Introduction to Amazon Kinesis The role of Amazon Kinesis is to get large volumes of streaming data into AWS where it can then be processed, analyzed, and moved between AWS services. The service is designed to ingest and store terabytes of data every hour, from multiple sources. Kinesis provides high availability, including synchronous replication within an AWS region. It also transparently handles scalability, adding and removing resources as needed. Once the data is inside AWS, it can be processed or analyzed immediately, as well as being stored using other AWS services (such as S3) for later use. By storing the data in MongoDB, it can be used both to drive real-time, operational decisions as well as for deeper analysis. As the number, variety, and velocity of data sources grow, new architectures and technologies are needed. Technologies like Amazon Kinesis and Apache Kafka are focused on ingesting the massive flow of data from multiple fire hoses and then routing it to the systems that need it – optionally filtering, aggregating, and analyzing en-route. Figure 1: AWS Kinesis Architecture Typical data sources include: IoT assets and devices(e.g., sensor readings) On-line purchases from an ecommerce store Log files Video game activity Social media posts Financial market data feeds Rather than leave this data to fester in text files, Kinesis can ingest the data, allowing it to be processed to find patterns, detect exceptions, drive operational actions, and provide aggregations to be displayed through dashboards. There are actually 3 services which make up Amazon Kinesis: Amazon Kinesis Firehose is the simplest way to load massive volumes of streaming data into AWS. The capacity of your Firehose is adjusted automatically to keep pace with the stream throughput. It can optionally compress and encrypt the data before it's stored. Amazon Kinesis Streams are similar to the Firehose service but give you more control, allowing for: Multi-stage processing Custom stream partitioning rules Reliable storage of the stream data until it has been processed. Amazon Kinesis Analytics is the simplest way to process the data once it has been ingested by either Kinesis Firehose or Streams. The user provides SQL queries which are then applied to analyze the data; the results can then be displayed, stored, or sent to another Kinesis stream for further processing. This post focuses on Amazon Kinesis Streams, in particular, implementing a consumer that ingests the data, enriches it, and then stores it in MongoDB. Accessing Kinesis Streams – the Libraries There are multiple ways to read (consume) and write (produce) data with Kinesis Streams: Amazon Kinesis Streams API Provides two APIs for putting data into an Amazon Kinesis stream: PutRecord and PutRecords Amazon Kinesis Producer Library (KPL) Easy to use and highly configurable Java library that helps you put data into an Amazon Kinesis stream. Amazon Kinesis Producer Library (KPL) presents a simple, asynchronous, high throughput, and reliable interface. Amazon Kinesis Agent The agent continuously monitors a set of files and sends new entries to your Stream or Firehose. Amazon Kinesis Client Library (KCL) A Java library that helps you easily build Amazon Kinesis Applications for reading and processing data from an Amazon Kinesis stream. KCL handles issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, providing fault-tolerance, and processing data. Amazon Kinesis Client Library MultiLangDemon The MultiLangDemon is used as a proxy by non-Java applications to use the Kinesis Client Library. Amazon Kinesis Connector Library A library that helps you easily integrate Amazon Kinesis with other AWS services and third-party tools. Amazon Kinesis Storm Spout A library that helps you easily integrate Amazon Kinesis Streams with Apache Storm. The example application in this post use the Kinesis Agent and the Kinesis Client Library MultiLangDemon (with Node.js). Role of MongoDB Atlas MongoDB is a distributed database delivering a flexible schema for rapid application development, rich queries, idiomatic drivers, and built in redundancy and scale-out. This makes it the go-to database for anyone looking to build modern applications. MongoDB Atlas is a hosted database service for MongoDB. It provides all of the features of MongoDB, without the operational heavy lifting required for any new application. MongoDB Atlas is available on demand through a pay-as-you-go model and billed on an hourly basis, letting you focus on what you do best. It’s easy to get started – use a simple GUI to select the instance size, region, and features you need. MongoDB Atlas provides: Security features to protect access to your data Built in replication for always-on availability, tolerating complete data center failure Backups and point in time recovery to protect against data corruption Fine-grained monitoring to let you know when to scale. Additional instances can be provisioned with the push of a button Automated patching and one-click upgrades for new major versions of the database, enabling you to take advantage of the latest and greatest MongoDB features A choice of regions and billing options Like Amazon Kinesis, MongoDB Atlas is a natural fit for users looking to simplify their development and operations work, letting them focus on what makes their application unique rather than commodity (albeit essential) plumbing. Also like Kinesis, you only pay for MongoDB Atlas when you're using it with no upfront costs and no charges after you terminate your cluster. Example Application The rest of this post focuses on building a system to process log data. There are 2 sources of log data: A simple client that acts as a Kinesis Streams producer , generating sensor readings and writing them to a stream Amazon Kinesis Agent monitoring a SYSLOG file and sending each log event to a stream In both cases, the data is consumed from the stream using the same consumer , which adds some metadata to each entry and then stores it in MongoDB Atlas. Create Kinesis IAM Policy in AWS From the IAM section of the AWS console use the wizard to create a new policy. The policy should grant permission to perform specific actions on a particular stream (in this case "ClusterDBStream") and the results should look similar to this: { "Version": "2012-10-17", "Statement": [ { "Sid": "Stmt1476360711000", "Effect": "Allow", "Action": [ "kinesis:DescribeStream", "kinesis:GetShardIterator", "kinesis:GetRecords", "kinesis:PutRecord", "kinesis:PutRecords", "kinesis:CreateStream" ], "Resource": [ "arn:aws:kinesis:eu-west-1:658153047537:stream/ClusterDBStream" ] }, { "Sid": "Stmt1476360824000", "Effect": "Allow", "Action": [ "dynamodb:CreateTable", "dynamodb:DeleteItem", "dynamodb:DescribeTable", "dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Scan", "dynamodb:UpdateItem" ], "Resource": [ "arn:aws:dynamodb:eu-west-1:658153047537:table/ClusterDBStream" ] }, { "Sid": "Stmt1476360951000", "Effect": "Allow", "Action": [ "cloudwatch:PutMetricData" ], "Resource": [ "*" ] } ] } Next, create a new user and associate it with the new policy. Important : Take a note of the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY . Create MongoDB Atlas Cluster Register with MongoDB Atlas and use the simple GUI to select the instance size, region, and features you need (Figure 2). Figure 2: Create MongoDB Atlas Cluster Create a user with read and write privileges for just the database that will be used for your application, as shown in Figure 3. Figure 3: Creating an Application user in MongoDB Atlas You must also add the IP address of your application server to the IP Whitelist in the MongoDB Atlas security tab (Figure 4). Note that if multiple application servers will be accessing MongoDB Atlas then an IP address range can be specified in CIDR format (IP Address/number of significant bits). Figure 4: Add App Server IP Address(es) to MongoDB Atlas If your application server(s) are running in AWS, then an alternative to IP Whitelisting is to configure a VPC (Virtual Private Cloud) Peering relationship between your MongoDB Atlas group and the VPC containing your AWS resources . This removes the requirement to add and remove IP addresses as AWS reschedules functions, and is especially useful when using highly dynamic services such as AWS Lambda. Click the "Connect" button and make a note of the URI that should be used when connecting to the database (note that you will substitute the user name and password with ones that you've just created). App Part 1 – Kinesis/Atlas Consumer The code and configuration files in Parts 1 & 2 are based on the sample consumer and producer included with the client library for Node.js (MultiLangDaemon) . Install the Node.js client library : git clone https://github.com/awslabs/amazon-kinesis-client-nodejs.git cd amazon-kinesis-client-nodejs npm install Install the MongoDB Node.js Driver : npm install --save mongodb Move to the consumer sample folder: cd samples/basic_sample/consumer/ Create a configuration file ("logging_consumer.properties"), taking care to set the correct stream and application names and AWS region: # The script that abides by the multi-language protocol. This script will # be executed by the MultiLangDaemon, which will communicate with this script # over STDIN and STDOUT according to the multi-language protocol. executableName = node logging_consumer_app.js The name of an Amazon Kinesis stream to process. streamName = ClusterDBStream Used by the KCL as the name of this application. Will be used as the name of an Amazon DynamoDB table which will store the lease and checkpoint information for workers with this application name applicationName = ClusterDBStream Users can change the credentials provider the KCL will use to retrieve credentials. The DefaultAWSCredentialsProviderChain checks several other providers, which is described here: http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html AWSCredentialsProvider = DefaultAWSCredentialsProviderChain Appended to the user agent of the KCL. Does not impact the functionality of the KCL in any other way. processingLanguage = nodejs/0.10 Valid options at TRIM_HORIZON or LATEST. See http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html#API_GetShardIterator_RequestSyntax initialPositionInStream = TRIM_HORIZON The following properties are also available for configuring the KCL Worker that is created by the MultiLangDaemon. The KCL defaults to us-east-1 regionName = eu-west-1 The code for working with MongoDB can be abstracted to a helper file ("db.js"): var MongoClient = require('mongodb').MongoClient; var assert = require('assert'); var logger = require('../../util/logger'); var util = require('util'); function DB() { this.db = "empty"; this.log = logger().getLogger('mongoMange-DB'); } DB.prototype.connect = function(uri, callback) { this.log.info(util.format('About to connect to DB')); if (this.db != "empty") { callback(); this.log.info('Already connected to database.'); } else { var _this = this; MongoClient.connect(uri, function(err, database) { if (err) { _this.log.info(util.format('Error connecting to DB: %s', err.message)); callback(err); } else { _this.db = database; _this.log.info(util.format('Connected to database.')); callback(); } }) } } DB.prototype.close = function(callback) { log.info('Closing database'); this.db.close(); this.log.info('Closed database'); callback(); } DB.prototype.addDocument = function(coll, doc, callback) { var collection = this.db.collection(coll); var _this = this; collection.insertOne(doc, function(err, result) { if (err) { _this.log.info(util.format('Error inserting document: %s', err.message)); callback(err.message); } else { _this.log.info(util.format('Inserted document into %s collection.', coll)); callback(); } }); }; module.exports = DB; Create the application Node.js file ("logging_consumer_app.js"), making sure to replace the database user and host details in "mongodbConnectString" with your own: 'use strict'; var fs = require('fs'); var path = require('path'); var util = require('util'); var kcl = require('../../..'); var logger = require('../../util/logger'); var DB = require('./DB.js') var mongodbConnectString = 'mongodb://kinesis-user:??????@cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/clusterdb?ssl=true&replicaSet=Cluster0-shard-0&authSource=admin' var mongodbCollection = 'logdata' var database = new DB; function recordProcessor() { var log = logger().getLogger('recordProcessor'); var shardId; return { initialize: function(initializeInput, completeCallback) { shardId = initializeInput.shardId; // WARNING – the connection string may contain the password and so consider removing logging for any production system log.info(util.format('About to connect to %s.', mongodbConnectString)); database.connect(mongodbConnectString, function(err) { log.info(util.format('Back from connecting to %s', mongodbConnectString)); if (err) { log.info(util.format('Back from connecting to %s', mongodbConnectString)); } completeCallback(); }) }, processRecords: function(processRecordsInput, completeCallback) { log.info(util.format('In processRecords', mongodbConnectString)); if (!processRecordsInput || !processRecordsInput.records) { completeCallback(); return; } var records = processRecordsInput.records; var record, data, sequenceNumber, partitionKey, objectToStore; for (var i = 0 ; i < records.length ; ++i) { record = records[i]; data = new Buffer(record.data, 'base64').toString(); sequenceNumber = record.sequenceNumber; partitionKey = record.partitionKey; log.info(util.format('ShardID: %s, Record: %s, SeqenceNumber: %s, PartitionKey:%s', shardId, data, sequenceNumber, partitionKey)); objectToStore = {}; try { objectToStore = JSON.parse(data); } catch(err) { // Looks like it wasn't JSON so store the raw string objectToStore.payload = data; } objectToStore.metaData = {}; objectToStore.metaData.mongoLabel = "Added by MongoMange"; objectToStore.metaData.timeAdded = new Date(); database.addDocument(mongodbCollection, objectToStore, function(err) {}) } if (!sequenceNumber) { completeCallback(); return; } // If checkpointing, completeCallback should only be called once checkpoint is complete. processRecordsInput.checkpointer.checkpoint(sequenceNumber, function(err, sequenceNumber) { log.info(util.format('Checkpoint successful. ShardID: %s, SeqenceNumber: %s', shardId, sequenceNumber)); completeCallback(); }); }, shutdown: function(shutdownInput, completeCallback) { // Checkpoint should only be performed when shutdown reason is TERMINATE. if (shutdownInput.reason !== 'TERMINATE') { completeCallback(); return; } // Whenever checkpointing, completeCallback should only be invoked once checkpoint is complete. database.close(function(){ shutdownInput.checkpointer.checkpoint(function(err) { completeCallback(); }); }); } }; } kcl(recordProcessor()).run(); Note that this code adds some metadata to the received object before writing it to MongoDB. At this point, it is also possible to filter objects based on any of their fields. Note also that this Node.js code logs a lot of information to the "application log" file (including the database password!); this is for debugging and would be removed from a real application. The simplest way to have the application use the user credentials (noted when creating the user in AWS IAM) is to export them from the shell where the application will be launched: export AWS_ACCESS_KEY_ID=???????????????????? export AWS_SECRET_ACCESS_KEY=???????????????????????????????????????? Finally, launch the consumer application: ../../../bin/kcl-bootstrap --java /usr/bin/java -e -p ./logging_consumer.properties Check the "application.log" file for any errors. App Part 2 – Kinesis Producer As for the consumer, export the credentials for the user created in AWS IAM: cd amazon-kinesis-client-nodejs/samples/basic_sample/producer export AWS_ACCESS_KEY_ID=???????????????????? export AWS_SECRET_ACCESS_KEY=???????????????????????????????????????? Create the configuration file ("config.js") and ensure that the correct AWS region and stream are specified: 'use strict'; var config = module.exports = { kinesis : { region : 'eu-west-1' }, loggingProducer : { stream : 'ClusterDBStream', shards : 2, waitBetweenDescribeCallsInSeconds : 5 } }; Create the producer code ("logging_producer.js"): 'use strict'; var util = require('util'); var logger = require('../../util/logger'); function loggingProducer(kinesis, config) { var log = logger().getLogger('loggingProducer'); function _createStreamIfNotCreated(callback) { var params = { ShardCount : config.shards, StreamName : config.stream }; kinesis.createStream(params, function(err, data) { if (err) { if (err.code !== 'ResourceInUseException') { callback(err); return; } else { log.info(util.format('%s stream is already created. Re-using it.', config.stream)); } } else { log.info(util.format("%s stream doesn't exist. Created a new stream with that name ..", config.stream)); } // Poll to make sure stream is in ACTIVE state before start pushing data. _waitForStreamToBecomeActive(callback); }); } function _waitForStreamToBecomeActive(callback) { kinesis.describeStream({StreamName : config.stream}, function(err, data) { if (!err) { log.info(util.format('Current status of the stream is %s.', data.StreamDescription.StreamStatus)); if (data.StreamDescription.StreamStatus === 'ACTIVE') { callback(null); } else { setTimeout(function() { _waitForStreamToBecomeActive(callback); }, 1000 * config.waitBetweenDescribeCallsInSeconds); } } }); } function _writeToKinesis() { var currTime = new Date().getMilliseconds(); var sensor = 'sensor-' + Math.floor(Math.random() * 100000); var reading = Math.floor(Math.random() * 1000000); var record = JSON.stringify({ program: "logging_producer", time : currTime, sensor : sensor, reading : reading }); var recordParams = { Data : record, PartitionKey : sensor, StreamName : config.stream }; kinesis.putRecord(recordParams, function(err, data) { if (err) { log.error(err); } else { log.info('Successfully sent data to Kinesis.'); } }); } return { run: function() { _createStreamIfNotCreated(function(err) { if (err) { log.error(util.format('Error creating stream: %s', err)); return; } var count = 0; while (count < 10) { setTimeout(_writeToKinesis(), 1000); count++; } }); } }; } module.exports = loggingProducer; The producer is launched from "logging_producer_app.js": 'use strict'; var AWS = require('aws-sdk'); var config = require('./config'); var producer = require('./logging_producer'); var kinesis = new AWS.Kinesis({region : config.kinesis.region}); producer(kinesis, config.loggingProducer).run(); Run the producer: node logging_producer_app.js Check the consumer and producer "application.log" files for errors. At this point, data should have been written to MongoDB Atlas. Using the connection string provided after clicking the "Connect" button in MongoDB Atlas, connect to the database and confirm that the documents have been added: mongo "mongodb://cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/admin?replicaSet=Cluster0-shard-0" --ssl --username kinesis-user --password ?????? use clusterdb db.logdata.find() { "_id" : ObjectId("5804d1d0aa1f330731204597"), "program" : "logging_producer", "time" : 702, "sensor" : "sensor-81057", "reading" : 639075, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.142Z") } } { "_id" : ObjectId("5804d1d0aa1f330731204598"), "program" : "logging_producer", "time" : 695, "sensor" : "sensor-805", "reading" : 981144, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.142Z") } } { "_id" : ObjectId("5804d1d0aa1f330731204599"), "program" : "logging_producer", "time" : 699, "sensor" : "sensor-2581", "reading" : 752020, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.143Z") } } { "_id" : ObjectId("5804d1d0aa1f33073120459a"), "program" : "logging_producer", "time" : 700, "sensor" : "sensor-56194", "reading" : 455700, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.144Z") } } { "_id" : ObjectId("5804d1d0aa1f33073120459b"), "program" : "logging_producer", "time" : 706, "sensor" : "sensor-32956", "reading" : 113233, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.144Z") } } { "_id" : ObjectId("5804d1d0aa1f33073120459c"), "program" : "logging_producer", "time" : 707, "sensor" : "sensor-96487", "reading" : 179047, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.144Z") } } { "_id" : ObjectId("5804d1d0aa1f33073120459d"), "program" : "logging_producer", "time" : 697, "sensor" : "sensor-37595", "reading" : 935647, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.144Z") } } { "_id" : ObjectId("5804d1d15f0fbb074446ad6d"), "program" : "logging_producer", "time" : 704, "sensor" : "sensor-92660", "reading" : 756624, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:45.263Z") } } { "_id" : ObjectId("5804d1d15f0fbb074446ad6e"), "program" : "logging_producer", "time" : 701, "sensor" : "sensor-95222", "reading" : 850749, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:45.263Z") } } { "_id" : ObjectId("5804d1d15f0fbb074446ad6f"), "program" : "logging_producer", "time" : 704, "sensor" : "sensor-1790", "reading" : 271359, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:45.266Z") } } App Part 3 – Capturing Live Logs Using Amazon Kinesis Agent Using the same consumer, the next step is to stream real log data. Fortunately, this doesn't require any additional code as the Kinesis Agent can be used to monitor files and add every new entry to a Kinesis Stream (or Firehose). Install the Kinesis Agent: sudo yum install –y aws-kinesis-agent and edit the configuration file to use the correct AWS region, user credentials, and stream in "/etc/aws-kinesis/agent.json": { "cloudwatch.emitMetrics": true, "kinesis.endpoint": "kinesis.eu-west-1.amazonaws.com", "cloudwatch.endpoint": "monitoring.eu-west-1.amazonaws.com", "awsAccessKeyId": "????????????????????", "awsSecretAccessKey": "????????????????????????????????????????", "flows": [ { "filePattern": "/var/log/messages*", "kinesisStream": "ClusterDBStream", "dataProcessingOptions": [{ "optionName": "LOGTOJSON", "logFormat": "SYSLOG" }] } ] } "/var/log/messages" is a SYSLOG file and so a "dataProcessingOptions" field is included in the configuration to automatically convert each log into a JSON document before writing it to the Kinesis Stream. The agent will not run as root and so the permissions for "/var/log/messages" need to be made more permissive: sudo chmod og+r /var/log/messages The agent can now be started: sudo service aws-kinesis-agent start Monitor the agent's log file to see what it's doing: sudo tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log If there aren't enough logs being generated on the machine then extra ones can be injected manually for testing: logger -i This is a test log This will create a log with the "program" field set to your username (in this case, "ec2-user"). Check that the logs get added to MongoDB Atlas: mongo "mongodb://cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/admin?replicaSet=Cluster0-shard-0" --ssl --username kinesis-user --password ?????? use clusterdb db.logdata.findOne({program: "ec2-user"}) { "_id" : ObjectId("5804c9ed5f0fbb074446ad5f"), "timestamp" : "Oct 17 12:53:48", "hostname" : "ip-172-31-40-154", "program" : "ec2-user", "processid" : "6377", "message" : "This is a test log", "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T12:54:05.456Z") } } Checking the Data with MongoDB Compass To visually navigate through the MongoDB schema and data, download and install MongoDB Compass . Use your MongoDB Atlas credentials to connect Compass to your MongoDB database (the hostname should refer to the primary node in your replica set or a "mongos" process if your MongoDB cluster is sharded). Navigate through the structure of the data in the "clusterdb" database (Figure 5) and view the JSON documents. Figure 5: Explore Schema Using MongoDB Compass Clicking on a value builds a query and then clicking "Apply" filters the results (Figure 6). Figure 6: View Filtered Documents in MongoDB Compass Add Document Validation Rules One of MongoDB’s primary attractions for developers is that it gives them the ability to start application development without first needing to define a formal schema. Operations teams appreciate the fact that they don’t need to perform a time-consuming schema upgrade operation every time the developers need to store a different attribute. This is well suited to the application built in this post as logs from different sources are likely to include different attributes. There are however some attributes that we always expect to be there (e.g., the metadata that the application is adding). For applications reading the documents from this collection to be able to rely on those fields being present, the documents should be validated before they are written to the database. Prior to MongoDB 3.2, those checks had to be implemented in the application but they can now be performed by the database itself. Executing a single command from the "mongo" shell adds the document checks: db.runCommand({ collMod: "logdata", validator: { $and: [ {program: {$type: "string"}}, {"metaData.mongoLabel": {$type: "string"}}, {"metaData.timeAdded": {$type: "date"}} ]}}) The above command adds multiple checks: The "program" field exists and contains a string There's a sub-document called "metadata" containing at least 2 fields: "mongoLabel" which must be a string "timeAdded" which must be a date Test that the rules are correctly applied when attempting to write to the database: db.logdata.insert( { "program" : "dummy_entry", "time" : 666, "sensor" : "sensor-6666", "reading" : 66666, "metaData" : { "mongoLabel" : "Test Data", "timeAdded" : ISODate("2016-10-17T13:27:44.142Z") } }) WriteResult({ "nInserted" : 1 }) db.logdata.insert( { "program" : "dummy_entry", "time" : 666, "sensor" : "sensor-6666", "reading" : 66666, "metaData" : { "mongoLabel" : "Test Data", "timeAdded" : "Just now" } }) WriteResult({ "nInserted" : 0, "writeError" : { "code" : 121, "errmsg" : "Document failed validation" } }) Cleaning Up (IMPORTANT!) Remember that you will continue to be charged for the services even when you're no longer actively using them. If you no longer need to use the services then clean up: From the MongoDB Atlas GUI, select your Cluster, click on the ellipses and select "Terminate". From the AWS management console select the Kinesis service, then Kinesis Streams, and then delete your stream. From the AWS management console select the DynamoDB service, then tables, and then delete your table. Using MongoDB Atlas with Other Frameworks and Services We have detailed walkthroughs for using MongoDB Atlas with several programming languages and frameworks, as well as generic instructions that can be used with others. They can all be found in Using MongoDB Atlas From Your Favorite Language or Framework . MongoDB Atlas, the cloud database service for MongoDB, is the easiest way to deploy and run MongoDB, allowing you to get started in minutes. Click here to learn more. The MongoDB team will be at AWS re:Invent this November in Las Vegas and our CTO Eliot Horowitz will be speaking Thursday (12/1) at 11 am PST. If you’re attending re:Invent, be sure to attend the session & visit us at booth #1344! Learn more about AWS re:Invent

August 8, 2016

Next →

Building Gen AI with MongoDB & AI Partners | February 2025

February was big for MongoDB—and, more importantly, for anyone looking to build AI applications that deliver highly accurate, relevant information (in other words, for everyone building AI apps). MongoDB announced the acquisition of Voyage AI , a pioneer in state-of-the-art embedding and reranking models that power next-generation AI applications. Because generative AI is by nature probabilistic, models can “hallucinate”, and generate false or misleading information. This can lead to serious risks, especially in cases or industries (e.g., financial services) where accurate information is paramount. To address this, organizations building AI apps need high-quality retrieval; they need to trust that the most relevant information is extracted from their data with precision. Voyage AI’s advanced embedding and reranking models enable applications to extract meaning from highly specialized and domain-specific text and unstructured data. With roots at Stanford and MIT, Voyage AI’s world-class team is trusted by AI innovators like Anthropic, LangChain, Harvey, and Replit. Integrating Voyage AI’s technology with MongoDB will enable organizations to easily build trustworthy, AI-powered applications by offering highly accurate and relevant information retrieval deeply integrated with operational data. For more, check out MongoDB CEO Dev Ittycheria’s blog post about Voyage AI , and what this means for developers and businesses (in short, delivering high-quality results at scale). Onward! P.S. If you’re in Vegas for HumanX this week, stop by booth 412 to say hi to MongoDB! Welcoming new AI and tech partners The Voyage AI news was hardly the only exciting development last month. In February 2025, MongoDB welcomed three new AI and tech partners that offer product integrations with MongoDB. Read on to learn more about each great new partner! CopilotKit Seattle-based CopilotKit provides open source infrastructure for in-app AI copilots. CopilotKit helps organizations build production-ready copilots and agents effortlessly. “We’re excited to be partnering with MongoDB to help companies build best-in-class copilots that leverage RAG & take action based on internal data,” said Uli Barkai, Co-Founder and Chief Marketing Officer at CopilotKit. “MongoDB made it dead simple to build a scalable vector database with operational data. This collaboration enables developers to easily ship production-grade RAG applications.” Varonis Varonis is the leader in data security, protecting data wherever it lives—across SaaS, IaaS, and hybrid cloud environments. Varonis’ cloud-native Data Security Platform continuously discovers and classifies critical data, removes exposures, and detects advanced threats with AI-powered automation. “Varonis’s mission is to protect data wherever it lives,” said David Bass, Executive Vice President of Engineering and Chief Technology Officer at Varonis. “We are thrilled to further advance our mission by offering AI-powered data security and compliance for MongoDB, the database of choice for high-performance application and AI development. With this integration, joint customers can automatically discover and classify sensitive data, detect abnormal activities, secure AI data pipelines, and prevent data leaks.” Xlrt Xlrt is an automated insight-generation platform that enables financial institutions to create innovative financial credit products at scale by simplifying the financial spreading process. “We are excited to partner with MongoDB Atlas to transform AI-driven financial workflows,” said Rupesh Chaudhuri, Chief Operating Officer and Co-Founder of Xlrt. “XLRT.ai leverages agentic AI, combining graph-based contextualization, vector search, and LLMs to redefine data-driven decision-making. With MongoDB's robust NoSQL and vector search capabilities, we’re delivering unparalleled efficiency, accuracy, and scalability in automating financial processes.” To learn more about building AI-powered apps with MongoDB, check out our AI Learning Hub and stop by our Partner Ecosystem Catalog to read about our integrations with MongoDB’s ever-evolving AI partner ecosystem. And visit the MongoDB AI Applications Program (MAAP) page to learn how MongoDB and the MAAP ecosystem helps organizations build applications with advanced AI capabilities.

March 12, 2025