O custo de não conhecer o MongoDB

Artur Costa23 min read • Published Nov 11, 2024 • Updated Nov 11, 2024

Node.js Esquema TypeScript MongoDB

Classifique este artigo

The primary focus of this series is to show how much performance you can gain, and as a consequence, the cost you can save when using MongoDB properly, following the best practices, studying your application needs, and using it to model your data.

To show these possible gains, a dummy application will be presented, and many possible implementations of it using MongoDB will be developed and load-tested. There will be implementations for all levels of MongoDB knowledge: beginner, intermediate, senior, and mind-blowing (🤯) .

All the code and some extra information used through this article can be found in the Github repository.

The application: finding fraudulent behavior in transactions

The application goal is to identify fraudulent behavior in a financial transaction system by analyzing the transactions' statuses over a time period for a determined user. The possible transaction statuses are approved, noFunds, pending, e rejected. Each user is uniquely identifiable by a 64-character hexadecimal key value.

The application will receive the transaction status through an event document. An event will always provide information for one transaction for one user on a specific day, and because of that, it will always have only one of the possible status fields, and this status field will have the numeric value 1. As an example, the following event document represents a transaction with the status of pending for the user with identification key of ...0001 that happened on the date/day 2022-02-01:

1 const event = {  
2 	key: '0000000000000000000000000000000000000000000000000000000000000001',
3 	date: new Date('2022-02-01'),
4 	pending: 1,
5 };

The statuses of the transactions will be analyzed by comparing the totals of statuses in the last oneYear, threeYears,fiveYears, sevenYears, e tenYears for any user. These totals will be provided in a reports document, which can be requested by providing the user key and the end date of the report.

The following document is an example of a reports document for the user of key ...0001 and an end date of 2022-06-15:

1 export const reports = [  
2 	{    
3 		id: 'oneYear',    
4 		end: new Date('2022-06-15T00:00:00.000Z'),    
5 		start: new Date('2021-06-15T00:00:00.000Z'),    
6 		totals: { approved: 4, noFunds: 1, pending: 1, rejected: 1 },
7 	},  
8 	{    
9 		id: 'threeYears',    
10 		end: new Date('2022-06-15T00:00:00.000Z'),    
11 		start: new Date('2019-06-15T00:00:00.000Z'),    
12 		totals: { approved: 8, noFunds: 2, pending: 2, rejected: 2 },
13 	},  
14 	{    
15 		id: 'fiveYears',    
16 		end: new Date('2022-06-15T00:00:00.000Z'),    
17 		start: new Date('2017-06-15T00:00:00.000Z'),    
18 		totals: { approved: 12, noFunds: 3, pending: 3, rejected: 3 },
19 	},  
20 	{    
21 		id: 'sevenYears',    
22 		end: new Date('2022-06-15T00:00:00.000Z'),    
23 		start: new Date('2015-06-15T00:00:00.000Z'),    
24 		totals: { approved: 16, noFunds: 4, pending: 4, rejected: 4 },
25 	},  
26 	{    
27 		id: 'tenYears',    
28 		end: new Date('2022-06-15T00:00:00.000Z'),    
29 		start: new Date('2012-06-15T00:00:00.000Z'),    
30 		totals: { approved: 20, noFunds: 5, pending: 5, rejected: 5 },
31 	},
32 ];

The load test

Two functions for each application version were created to be executed simultaneously and load-test each application version's performance. One function is called Bulk Upsert, which inserts the event documents. The other is called Get Reports, which generates the reports for a specific user key and date. The parallelization of the execution of each function was made using worker threads, with 20 workers allocated to each function. The test's duration for each application version is 200 minutes, with different execution parameters being used through the load test.

The Bulk Upsert function will receive batches of 250 event documents to be registered. As the name implies, these registrations will be made using MongoDB's upsertfunctionality. It will try to update a document and, if it doesn't exist, create a new one with the data available in the update operation. Each Bulk Upsert iteration will be timed and registered in a secondary database. The rate of batch processing will be divided into four phases, each with 50 minutes, totaling 200 minutes. The rate will start with one batch insert per second and will be increased by one every 50 minutes, ending with four batch inserts per second, or 1000 event documents per second.

The Get Reports function will generate one reports document per execution. Each reports execution will be timed and registered in the secondary database. The rate of generating reports will be divided into 40 phases, 10 phases for eachBulk Upsert phase. In each phase of Bulk Upsert, the rate will start with 25 report requests per second and increase by 25 report requests per second every five minutes, ending with 250 complete reports per second in the same Bulk Upsert phase.

The following graph depicts the rates of Bulk Upsert and Get Reports for the test scenario presented above:

Initial scenario and data generator

To make a fair comparison between the application versions, the initial scenario/working set used in the tests had to be greater than the memory of the machine running the MongoDB server, forcing cache activity and avoiding the situation where all the working set would fit in the cache. To accomplish that, the following parameters were chosen:

10 years of data, from 2010-01-01 to 2020-01-01
50 million events per year, totaling 500 million for the working set
60 events per user/key per year

Considering the number of events per year and the number of events per user per year, the total number of users is 50.000.000/60=833.333. The user key generator was tuned to produce keys approaching a normal/gaussian distribution to simulate a real-world scenario where some users will have more events than others. The following graph shows the distribution of 50 million keys generated by the key generator.

To also approach a real-world scenario, the distribution of the event statuses is:

80% approved.
10% noFunds.
7.5% pending.
2.5% rejected.

The instances configuration

The EC2 instance running the MongoDB server is a c7a.large on the AWS cloud. It has 2vCPU and 4GB of memory. Two disks were attached to it: one for the operating system with 15GB of size and GP3 type, and the other for the MongoDB server, which stores its data with 300GB of size, IO2 type, and 10.000IOPS. The operating system installed on the instance is Ubuntu 22.04, with all the updates and upgrades available at the time. All the recommended production notes were applied to the machine to allow MongoDB to extract the maximum performance of the available hardware.

The EC2 instance running the application server is a c6a.xlarge on the AWS cloud. It has 4vCPU and 8GB of memory. Two disks were attached to it: one for the operating system with 10GB of size and GP3 type, and the other for the secondary MongoDB server, which stores its data with 10GB of size and GP3 type. The operating system installed on the instance is Ubuntu 22.04, with all the updates and upgrades available at the time. All the recommended production notes were applied to the machine to allow MongoDB to extract the maximum performance of the available hardware.

Application Version 1 (appV1)

The first application version and the base case for our comparison would have been developed by someone with a junior knowledge level of MongoDB who just took a quick look at the documentation and learned that every document in a collection must have an _id field and this field is always unique indexed.

To take advantage of the _id obligatory field and index, the developer decides to store the values of key and date in an embedded document in the _id field. With that, each document will register the status totals for one user, specified by the field _id.key, in one day, specified by the field _id.date.

Esquema

The application implementation presented above would have the following TypeScript document schema denominated ScemaV1:

1 type SchemaV1 = {  
2 	_id: {    
3 		key: string;    
4 		date: Date;  
5 	};  
6 	approved?: number;  
7 	noFunds?: number;  
8 	pending?: number;  
9 	rejected?: number;
10 };

Bulk upsert

Based on the specification presented, we have the following bulk updateOne operation for each event generated by the application:

1 const operation = {  
2 	updateOne: {    
3 		filter: {      
4 			_id: { date: event.date, key: event.key },    
5 		},    
6 		update: {      
7 			$inc: {        
8 				approved: event.approved,        
9 				noFunds: event.noFunds,        
10 				pending: event.pending,        
11 				rejected: event.rejected,      
12 			},    
13 		},    
14 		upsert: true,  
15 	},
16 };

Get reports

Five aggregation pipelines, one for each date interval, will be needed to fulfill the Get Reports operation. Each date interval will have the following pipeline, with just the _id.date range in the $match filter being different:

1 const pipeline = [
2   {  
3     $match: {  
4       '_id.key': request.key,  
5       '_id.date': { $gte: Date.now() - oneYear, $lt: Date.now() },  
6     },  
7   },  
8   {  
9     $group: {  
10       _id: null,  
11       approved: { $sum: '$approved' },  
12       noFunds: { $sum: '$noFunds' },  
13       pending: { $sum: '$pending' },  
14       rejected: { $sum: '$rejected' },  
15     },  
16   },  
17 ];

Indexes

As presented in the introduction of this application implementation, the main goal of embedding the fields key and date in the _id field was to take advantage of its obligatory existence and index. But, after some preliminary testing and research, it was discovered that the index on the _id field wouldn't support the filtering/match criteria in the Get Reports function. With that, the following extra index was created:

1 const keys = { '_id.key': 1, '_id.date': 1 };  
2 const options = { unique: true };  
3   
4 db.appV1.createIndex(keys, options);

For those wondering why we need an extra index in the fields of the embedded document in the _id field, which is already indexed by default, a detailed explanation can be found in Index on embedded documents.

Initial scenario stats

Inserting the 500 million event documents for the initial scenario in the collection appV1 with the schema and Bulk Upsert function presented above, we have the following collection stats:

Collection	Documents	Data Size	Document Size	Storage Size	Indexes	Index Size
appV1	359,639,622	39.58GB	119B	8.78GB	2	20.06GB

Another interesting metric that we can keep an eye on through the application versions is the storage size needed, data, and index, to store one of the 500 million events—let's call it event stats. We can obtain this value by dividing the Data Size and Index Size of the initial scenario stats by the number of event documents. For the appV1, we have the following event stats:

Collection	Data Size/events	Index Size/events	Total Size/events
appV1	85B	43.1B	128.1B

Load test results

Executing the load test for appV1, we have the following results for Get Reports and Bulk Upsert:

The graphs above show that in almost no moment, the appV1 was able to reach the desired rates. The first stage of Bulk Upsert lasts for 50 minutes with a desired rate of 250 events per second. The event rate is only achieved in the first 10 minutes of the load test. The first stage of Get Reports lasts 10 minutes with a desired rate of 20 reports per second. The report rate is never achieved, with the highest value being 16.5 reports per second. As this is our first implementation and test, there is not much else to reason about.

Issues and improvements

The first issue that can be pointed out and improved in this implementation is the document schema in combination with the two indexes. Because the fields key and date are in an embedded document in the field _id, their values are indexed twice: by the default/obligatory index in the _id field and by the index we created to support the Bulk Upserts and Get Reports operations.

As the key field is a 64-character string and the date field is of type date, these two values use at least 68 bytes of storage. As we have two indexes, each document will contribute to 136 index bytes in a non-compressed scenario.

The improvement here is to extract the fields key and date from the _id field and let the _id field keep its default value of type ObjectId. The ObjectId data type takes only 12 bytes of storage.

This first implementation can be seen as a forced worst-case scenario to make the more optimized solutions look better. Unfortunately, that is not the case. It's not hard to find implementations like this on the internet and I've worked on a big project with a schema like this one, from where I got the idea for this first case.

Application Version 2 (appV2)

As discussed in the issues and improvements of appV1, embedding the fields key and date as a document in the _id field, trying to take advantage of its obligatory index, is not a good solution for our application because we would still need to create an extra index, and the index on the _id field would take more storage than needed.

To solve the issue of the index on the _id field being bigger than needed, the solution is to move the fields key and date out of the embedded document in the _id field, and let the _id field have its default value of type ObjectId. Each document would still register the status totals for one user, specified by the field key, on one day, specified by the field date, just like in appV1.

This second application version and its improvements would still have been developed by someone with a junior knowledge level of MongoDB but who has more depth in the documentation related to indexes in MongoDB, especially when indexing fields of type documents.

Esquema

The application implementation presented above would have the following TypeScript document schema denominated SchemaV2:

1 type SchemaV2 = {  
2   _id: ObjectId;  
3   key: string;  
4   date: Date;  
5   approved?: number;  
6   noFunds?: number;  
7   pending?: number;  
8   rejected?: number;  
9 };

Bulk upsert

Based on the specification presented, we have the following bulk updateOne operation for each event generated by the application:

1 const operation = {  
2   updateOne: {  
3     filter: { key: event.key, date: event.date },  
4     update: {  
5       $inc: {  
6         approved: event.approved,  
7         noFunds: event.noFunds,  
8         pending: event.pending,  
9         rejected: event.rejected,  
10       },  
11     },  
12     upsert: true,  
13   },  
14 };

Get reports

Five aggregation pipelines, one for each date interval, will be needed to fulfill the Get Reports operation. Each date interval will have the following pipeline, with just the date range in the $match filter being different:

1 const pipeline = [
2   {  
3     $match: {  
4       key: request.key,  
5       date: { $gte: Date.now() - oneYear, $lt: Date.now() },  
6     },  
7   },  
8   {  
9     $group: {  
10       _id: null,  
11       approved: { $sum: '$approved' },  
12       noFunds: { $sum: '$noFunds' },  
13       pending: { $sum: '$pending' },  
14       rejected: { $sum: '$rejected' },  
15     },  
16   },  
17 ];

Indexes

To support the filter/match criteria of Bulk Upsert and Get Reports, the following index was created in the appV2 collection:

1 const keys = { key: 1, date: 1 };  
2 const options = { unique: true };  
3   
4 db.appV2.createIndex(keys, options);

Initial scenario stats

Inserting the 500 million event documents for the initial scenario in the collection appV2 with the schema and Bulk Upsert function presented above, and also presenting the values from the previous versions, we have the following collection stats:

Collection	Documents	Data Size	Document Size	Storage Size	Indexes	Index Size
appV1	359,639,622	39.58GB	119B	8.78GB	2	20.06GB
appV2	359,614,536	41.92GB	126B	10.46GB	2	16.66GB

Calculating the event stats for appV2 and also presenting the values from the previous versions, we have the following:

Collection	Data Size/events	Index Size/events	Total Size/events
appV1	85B	43.1B	128.1B
appV2	90B	35.8B	125.8B

Analyzing the tables above, we can see that from appV1 to appV2, we increased the data size by 6% and decreased the index size by 17%. We can say that our goal of making the index on the _id field smaller was accomplished.

Looking at the event stats, the total size per event value decreased only by 1.8%, from 128.1B to 125.8B. With this difference being so small, there is a good chance that we won’t see significant improvements from a performance point of view.

Load test results

Executing the load test for appV2 and plotting it alongside the results for appV1, we have the following results for Get Reports and Bulk Upsert:

The graphs above show that in almost no moment, appV2 reached the desired rates, having a result very similar to the appV1, as predicted in the Initial Scenario Stats when we got only a 1.7% improvement in the event stats. appV2 only reached the Bulk Upsert desired rate of 250 events per second in the first 10 minutes of the test and got only 17 reports per second in Get Reports, lower than the 20 reports per second desired.

Comparing the two versions, we can see that appV2 performed better than appV1 for the Bulk Upsert operations and worse for the Get Reports operations. The improvement in the Bulk Upsert operations can be attributed to the indexes being smaller and the degradation in the Get Reports can be attributed to the document being bigger.

Issues and improvements

The following document is a sample from the collection appV2:

1 const document = {  
2   _id: ObjectId('6685c0dfc2445d3c5913008f'),  
3   key: '0000000000000000000000000000000000000000000000000000000000000001',  
4   date: new Date('2022-06-25T00:00:00.000Z'),  
5   approved: 10,  
6   noFunds: 3,  
7   pending: 1,  
8   rejected: 1,  
9 };

Analyzing it with the goal of reducing its size, two points of improvement can be found. One is the field key, which is a string and will always have 64 characters of hexadecimal data, and the other is the name of the statuses fields, which combined can have up to 30 characters.

The field key, as presented in the scenario section, is composed of hexadecimal data, in which each character requires four bits to be presented. In our implementation so far, we have stored this data as strings using UTF-8 encoding, in which each character requires eight bits to be represented. So, we are using double the storage we need. One way around this issue is to store the hexadecimal data in its raw format using the binary data.

For the status field names, we can see that the names of the fields use more storage than the value itself. The field names are strings with at least seven UTF-8 characters, which takes at least seven bytes. The value of the status fields is a 32-bit integer, which takes four bytes. We can shorthand the status names by their first character, where approved becomes a, noFunds becomes n, pending becomes p, e rejected becomes r.

Application Version 3 (appV3)

As discussed in the issues and improvements of appV2, to reduce the document size, two improvements were proposed. One is to convert the data type of the field key from a string to binary, requiring four bits to represent each hexadecimal character instead of the eight bits of a UTF-8 character. The other is to shorthand the name of the status fields by their first letter, requiring one byte for each field name instead of seven bytes. Each document would still register the status totals for one user, specified by the field key, on one day, specified by the field date, just like in the previous implementations.

To convert the key value from a string to binary/buffer, the following TypeScript function was created:

1 const buildKey = (key: string): Buffer => {  
2   return Buffer.from(key, 'hex');  
3 };

The third application version has two improvements compared to the second version. The improvement of storing the field key as binary data to reduce its storage needs would have been thought of by an intermediate to senior MongoDB developer who has read the MongoDB documentation many times and worked on different projects. The improvement of shorthanding the name of the status fields would have been thought of by an intermediate MongoDB developer who has gone through some of the MongoDB documentation.

Esquema

The application implementation presented above would have the following TypeScript document schema denominated SchemaV3:

1 type SchemaV3 = {  
2   _id: ObjectId;  
3   key: Buffer;  
4   date: Date;  
5   a?: number;  
6   n?: number;  
7   p?: number;  
8   r?: number;  
9 };

Bulk upsert

Based on the specification presented, we have the following bulk updateOne operation for each event generated by the application:

1 const operation = {  
2   updateOne: {  
3     filter: { key: buildKey(event.key), date: event.date },  
4     update: {  
5       $inc: {  
6         a: event.approved,  
7         n: event.noFunds,  
8         p: event.pending,  
9         r: event.rejected,  
10       },  
11     },  
12     upsert: true,  
13   },  
14 };

Get reports

Five aggregation pipelines, one for each date interval, will be needed to fulfill the Get Reports operation. Each date interval will have the following pipeline, with just the date range in the $match filter being different:

1 const pipeline = [  
2   {  
3     $match: {  
4       key: buildKey(request.key),  
5       date: { $gte: Date.now() - oneYear, $lt: Date.now() },  
6     },  
7   },  
8   {  
9     $group: {  
10       _id: null,  
11       approved: { $sum: '$a' },  
12       noFunds: { $sum: '$n' },  
13       pending: { $sum: '$p' },  
14       rejected: { $sum: '$r' },  
15     },  
16   },  
17 ];

Indexes

To support the filter/match criteria of Bulk Upsert and Get Reports, the following index was created in the appV3 collection:

1 const keys = { key: 1, date: 1 };  
2 const options = { unique: true };  
3   
4 db.appV3.createIndex(keys, options);

Initial scenario stats

Inserting the 500 million event documents for the initial scenario in the collection appV3 with the schema and Bulk Upsert function presented above, and also presenting the values from the previous versions, we have the following collection stats:

Collection	Documents	Data Size	Document Size	Storage Size	Indexes	Index Size
appV1	359,639,622	39.58GB	119B	8.78GB	2	20.06GB
appV2	359,614,536	41.92GB	126B	10.46GB	2	16.66GB
appV3	359,633,376	28.7GB	86B	8.96GB	2	16.37GB

Calculating the event stats for appV3 and also presenting the values from the previous versions, we have the following:

Collection	Data Size/events	Index Size/events	Total Size/events
appV1	85B	43.1B	128.1B
appV2	90B	35.8B	125.8B
appV3	61.6B	35.2B	96.8B

Analyzing the tables above, we can see that from appV2 to appV3, there was practically no change in the index size and a decrease of 32% in the data size. Our goal of reducing the document size was accomplished.

Looking at the event stats, the total size per event value decreased by 23%, from 125.8B to 96.8B. With this reduction, we’ll probably see considerable improvements.

Load test results

Executing the load test for appV3 and plotting it alongside the results for appV2, we have the following results for Get Reports and Bulk Upsert:

The graphs above clearly show that appV3 is more performant than appV2, and we are starting to get closer to some desired rates. appV3 was able to provide the desired rates for the first 100 minutes of Bulk Upsert operations: 250 events per second from 0 to 50 minutes, and 500 events per second from 50 minutes to 100 minutes. On the other hand, Get Report operations are still not able to reach the lower desired rate of 20 reports per second, but clearly had better performance than appV2, being able to keep the rate of 16 reports per second for half the test.

The entire performance improvement can be attributed to the reduction of the document size, as it was the only change between appV2 and appV3.

Issues and improvements

Looking at the collection stats of appV3 and thinking about how MongoDB is executing our queries and what indexes are being used, we can see that the _id field and its index aren’t being used in our application. The field itself is not a big deal from a performance standpoint, but its obligatory unique index is. Every time a new document is inserted in the collection, the index structure on the _id field has to be updated.

Going back to the idea from appV1 of trying to take advantage of the obligatory _id field and its index, is there a way that we can use it in our application?

Let’s take a look at our filtering criteria in the Get Report and Bulk Upsert functions:

1 const bulkUpsertFilter = {  
2   key: event.key,  
3   date: event.date,  
4 };
5 
6 const getReportsFilter = {  
7   key: request.key,  
8   date: {  
9     $gte: new Date('2021-06-15'),  
10     $lt: new Date('2022-06-15'),  
11   },  
12 };

In both filtering criteria, the key field is compared using equality. The date field is compared using equality in the Bulk Upsert and range in the Get Reports. What if we combine these two field values into just one, concatenating them, and store it in _id?

To guide us on how we should order the fields in the resulting concatenated value and get the best performance of the index on it, let’s follow the Equality, Sort, and Range rule (ESR).

As seen above, the key field is compared by equality in both cases, and the date field is compared by equality just in one case, so, let’s choose the key field for the first part of our concatenated value and the date field for the second part. As we don’t have a Sort operation in our queries, we can skip it. Next, we have Range comparison, which is used in the date field, so now it makes sense to keep it as the second part of our concatenated value. With that, the most optimal way of concatenating the two values and getting the best performance of its index is key + date.

One point of attention is how we are going to format the date field in this concatenation in a way that the range filter works, and we don’t store more data than we need. One possible implementation will be presented and tested in the next application version, appV4.

Application Version 4 (appV4)

As presented in the issues and improvements of appV3, one way to take advantage of the obligatory _id field and index is by storing in it the concatenated value of key + date. One thing that we need to cover now is what data type the _id field will have and how we are going to format the date field.

As seen in previous implementations, storing the key field as binary/hexadecimal data improved the performance. So, let’s see if we can also store the resulting concatenated field, key + date, as binary/hexadecimal.

To store the date field in a binary/hexadecimal type, we have some options. One could be converting it to a 4-byte timestamp that measures the seconds since the Unix epoch, and the other could be converting it to the format YYYYMMDD which stores year, month, and day. Both cases would require the same 32 bits/8 hexadecimal characters.

For our case, let’s use the second option and store the date value as YYYYMMDD because it will help in future implementations and improvements. Considering a key field with the value 0001 and a date field with the value 2022-01-01, we would have the following _id field:

1 const _id = Buffer.from('000120220101', 'hex');

To concatenate and convert the key and date fields to their desired format and type, the following TypeScript function was created:

1 const buildId = (key: string, date: Date): Buffer => {  
2   const day = date.toISOString().split('T')[0].replace(/-/g, ''); // YYYYMMDD  
3   return Buffer.from(`${key}${day}`, 'hex');  
4 };

Each document would still register the status totals for one user in one day, specified by the _id field, the same way it was done in the previous implementations.

Esquema

The application implementation presented above would have the following TypeScript document schema denominated SchemaV4:

1 type SchemaV4 = {  
2   _id: Buffer;  
3   a?: number;  
4   n?: number;  
5   p?: number;  
6   r?: number;  
7 };

Bulk upsert

Based on the specification presented, we have the following bulk updateOne operation for each event generated by the application:

1 const operation = {  
2   updateOne: {  
3     filter: { _id: buildId(event.key, event.date) },  
4     update: {  
5       $inc: {  
6         a: event.approved,  
7         n: event.noFunds,  
8         p: event.pending,  
9         r: event.rejected,  
10       },  
11     },  
12     upsert: true,  
13   },  
14 };

Get reports

Five aggregation pipelines, one for each date interval, will be needed to fulfill the Get Reports operation. Each date interval will have the following pipeline, with just the date used in the function buildId being different:

1 const pipeline = [  
2   {  
3     $match: {  
4       _id: {  
5         $gte: buildId(request.key, Date.now() - oneYear),  
6         $lt: buildId(request.key, Date.now()),  
7       },  
8     },  
9   },  
10   {  
11     $group: {  
12       _id: null,  
13       approved: { $sum: '$a' },  
14       noFunds: { $sum: '$n' },  
15       pending: { $sum: '$p' },  
16       rejected: { $sum: '$r' },  
17     },  
18   },  
19 ];

Indexes

As this implementation will use the _id field for its operations, it won’t need an extra index to support the Bulk Upsert and Get Reports operations.

Initial scenario stats

Inserting the 500 million event documents for the initial scenario in the collection appV4 with the schema and Bulk Upsert function presented above, and also presenting the values from the previous versions, we have the following collection stats:

Collection	Documents	Data Size	Document Size	Storage Size	Indexes	Index Size
appV1	359,639,622	39.58GB	119B	8.78GB	2	20.06GB
appV2	359,614,536	41.92GB	126B	10.46GB	2	16.66GB
appV3	359,633,376	28.7GB	86B	8.96GB	2	16.37GB
appV4	359,615,279	19.66GB	59B	6.69GB	1	9.5GB

Calculating the event stats for appV4 and also presenting the values from the previous versions, we have the following:

Collection	Data Size/events	Index Size/events	Total Size/events
appV1	85B	43.1B	128.1B
appV2	90B	35.8B	125.8B
appV3	61.6B	35.2B	96.8B
appV4	42.2B	20.4B	62.6B

Analyzing the tables above, we can see that from appV3 to appV4, we reduced the data size by 32% and the index size by 42%—big improvements. We also have one less index to maintain now.

Looking at the event stats, the total size per event value decreased by 35%, from 96.8B to 62.6B. With this reduction, we’ll probably see some significant improvements in performance.

Load test results

Executing the load test for appV4 and plotting it alongside the results for appV3, we have the following results for Get Reports and Bulk Upsert:

The graphs above show that appV4 is only slightly better than appV3. For the Bulk Upsert operations, both can provide the desired rates in the first 100 minutes, but neither can provide the desired rates in the last 100 minutes. However, appV4 has better rates than appV3. For the Get Reports operations, we’re still not achieving the lowest desired rate, but appV4 has better rates on average than appV3.

Issues and improvements

Enough of focusing on reducing document sizes for better performance—let’s look at the application behavior instead.

When generating the oneYear totals, the Get Reports function will need to retrieve something close to 60 documents on average, and in the worst-case scenario, 365 documents. To access each one of these documents, one index entry will have to be visited, and one disk read operation will have to be performed. How can we increase the data density of the documents in our application and reduce the index entries and read operations needed to perform the desired operation?

One way of doing that is by using the Bucket Pattern. According to the MongoDB documentation, "The bucket pattern separates long series of data into distinct objects. Separating large data series into smaller groups can improve query access patterns and simplify application logic."

Looking at our application from the perspective of the bucket pattern, so far, we have bucketed our data by daily user, each document containing the status totals for one user in one day. We can increase the bucketing range or our schema and store events or status totals from a week, month, or even quarter in one document.

Conclusão

That is the end of the first part of the series. We covered how indexes work on fields of type documents and saw some small changes that we can make to our application to reduce its storage and index needs, and as a consequence, improve its performance.

Here’s a quick review of the improvements made between the application versions:

appV1 to appV2: Moved the fields key and date out from an embedded document in the _id field and let it have its default value of ObjectId.
appV2 to appV3: Reduced the document size by short-handing the names of the status fields and changed the data type of the key field from string to binary/hexadecimal.
appV3 to appV4: Removed the need for an extra index by concatenating the values of key and date and storing them in the _id field.

So far, none of our applications have gotten close to the desired rates, but let’s not give up. As presented in the issues and improvements of appV4, we can still improve our application by using the Bucket Pattern. The Bucket Pattern with the Computed Pattern will be the main points of improvement for the next application version, appV5, and its revisions.

For any further questions, you can go to the MongoDB Community Forum, or if you want to build your application using MongoDB, the Centro de desenvolvedores do MongoDB has lots of examples in many different programming languages.

Appendices

Index on Embedded Documents {#index-on-embedded-documents}

Let’s take a look at how MongoDB indexes a field with a value of type document and see why we need an extra index for the appV1 implementation.

First, let’s check if the index on the _id field won’t be used for our queries by executing a find operation with the same filtering criteria used in the Bulk Upsert and Get Reports functions and applying the explain functionality to it.

1 // A sample document
2 const doc = {  
3   _id: { key: '0001', date: new Date('2020-01-01') },  
4   approved: 2,  
5   rejected: 1,  
6 };
7 
8 // Making sure we have an empty collection
9 db.appV1.drop();
10 
11 // Inserting the document in the `appV1` collection
12 db.appV1.insertOne(doc);
13 
14 // Finding a document using `Bulk Upsert` filtering criteria
15 const bulkUpsertFilter = {  
16   _id: { key: '0001', date: new Date('2020-01-01') },  
17 };
18 db.appV1.find(bulkUpsertFilter).explain('executionStats');
19 /*{
20 ...
21   executionStats: {
22     nReturned: 1,
23     totalKeysExamined: 1,
24     totalDocsExamined: 1,
25     ...
26     executionStages: {
27       stage: 'EXPRESS_IXSCAN',
28       ...
29     }
30     ...
31   },
32   ...
33 }*/
34 
35 // Finding a document using `Get Reports` filtering criteria
36 const getReportsFilter = {  
37   '_id.key': '0001',  
38   '_id.date': { $gte: new Date('2019-01-01'), $lte: new Date('2021-01-01') },  
39 };
40 db.appV1.find(getReportsFilter).explain('executionStats');
41 /*{
42 ...
43   executionStats: {
44     nReturned: 1,
45     totalKeysExamined: 0,
46     totalDocsExamined: 1,
47     ...
48     executionStages: {
49       stage: 'COLLSCAN',
50       ...
51     }
52     ...
53   },
54   ...
55 }*/

As shown by the output of the explainable queries, we have a collection scan (COLLSCAN) for the Get Reports filtering criteria, which indicates that an index wasn’t used to execute the query.

Most data types supported in MongoDB will be directly indexed without any special treatment or conversion. The special cases are fields of type array or documents. The array case is not our current focus, but it can be seen in Create an Index on an Array Field. The document or embedded document case can be seen in Create an Index on an Embedded Document. Using the knowledge of the document case in our implementation, we could say that the value of the field _id in the index structure would be a stringified version of the embedded document.

1 const documentValue = { key: '0001', date: '2010-01-01T00:00:00.000Z' };  
2 const indexValue = "{key:0001,date:2010-01-01T00:00:00.000Z}";

With the index value being a blob of data, MongoDB is not capable of accessing inner/embedded values, because they don’t exist in this representation, and as a consequence, MongoDB cannot use the index value to filter by _id.key or _id.date.

Principais comentários nos fóruns

Ainda não há comentários sobre este artigo.

Iniciar a conversa

Classifique este artigo

Relacionado

exemplo de código

Gestão de revistas

Sep 11, 2024 | 0 min read

Tutorial

RAG priorizado localmente e aprimorado por meio do Rastreamento comDeepSeek e MongoDB

Feb 06, 2025 | 8 min read

Tutorial

Introdução ao MongoDB e ao AWS Codewhisperer

Sep 26, 2024 | 3 min read

Tutorial

Introdução ao MongoDB e ao Amazon Q Assistente de codificação IA generativa

Sep 25, 2024 | 3 min read

Sumário

The application: finding fraudulent behavior in transactions
The load test
Initial scenario and data generator
The instances configuration
Application Version 1 (appV1)
Application Version 2 (appV2)
Application Version 3 (appV3)
Application Version 4 (appV4)
Conclusão
Appendices

1	const event = {
2	key: '0000000000000000000000000000000000000000000000000000000000000001',
3	date: new Date('2022-02-01'),
4	pending: 1,
5	};

1	export const reports = [
2	{
3	id: 'oneYear',
4	end: new Date('2022-06-15T00:00:00.000Z'),
5	start: new Date('2021-06-15T00:00:00.000Z'),
6	totals: { approved: 4, noFunds: 1, pending: 1, rejected: 1 },
7	},
8	{
9	id: 'threeYears',
10	end: new Date('2022-06-15T00:00:00.000Z'),
11	start: new Date('2019-06-15T00:00:00.000Z'),
12	totals: { approved: 8, noFunds: 2, pending: 2, rejected: 2 },
13	},
14	{
15	id: 'fiveYears',
16	end: new Date('2022-06-15T00:00:00.000Z'),
17	start: new Date('2017-06-15T00:00:00.000Z'),
18	totals: { approved: 12, noFunds: 3, pending: 3, rejected: 3 },
19	},
20	{
21	id: 'sevenYears',
22	end: new Date('2022-06-15T00:00:00.000Z'),
23	start: new Date('2015-06-15T00:00:00.000Z'),
24	totals: { approved: 16, noFunds: 4, pending: 4, rejected: 4 },
25	},
26	{
27	id: 'tenYears',
28	end: new Date('2022-06-15T00:00:00.000Z'),
29	start: new Date('2012-06-15T00:00:00.000Z'),
30	totals: { approved: 20, noFunds: 5, pending: 5, rejected: 5 },
31	},
32	];

1	type SchemaV1 = {
2	_id: {
3	key: string;
4	date: Date;
5	};
6	approved?: number;
7	noFunds?: number;
8	pending?: number;
9	rejected?: number;
10	};

1	const operation = {
2	updateOne: {
3	filter: {
4	_id: { date: event.date, key: event.key },
5	},
6	update: {
7	$inc: {
8	approved: event.approved,
9	noFunds: event.noFunds,
10	pending: event.pending,
11	rejected: event.rejected,
12	},
13	},
14	upsert: true,
15	},
16	};

1	const pipeline = [
2	{
3	$match: {
4	'_id.key': request.key,
5	'_id.date': { $gte: Date.now() - oneYear, $lt: Date.now() },
6	},
7	},
8	{
9	$group: {
10	_id: null,
11	approved: { $sum: '$approved' },
12	noFunds: { $sum: '$noFunds' },
13	pending: { $sum: '$pending' },
14	rejected: { $sum: '$rejected' },
15	},
16	},
17	];

1	const keys = { '_id.key': 1, '_id.date': 1 };
2	const options = { unique: true };
3
4	db.appV1.createIndex(keys, options);

1	type SchemaV2 = {
2	_id: ObjectId;
3	key: string;
4	date: Date;
5	approved?: number;
6	noFunds?: number;
7	pending?: number;
8	rejected?: number;
9	};

1	const operation = {
2	updateOne: {
3	filter: { key: event.key, date: event.date },
4	update: {
5	$inc: {
6	approved: event.approved,
7	noFunds: event.noFunds,
8	pending: event.pending,
9	rejected: event.rejected,
10	},
11	},
12	upsert: true,
13	},
14	};

1	const pipeline = [
2	{
3	$match: {
4	key: request.key,
5	date: { $gte: Date.now() - oneYear, $lt: Date.now() },
6	},
7	},
8	{
9	$group: {
10	_id: null,
11	approved: { $sum: '$approved' },
12	noFunds: { $sum: '$noFunds' },
13	pending: { $sum: '$pending' },
14	rejected: { $sum: '$rejected' },
15	},
16	},
17	];

1	const keys = { key: 1, date: 1 };
2	const options = { unique: true };
3
4	db.appV2.createIndex(keys, options);

1	const document = {
2	_id: ObjectId('6685c0dfc2445d3c5913008f'),
3	key: '0000000000000000000000000000000000000000000000000000000000000001',
4	date: new Date('2022-06-25T00:00:00.000Z'),
5	approved: 10,
6	noFunds: 3,
7	pending: 1,
8	rejected: 1,
9	};

1	const buildKey = (key: string): Buffer => {
2	return Buffer.from(key, 'hex');
3	};

1	type SchemaV3 = {
2	_id: ObjectId;
3	key: Buffer;
4	date: Date;
5	a?: number;
6	n?: number;
7	p?: number;
8	r?: number;
9	};

1	const operation = {
2	updateOne: {
3	filter: { key: buildKey(event.key), date: event.date },
4	update: {
5	$inc: {
6	a: event.approved,
7	n: event.noFunds,
8	p: event.pending,
9	r: event.rejected,
10	},
11	},
12	upsert: true,
13	},
14	};

1	const pipeline = [
2	{
3	$match: {
4	key: buildKey(request.key),
5	date: { $gte: Date.now() - oneYear, $lt: Date.now() },
6	},
7	},
8	{
9	$group: {
10	_id: null,
11	approved: { $sum: '$a' },
12	noFunds: { $sum: '$n' },
13	pending: { $sum: '$p' },
14	rejected: { $sum: '$r' },
15	},
16	},
17	];

1	const keys = { key: 1, date: 1 };
2	const options = { unique: true };
3
4	db.appV3.createIndex(keys, options);

1	const bulkUpsertFilter = {
2	key: event.key,
3	date: event.date,
4	};
5
6	const getReportsFilter = {
7	key: request.key,
8	date: {
9	$gte: new Date('2021-06-15'),
10	$lt: new Date('2022-06-15'),
11	},
12	};

1	const buildId = (key: string, date: Date): Buffer => {
2	const day = date.toISOString().split('T')[0].replace(/-/g, ''); // YYYYMMDD
3	return Buffer.from(`${key}${day}`, 'hex');
4	};

1	type SchemaV4 = {
2	_id: Buffer;
3	a?: number;
4	n?: number;
5	p?: number;
6	r?: number;
7	};

1	const operation = {
2	updateOne: {
3	filter: { _id: buildId(event.key, event.date) },
4	update: {
5	$inc: {
6	a: event.approved,
7	n: event.noFunds,
8	p: event.pending,
9	r: event.rejected,
10	},
11	},
12	upsert: true,
13	},
14	};

1	// A sample document
2	const doc = {
3	_id: { key: '0001', date: new Date('2020-01-01') },
4	approved: 2,
5	rejected: 1,
6	};
7
8	// Making sure we have an empty collection
9	db.appV1.drop();
10
11	// Inserting the document in the `appV1` collection
12	db.appV1.insertOne(doc);
13
14	// Finding a document using `Bulk Upsert` filtering criteria
15	const bulkUpsertFilter = {
16	_id: { key: '0001', date: new Date('2020-01-01') },
17	};
18	db.appV1.find(bulkUpsertFilter).explain('executionStats');
19	/*{
20	...
21	executionStats: {
22	nReturned: 1,
23	totalKeysExamined: 1,
24	totalDocsExamined: 1,
25	...
26	executionStages: {
27	stage: 'EXPRESS_IXSCAN',
28	...
29	}
30	...
31	},
32	...
33	}*/
34
35	// Finding a document using `Get Reports` filtering criteria
36	const getReportsFilter = {
37	'_id.key': '0001',
38	'_id.date': { $gte: new Date('2019-01-01'), $lte: new Date('2021-01-01') },
39	};
40	db.appV1.find(getReportsFilter).explain('executionStats');
41	/*{
42	...
43	executionStats: {
44	nReturned: 1,
45	totalKeysExamined: 0,
46	totalDocsExamined: 1,
47	...
48	executionStages: {
49	stage: 'COLLSCAN',
50	...
51	}
52	...
53	},
54	...
55	}*/

1	const documentValue = { key: '0001', date: '2010-01-01T00:00:00.000Z' };
2	const indexValue = "{key:0001,date:2010-01-01T00:00:00.000Z}";