Entangled: A Story of Data Re-modeling and 10x Storage Reduction
Nenad Milosavljevic • 5 min read • Published Dec 14, 2023 • Updated Dec 14, 2023
One of the most distinctive projects I've worked on is an application named Entangled. Developed in partnership with the Princeton Engineering Anomalies Research lab (PEAR), The Global Consciousness Project, and the Institute of Noetic Sciences, Entangled aims to test human consciousness.
The application uses a quantum random number generator to measure the influence of human consciousness. A quantum generator is essential because conventional computers, being deterministic, cannot produce truly random numbers. The generator outputs random sequences of 0s and 1s, and over a sufficiently large sample, the counts of 0s and 1s should be roughly equal.
For the quantum random number generation, we used an in-house Quantis QRNG USB device. This device is plugged into our server, and through specialized drivers, we programmatically obtain the random sequences directly from the USB device.
Experiments were conducted to determine if a person could influence these quantum devices with their thoughts, specifically by thinking about more 0s or 1s. The results were astonishing, demonstrating the real potential of this influence.
To expand this test globally, we developed a new application. This platform allows users to sign up and track their contributions. The system generates a new random number for each user every second. Every hour, these contributions are grouped for analysis at personal, city, and global levels. We calculate the standard deviation of these contributions, and if this deviation exceeds a certain threshold, users receive notifications.
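To make the hourly check concrete, here is a minimal sketch of one way such a deviation test could be implemented, assuming each contribution is a sum of fair random bits so the group total can be compared against binomial expectations. The function names, the threshold, and the observed counts are illustrative assumptions, not values from the production system.

```javascript
// Illustrative sketch (not the production code): flag a group of hourly
// contributions whose count of 1s deviates from chance by more than a
// chosen number of standard deviations.
// Assumes fair bits, so the count of 1s in n bits has mean n/2 and
// standard deviation sqrt(n)/2.
function deviationInSigmas(totalOnes, totalBits) {
  const mean = totalBits / 2;
  const stdDev = Math.sqrt(totalBits) / 2;
  return Math.abs(totalOnes - mean) / stdDev;
}

// Example: one hour for a group of 500 users, one bit per user per second.
const totalBits = 500 * 3600;  // 1,800,000 bits
const totalOnes = 902500;      // made-up observed count of 1s

const THRESHOLD_SIGMAS = 3;    // illustrative threshold
if (deviationInSigmas(totalOnes, totalBits) > THRESHOLD_SIGMAS) {
  console.log("Deviation exceeds the threshold - notify this group's users");
}
```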
This data supports various experiments. For instance, in the "Earthquake Prediction" experiment, we use the contributions from all users in a specific area. If the standard deviation is higher than the set threshold, it may indicate that users have predicted an earthquake.
As the lead backend developer, with MongoDB being my preferred database for all projects, I found it a natural choice for Entangled.
For the backend development, I chose Node.js (Express), along with the Mongoose library for schema definition and data modeling. Mongoose, an Object Data Modeling (ODM) library for MongoDB, is widely used in the Node.js ecosystem for its ability to provide a straightforward way to model our application data.
Careful schema modeling was crucial due to the anticipated scaling of the database. Remember, we were generating one random number per second for each user.
My initial instinct was to create an hourly-based schema, aligning with our hourly analytics snapshots. The initial schema was structured as follows:
- User: a reference to the "Users" collection
- Total Sum: the sum of the user's random numbers for the hour; since each number is either 0 or 1, the sum alone is sufficient for later analysis
- Generated At: the timestamp of the snapshot
- Data File: a reference to the "Data Files" collection, which contains all random numbers generated by all users in a given hour
```javascript
const { Schema, model } = require("mongoose");

const hourlyMetricSchema = new Schema({
  user: { type: Schema.Types.ObjectId, ref: "Users" },
  total_sum: { type: Number },
  generated_at: { type: Date },
  data_file: { type: Schema.Types.ObjectId, ref: "DataFiles" }
});

// Compound index for "user" (ascending) and "generated_at" (descending) fields
hourlyMetricSchema.index({ user: 1, generated_at: -1 });

const HourlyMetrics = model("HourlyMetrics", hourlyMetricSchema);

module.exports = HourlyMetrics;
```
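For illustration, a write and a read with this schema might look like the sketch below. The function names, module path, and the arguments (`userId`, `dataFileId`, `sumOfBitsForHour`) are placeholders I've assumed for the example.

```javascript
// Hypothetical usage of the hourly schema (names and paths are placeholders).
const HourlyMetrics = require("./hourly-metrics"); // assumed module path

// Persist one user's hourly snapshot: 3,600 one-bit samples summed into total_sum.
async function saveHourlySnapshot(userId, dataFileId, sumOfBitsForHour) {
  await HourlyMetrics.create({
    user: userId,
    total_sum: sumOfBitsForHour,
    generated_at: new Date(),
    data_file: dataFileId
  });
}

// Read a user's last 24 snapshots; the { user, generated_at } compound index
// supports this access pattern.
async function lastDayOfSnapshots(userId) {
  return HourlyMetrics.find({ user: userId })
    .sort({ generated_at: -1 })
    .limit(24);
}
```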
Although intuitive, this schema faced a significant scaling challenge. We expected over 100,000 users soon after launch, which meant about 2.4 million new documents daily, or 72 million monthly. Consequently, we were looking at approximately 5GB of storage (data plus indexes) each month.
This encouraged me to explore whether alternative modeling approaches could reduce storage requirements while also improving scalability and cost-efficiency.
A significant observation was that out of 5GB of total storage, 3.5GB was occupied by indexes, a consequence of the large volume of documents.
This led me to experiment with a schema redesign, shifting from hourly to daily metrics. The new schema was structured as follows:
```javascript
const { Schema, model } = require("mongoose");

const dailyMetricSchema = new Schema({
  user: { type: Schema.Types.ObjectId, ref: "Users" },
  date: { type: Date },
  samples: [
    {
      total_sum: { type: Number },
      generated_at: { type: Date },
      data_file: { type: Schema.Types.ObjectId, ref: "DataFiles" }
    }
  ]
});

// Compound index for "user" (ascending) and "date" (descending) fields
dailyMetricSchema.index({ user: 1, date: -1 });

const DailyMetrics = model("DailyMetrics", dailyMetricSchema);

module.exports = DailyMetrics;
```
Rather than storing metrics for just one hour in each document, I now aggregated an entire day's metrics in a single document. Each document included a "samples" array with 24 entries, one for each hour of the day.
It's important to note that this approach works well because the array has a bounded size: a day has only 24 hours. This is very different from the anti-pattern of unbounded, ever-growing arrays in MongoDB.
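To make the bucket approach concrete, here is a sketch of how an hourly sample might be appended to its user's daily document with a single upsert. The day-truncation logic and the names are assumptions for illustration, not the project's actual aggregation job.

```javascript
// Hypothetical sketch: push one hourly sample into the user's daily bucket.
// Truncating the timestamp to midnight UTC keys all 24 samples of a day
// to the same document; upsert creates the bucket on the first hour.
const DailyMetrics = require("./daily-metrics"); // assumed module path

async function appendHourlySample(userId, dataFileId, sumOfBitsForHour) {
  const now = new Date();
  const day = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));

  await DailyMetrics.updateOne(
    { user: userId, date: day },
    {
      $push: {
        samples: {
          total_sum: sumOfBitsForHour,
          generated_at: now,
          data_file: dataFileId
        }
      }
    },
    { upsert: true }
  );
}
```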
This schema change had a significant impact. The storage requirement for a month's worth of data drastically dropped from 5GB to just 0.49GB. This was mainly due to the decrease in index size, from 3.5GB to 0.15GB. The number of documents required each month dropped from 72 million to 3 million.
Encouraged by these results, I didn't stop there. My next step was to consider the potential benefits of shifting to a monthly-metrics schema. Could this further optimize our storage? This was the question that drove my next phase of exploration.
The monthly-metrics schema was essentially identical to the daily-metrics schema. The key difference lay in how the data was stored in the "samples" array, which now contained approximately 720 records representing a full month's metrics.
```javascript
const { Schema, model } = require("mongoose");

const monthlyMetricSchema = new Schema({
  user: { type: Schema.Types.ObjectId, ref: "Users" },
  date: { type: Date },
  samples: [
    {
      total_sum: { type: Number },
      generated_at: { type: Date },
      data_file: { type: Schema.Types.ObjectId, ref: "DataFiles" }
    }
  ]
});

// Compound index for "user" (ascending) and "date" (descending) fields
monthlyMetricSchema.index({ user: 1, date: -1 });

const MonthlyMetrics = model("MonthlyMetrics", monthlyMetricSchema);

module.exports = MonthlyMetrics;
```
This adjustment was expected to further reduce the document count to around 100,000 documents for a month, leading me to anticipate even greater storage optimization. However, the actual results were surprising.
Upon storing a month's worth of data under this new schema, the storage size unexpectedly increased from 0.49GB to 0.58GB. This increase is likely due to the methods MongoDB's WiredTiger storage engine uses to compress arrays internally.
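For readers who want to run a similar comparison, data and index sizes can be read from MongoDB's collection storage statistics, for example via a `$collStats` aggregation stage as sketched below. This is one way to obtain such numbers, not necessarily how we measured ours.

```javascript
// Sketch: print on-disk data and index size for a model's collection using
// the $collStats aggregation stage (storageStats values are in bytes).
async function printStorageStats(model) {
  const [stats] = await model.aggregate([{ $collStats: { storageStats: {} } }]);
  const { count, storageSize, totalIndexSize } = stats.storageStats;
  const toGB = (bytes) => (bytes / 1024 ** 3).toFixed(2);
  console.log(
    `${model.collection.name}: ${count} documents, ` +
      `data ${toGB(storageSize)} GB, indexes ${toGB(totalIndexSize)} GB`
  );
}

// e.g. await printStorageStats(DailyMetrics); await printStorageStats(MonthlyMetrics);
```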
Below is a detailed summary of the different approaches and their respective results for one month’s worth of data:
|  | Hourly Document | Daily Document | Monthly Document |
| --- | --- | --- | --- |
| Document Size | 0.098 KB | 1.67 KB | 49.18 KB |
| Total Documents (per month) | 72,000,000 (100,000 users × 24 hours × 30 days) | 3,000,000 (100,000 users × 30 days) | 100,000 (100,000 users) |
| Storage Size | 1.45 GB | 0.34 GB | 0.58 GB |
| Index Size | 3.49 GB | 0.15 GB | 0.006 GB |
| Total Storage (Data + Index) | 4.94 GB | 0.49 GB | 0.58 GB |
In this exploration of schema modeling for the Entangled project, we investigated the challenges and solutions for managing large-scale data in MongoDB.
Our journey began with hourly metrics, which, while intuitive, posed significant scaling challenges due to the large volume of data and index size.
This prompted a shift to daily metrics, which cut total storage roughly 10x, primarily due to a significant decrease in index size.
The experiment with monthly metrics offered an unexpected twist. Although it further reduced the number of documents, it increased the overall storage size, likely due to the internal compression mechanics of MongoDB's WiredTiger storage engine.
This case study highlights the critical importance of schema design in database management, especially when dealing with large volumes of data. It also emphasizes the need for continuous experimentation and optimization to balance storage efficiency, scalability, and cost.
If you want to learn more about designing efficient schemas with MongoDB, I recommend checking out the MongoDB Data Modeling Patterns series.