Improving MongoDB Queries by Simplifying Boolean Expressions
Key takeaways
Document databases encourage storing data in fewer collections with many fields, in contrast to relational databases’ normalization principle.
This approach improves efficiency but requires careful handling of complex filters.
Simplifying Boolean expressions can improve query performance by reducing computational overhead and enabling better plan generation.
The solution uses a modified Quine–McCluskey algorithm and Petrick’s method on an efficient bitset representation of Boolean expressions.
MongoDB's culture supports innovation by empowering engineers to tackle problems and solve them from beginning to end.
Introduction
Document databases
like MongoDB encourage a strategy that involves storing data in fewer collections with a large number of fields. This is in contrast to the normalization principle of relational databases, which recommends spreading data across numerous tables with a smaller number of fields. Denormalization and avoiding complex joins is a source of efficiency for document databases. However, as the filters become complex, you must take care to handle them properly. Poorly handled complex filters can negatively affect database performance, resulting in slower query responses and higher resource usage. Addressing complex filters becomes a critical task.
One optimization technique to mitigate the performance issues with complex filters is Boolean expression simplification. This involves reducing complex Boolean expressions into simpler forms, which the query engine can evaluate more efficiently. As a result, the database can execute queries faster and with less computational overhead.
To demonstrate the importance of filter simplification, consider a real MongoDB customer case. The query in the case was enormous in size and clearly machine-generated, which the optimizer couldn't handle efficiently. It was clear that simplifying the query would help the optimizer find a better plan.
Here's a smaller example inspired by that case: simplifying filters can lead to more efficient query plans.
db.collection.createIndex({b: 1})
filter = {$or: [{$and: [{a: 1}, {a: {$ne: 1}}]}, {b: 2}]}
db.collection.find(filter)
The query involves predicates on the field “a” so the optimizer cannot generate an Index Scan plan for the unoptimized version of the query and opts for a collection scan plan. However, the simplified expression:
{b: 2}
makes the IndexScan plan possible.
The difference between the plans can be drastic for large collections and selective indexes. In our benchmark, the simplifier showcased a remarkable 18,100% throughput improvement in this scenario.
The benefits of the filter simplification can come in different flavors: in one of our benchmarks testing large queries, execution time was cut in half in a collection scan plan. This improvement is due to the faster execution of the simplified queries.
Simplifying Boolean expressions: Modified Quine–McCluskey algorithm and Petrick’s method
The journey to find a good solution for Boolean simplification was surprisingly complex. The final solution can be expressed in just four steps:
Convert the simplifying expression into bitset DNF form
Apply QMC reduction operation of DNF terms: (x ∧ y) ∨ (x ∧ ¬y) = x
Apply Absorption law: x ∨ (x ∧ y) = x
Use Petrick’s method for further simplification
Yet we faced a few challenges along the way. This section will explore the challenges and how we addressed them.
Simplifying AST
The initial solution, applying Boolean laws directly to filters in the Abstract Syntax Tree (AST) format, which MongoDB's query engine uses, proved to be resource-intensive, consuming significant memory and CPU time. This outlines the first challenge in the journey: finding an efficient method to simplify Boolean expressions.
One issue with the initial solution was that transformations could be applied repeatedly, leading to the same expression appearing multiple times. To address this, we needed to store previously observed expressions in memory and check each time if an expression had already been seen.
Modified Quine–McCluskey
The Quine–McCluskey algorithm, commonly used in digital circuit design, offers a different approach with a finite number of steps. The explanation of the algorithm might seem daunting but in essence, we just apply the following reduction rule to a pair of expressions:
(x ∧ y) ∨ (x ∧ ¬y) = x
This allows the pair of expressions to be reduced into one.
A challenge with the Quine–McCluskey algorithm is that it requires a list of prime implicants as input. An implicant of a Boolean expression is a combination of predicates that makes the expression true, essentially a row of the truth table where the Boolean expression is true.
To obtain a list of prime implicants from a given Boolean expression, we need to calculate a truth table. This involves executing the expression (2^n) times, producing up to (2^n) implicants. For an expression with 10 predicates, we need to evaluate the expression 1024 times. With 20 predicates, the number of evaluations is 1,048,576. This is impractical.
Implicants are similar to Disjunctive Normal Form (DNF) minterms, as both evaluate the expression to be true. We could use DNF instead of implicants, but here is another challenge: even though the expressions below are equivalent:
(a ∧ b) ∨ (a ∧ ¬b)
a ∨ (a ∧ ¬b)
Only the first one is represented via implicants: (a ∧ b) is an implicant of the expressions, and can be simplified by QMC, while (a) is not, as it says nothing about the value of predicate (b).
Absorption law
Tackling the previous challenge, we face another one: now we need to find a way where we can compensate for the reduced power of QMC. Fortunately, this one can be resolved using Absorption Law:
x ∨ (x ∧ y) = x
With its help, we can simplify the expression from the previous example a ∨ (a ∧ ¬b) to just a.
Effective bitset representation
One proved to be good optimization is to use bitwise instructions on a bitset representation of Boolean expressions instead of working with the AST representation. This approach boosts performance by taking advantage of the speed and simplicity of bitwise operations, which are generally quicker and more straightforward than working with the more complex AST structure.
Petrick’s method
Can we do better and ensure the expression stays accurate while further reducing redundancy?
For example, for the given expression:
(¬a ∧ ¬b ∧ ¬c) ∨ (¬a ∧ b ∧ ¬c) ∨ (a ∧ ¬b ∧ ¬c) ∨ (¬a ∧ b ∧ c)
After some QMC reduction steps are applied, we end up with the following expression:
(¬a ∧ ¬c) ∨ (¬b ∧ ¬c) ∨ (¬a ∧ b)
However, this can be reduced further down to:
(¬b ∧ ¬c) ∨ (¬a ∧ b)
This redundancy comes from the fact that when we reduce two terms into one - the new term sort of covers (or we can say represents) two original ones, and some of the derivative terms can be removed without breaking the “coverage” of the original expression.
To find the minimal “coverage” we use
Petrick’s method
. The idea of this method is just to express the coverage as a Boolean expression and then find the minimal combination of its predicates which evaluate the expression to be true. It can be done by transforming the expression into DNF and picking up the smallest minterm.
Boolean expressions in MongoDB
In the previous section, we developed the algorithm for simple Boolean expressions. However, we can't use it to simplify filters in MongoDB just yet. The MongoDB Query Language (MQL) has unique logical semantics, particularly regarding negation, handling missing values, and array values.
Logical operators
MQL supports the following logical operators:
$and - regular conjunction operator.
$or - regular disjunction operator.
$not - Negation operator with special semantics, see below for details
$nor - Negation of disjunction, which is equivalent to conjunction of negations (DeMorgan Laws)
The only specific operator is $nor, which can be easily converted to the conjunction of negations.
MQL negation semantics
MongoDB's flexible schema can lead to some fields being missing in documents. For instance, in a collection with documents {a: 1, b: 1} and {a: 2}, the field b is missing in the second document.
When dealing with negations in MQL, missing values play an important role. Negations will match documents where the specified field is missing. Let's look at a collection of four documents and see how they answer to different queries (see Table 1):
Table 1:
Negating comparison expressions in MQL.
As shown, the $not operator matches documents where the field "a" is missing.
The $ne operator in MQL is only a syntactic sugar for {$not: {$eq: ...}}, which means it matches documents where the specified field is missing (see Table 2):
Table 2:
Negating Equality Expressions in MQL.
Let’s prove that for MQL negations Boolean algebra laws still hold. To do that we need to introduce new notations:
not* - MQL negation
exists - exists A means that the field of predicate A does exist
not-exists - is a short for not (exists A), and means that the field of predicate A does not exist
Being short for not (exists), not-exists behaves in the same way as a normal not:
not-exists(A and B) = not-exists(A) or non-exists(B)
not-exists(A or B) = not-exists(A) and non-exists(B)
The following equation is always true by the definition of MQL negation:
not* A = not A or not (exists A)
We also state that the predicate A cannot be true when its field is missing:
A and not-exists A = false
This is true even for operator {$exists: 0} in MQL: e.g. {$and: [{a: {$not: {$exists: 0}}}, {a: {$exists: 0}}]} always returns an empty set of documents. Internally {$exists: false} is implemented through negation of {$exists: 1}, hence the following statements are also true in MQL:
not* (not-exists A) = exists A
not* (exists A) = not-exists A
Complement laws
We need to prove that:
A and not* A = false
A or not* A = true
The proof of the first Complement law:
A and not* A =
A and (not A or not-exists A) =
(A and not A) or (A and not-exists A) =
false or false =
false
The proof of the second Complement law:
A or not* A =
A or (not A or not-exists A) =
A or not A or not-exists A =
true
DeMorgan laws
Proof of the first DeMorgan law not* (A and B) = (not* A) or (not* B):
not* (A and B) =
not (A and B) or not-exists (A and B) =
(not A or not B) or (not-exists A or not-exists B) =
(not A or not-exists A) or (not B or not-exists B) =
not* A or not* B
The proof of the second DeMorgan law not* (A or B) = (not* A) and (not* B) is a bit more complicated:
not* (A or B) =
not (A or B) or not-exists (A or B) =
(not A and not B) or (not-exists A and not-exists B) =
(not A or not-exists A) and (not B or not-exists A) and (not A or not-exists B) and (not B or not-exists B) =
((not* A) and (not* B)) and ((not A and not B) or (not A and not-exist B) or (not B and not-exists A) or (not-exists A and not-exists B)) =
((not* A) and (not* B)) and ((not* A) and (not* B)) =
(not* A) and (not* B)
// because
(not* A) and (not* B) = (notA and not B) or (not A and not-exist B) or (not B and not-exists A) or (not-exists A and not-exists B)
Involution law
We need to prove that not* (not* A) = A:
not* (not* A) =
not* (not A or not-exists A) =
not* (not A) and not* (not-exists A) =
(A or not-exists A) and (exists A) =
(A and exists A) or (not-exists A and exists A) =
A or false =
A
Here we used DeMorgan law for not* proved in the previous section.
MQL array values semantics
In MQL, array values behave as expected and don’t change the semantics of Boolean operators. However, they alter the semantics of operands within predicates. It’s helpful to think of it as implicitly adding the word “contains” to the operators interpretation:
{a: {$eq: 7}} means “array a contains a value equal 7”
{a: {$ne: 7}}, which is always equivalent to {a: {$not: {$eq: 7}}, means “array does not contain a value equal 7”
{a: {$gt: 7}} - “array a contains a value greater than 7”.
Since the Boolean simplifier focuses on the logical structure of expressions, rather than the internal meaning of predicates, it's irrelevant whether the value is an array or scalar.
The one important difference with array values semantics is that they behave differently on interval simplifications. If for a scalar value the following interval transformations are always valid, and invalid for arrays:
scalar > 100 && scalar <= 1 => scalar: () // empty interval
scalar < 100 && scalar >= 1 => scalar: [1, 100)
These differences are important for building Index Scan Bounds. However, as previously mentioned, they are not significant for the Boolean simplifier.
Implementation considerations
We faced several challenges during implementation. To achieve maximum performance, we created our own version of bitsets in C++. The existing options, such as boost::dynamic_bitset, were too slow, and std::bitset lacked flexibility. Memory management was crucial, so we worked hard to reduce memory allocations. This included using an optimized storage for our bitset and a
C++ polymorphic allocator
for some algorithms.
We released the simplifier cautiously, enabling it only for specific types of expressions. We plan to expand the range of expressions that can be simplified in the future. We also set a limit on the maximum number of terms in DNF. Before each transformation, we estimate the number of resulting terms. If the number is too high, we cancel the simplification.
Conclusion
In this post, we described an effective algorithm to simplify Boolean expressions. We used bitsets to represent query filters, modified the Quine-McCluskey algorithm, and applied Petrick's method. Finally, we proved that the algorithm works with MQL's Boolean algebra.
This journey began, as does so much at MongoDB, with a real-world customer example. The customer was struggling with a complex query, and we developed the Boolean Expression Simplifier to help them overcome their specific challenge—and ended up enhancing MongoDB's performance for any user dealing with intricate filters. As noted before, in one particularly demanding case involving large collections and selective indexes, the Simplifier helped achieve an 18,100% throughput improvement"
This project reflects MongoDB’s longstanding commitment to turning customer challenges into successes, and our practice of taking what we learn from individual projects and making broad improvements.
We’re always working to improve MongoDB’s performance, and the simplifier does this—it enables faster and more efficient query execution. Finally, this project is a good example of how MongoDB's culture of trust and ownership can lead to impact. Our engineers are empowered to innovate from concept to completion, transforming theoretical ideas like Boolean expression minimization into tangible gains that directly improve the customer experience.
Join our
MongoDB Community
to learn about upcoming events, hear stories from MongoDB users, and connect with community members from around the world.
January 30, 2025