Breaking the WiredTiger Logjam: The Wait-Free Solution (2/2)

MongoDB
May 16, 2017 | Updated: September 19, 2022
#Engineering #EngineeringBlog

Part one of this pair explored the original algorithm the WiredTiger write-ahead log used to consolidate writes in order to minimize IO. It used atomic compare-and-swap operations in two phases to accomplish this without time-consuming locking. This algorithm worked extremely well as long as there were no more than a few threads running per core. But its reliance on busy-waiting to avoid locking caused a logjam when the number of threads increased beyond that limit -- a serious problem given that many MongoDB workloads would have a large number of threads per core. This issue was blocking MongoDB’s goal of making WiredTiger the default storage engine in v3.2.

This story has a happy ending thanks to my colleague, Senior Technical Service Engineer Bruce Lucas. Bruce had initially uncovered the logjam and reported it to me; together, we overcame it without compromising any other workloads. Because Bruce’s mindset was not colored by the legacy of the original approach, he was able to provide the critical insight that paved the way for the solution, allowing WiredTiger to become the default storage engine in v3.2.

Why did threads have to wait?

It was the summer of 2015, and I was trying to eliminate the logjam Bruce had uncovered. I was working on improving performance by reducing thread contention, but I was making only incremental progress. Bruce, on the other hand, went in a very different direction. He questioned the very need for threads to wait before copying their payloads, and he set about writing a prototype to prove they didn’t have to.

We can't prohibit MongoDB from using a lot of threads -- it's designed to do that. Nor can we mandate that it only run on massively multi-core machines. Using mutexes was an absolute non-starter. Bruce knew those were dead ends, so he scrutinized the conceptual building blocks of the algorithm:

1. Claiming a place in a slot (which requires atomicity) can be decoupled from copying to it (which can be done in parallel).
2. Claiming a place in a slot atomically, which we call joining, can be done via compare-and-swap operations on an index variable.
3. That index is identical to the count of total bytes claimed, so we call it the join counter.
4. A slot can’t be written to the OS until all the threads that have joined it have actually performed their copies, so slots have to track when threads complete their copies.
5. Tracking bytes written can be done with atomic operations on a release counter. When release == join, a slot is ready to be written to the OS.
6. We must also track a slot’s state, so threads know when it is unavailable, joinable, ready to write to the OS, etc.

Individually, none of these items require threads to wait; the need arises from their interaction. In order to safely write a slot to the OS, for example, a thread has to determine that the slot’s state allows it, and that any thread that has claimed a spot in its buffer (join) has completed copying its data (release) -- and this three-component check must be done atomically.

The problem is that CPUs do not allow atomic operations involving more than two registers. (Footnote: Theoretically they could, but in practice none do.) If join, release, and state were tracked in separate variables, we could compare state with a READY_TO_WRITE value, or join with release, but not both at once.

Thus, to implement the atomic operations, a single register (variable) must be used to multiplex a slot’s state along with the bookkeeping about joined and released bytes. This is precisely the slot_state field described in part one.

It is tempting to allow threads to increment slot_state as they join and decrement it as they release, but item #2 on our list forbids it: slot_state must always point to the next free byte in the buffer. Allowing a thread to decrement slot_state before joins are complete would point slot_state at memory that was already claimed by other threads. Keeping an independent pointer into the buffer that only increments would solve that issue, but it would defeat atomicity.

In summary: the need for atomicity constrains us to using a single variable, and the need to track where threads can write means we cannot mingle increments and decrements. Therefore we must have two phases. In the join phase, threads claim space, but must then wait for the release phase to begin; in the release phase, threads write to the space they claimed and mark their bytes written.

An epiphany

If only we could maintain two separate counters for join and release, we could eliminate the need for threads to wait. We could let them write into the slot as soon as they received their write offset from their join operation.

But Bruce noticed something critical: these counters fit easily within 32 bits, so they could both fit inside an int64. We could logically split a single register into the pieces necessary to maintain all of the required information: the slot state and the two counters.

With this scheme, we can implement joins and releases with masking and bit-shifting, which Bruce cleaned up using a few macros:

// put together and pull apart two 32-bit counters from 64-bit slot state
#define JOINED_RELEASED(joined, released) (((joined)<<32) + (released))
#define JOINED(state) ((state)>>32)
#define RELEASED(state) ((int64_t)(int32_t)(state))

// a simple join:
old_state = slot->state;
offset = JOINED(old_state);
old_release = RELEASED(old_state);
new_state = JOINED_RELEASED(offset + my_size, old_release);
// Now atomically replace old_state with new_state (compare-and-swap),
// retrying from the top if another thread changed slot->state first
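To make the compare-and-swap step concrete, here is a minimal sketch of the join written against C11 atomics. It is not WiredTiger code: the `demo_slot` type and `demo_join` function are hypothetical names for illustration, the slot holds nothing but the packed state word, and the sketch assumes both counters stay well below 2^31.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack and unpack two 32-bit counters in one 64-bit state word. */
#define JOINED_RELEASED(joined, released) (((int64_t)(joined) << 32) + (released))
#define JOINED(state)   ((state) >> 32)
#define RELEASED(state) ((int64_t)(int32_t)(state))

/* Hypothetical slot type: only the packed state word is modelled. */
typedef struct {
    _Atomic int64_t state;
} demo_slot;

/*
 * Claim my_size bytes in the slot and return the offset at which the
 * caller may start copying. The loop retries whenever another thread
 * changed the state between our read and our compare-and-swap.
 */
static int64_t demo_join(demo_slot *slot, int32_t my_size)
{
    int64_t old_state, new_state, offset;

    assert(my_size > 0);
    old_state = atomic_load(&slot->state);
    do {
        offset = JOINED(old_state);
        new_state = JOINED_RELEASED(offset + my_size, RELEASED(old_state));
        /* On failure, old_state is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(&slot->state, &old_state, new_state));
    return offset;
}
```

Two successive joins of 100 and 28 bytes on an empty slot would return offsets 0 and 100, leaving the join counter at 128 and the release counter untouched.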

Bruce wrote a program demonstrating this method, not using any WiredTiger or MongoDB code, to test his idea. He simulated the multithreaded load and ran the numbers, and the results encouraged him to whip together a patch for WiredTiger. In proof-of-concept mode, he ignored all the details: rolling over journal files, records that are too large to store in a buffer, errors and timeouts that interrupt the flow of data, and more. Bruce skipped all that, but his patch was enough to prove that even in the context of a server doing lots of other processing, his optimization to the write-ahead log was substantial.

Without the need to wait for the join/release phase change, threads can claim a spot, write their payload, record their bytes written, and leave, without ever waiting. This implementation does away with the need for a "leader" thread and divvies up the responsibilities between two threads. When a thread performs a join that fills the buffer, it closes the slot and prepares a new one. When a thread’s release completes with no other pending writes and a full buffer, it writes the buffer to the OS.

Going from POC to production code

When Bruce sent me his first patch, I was hopeful but a little bit daunted! His solution attacked the problem from an angle that I had never even considered, but there were so many details that were not accounted for; it would be a lot of work to reconcile with the existing write-ahead logging code. But the performance improvements were so significant, it was clearly worth trying to make it work. I set about filling in the gaps. Bit by bit, over the next couple of weeks, I addressed the complexities. As I made my way down the list, my cautious optimism became out-and-out enthusiasm, until finally I had a fully realized write-ahead log implementation using the new method.

Together, the code for joining, copying and releasing now looks something like this:

/* Join my record size into the existing slot */
old_state = slot->state;
new_state = old_state + join_state(my_size, &my_offset);
/* Retry if we race on the atomic operation */
if (!atomic_cas(slot->state, old_state, new_state))
    go retry reading old_state;

/* Prepare a new buffer if this one is full */
if (my record fills buffer)
    close and switch slot

/* Copy my record */
memcpy(buffer + my_offset, my_record, my_size);

/* Release my size after copy */
old_state = slot->state;
new_state = old_state + release_state(my_size);
if (!atomic_cas(slot->state, old_state, new_state))
    go retry reading old_state;

/* If buffer is full and I’m the last to finish, write */
if (buffer is full and my release is the last one)
    write_buffer_to_OS();
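The outline above can be approximated in compilable form. The following is a simplified sketch under several assumptions, not the production implementation: all `demo_` names and the 256-byte buffer are hypothetical, a record that does not fit is simply rejected instead of triggering a slot switch, the buffer counts as full only when it is filled exactly, and the release uses `atomic_fetch_add` in place of the compare-and-swap retry (equivalent here because the release counter never overflows its 32 bits).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BUF_SIZE 256 /* hypothetical size, tiny for illustration */

typedef struct {
    _Atomic int64_t state; /* joined bytes in high 32 bits, released in low 32 */
    unsigned char buf[DEMO_BUF_SIZE];
} demo_slot;

static int64_t demo_joined(int64_t state)   { return state >> 32; }
static int64_t demo_released(int64_t state) { return (int64_t)(int32_t)state; }

/* Join: claim size bytes. Returns the offset, or -1 if the record does not fit. */
static int64_t demo_join(demo_slot *slot, int32_t size)
{
    int64_t old_state = atomic_load(&slot->state);
    int64_t offset;

    assert(size > 0);
    do {
        offset = demo_joined(old_state);
        if (offset + size > DEMO_BUF_SIZE)
            return -1;
    } while (!atomic_compare_exchange_weak(&slot->state, &old_state,
                                           old_state + ((int64_t)size << 32)));
    return offset;
}

/*
 * Release: record that size bytes have been copied. Returns true only for
 * the one thread whose release leaves the buffer both full and fully copied.
 */
static bool demo_release(demo_slot *slot, int32_t size)
{
    int64_t new_state = atomic_fetch_add(&slot->state, (int64_t)size) + size;

    return demo_joined(new_state) == DEMO_BUF_SIZE &&
           demo_released(new_state) == DEMO_BUF_SIZE;
}

/* Returns 1 if the caller should now write the buffer, 0 if not, -1 if no room. */
static int demo_log_write(demo_slot *slot, const void *rec, int32_t size)
{
    int64_t offset = demo_join(slot, size);

    if (offset < 0)
        return -1;
    memcpy(slot->buf + offset, rec, (size_t)size); /* no waiting before the copy */
    return demo_release(slot, size) ? 1 : 0;
}
```

Because the release counter reaches the buffer size exactly once, at most one caller sees a return value of 1, which is what lets the write responsibility fall to "the last to finish" without a designated leader.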

An important detail: idle systems

Because filling the slot’s buffer is the trigger to write the records to the OS, the algorithm works well with a steady flow of incoming records. But if a system goes idle, any records in the current unfilled buffer will sit unflushed until either enough writes come in to fill the buffer, or a write using j:true forces a sync. While technically the records in that buffer were written explicitly without durability guarantees, records should not remain unflushed while a system is idle! To address this, we added a 50-millisecond idle timeout that pushes the buffer to the OS, limiting how long a record is exposed to the risk of a process crash. In addition, MongoDB syncs to disk every 100 milliseconds to limit the risk of a system crash.
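One way to picture the idle check is as a small predicate evaluated periodically by a background thread. This is only an illustrative sketch: the post does not describe how the timer is implemented, so the function name, its parameters, and the choice to measure idleness from the last join are all assumptions. Only the 50-millisecond figure comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IDLE_TIMEOUT_MS 50 /* idle timeout quoted in the text */

/*
 * Hypothetical helper: decide whether a partially filled buffer should be
 * pushed to the OS because no new record has joined it recently.
 */
static bool demo_should_flush_idle(int64_t now_ms, int64_t last_join_ms,
                                   int64_t joined, int64_t released)
{
    assert(now_ms >= last_join_ms);
    if (joined == 0)
        return false; /* nothing buffered */
    if (released != joined)
        return false; /* a copy is still in progress */
    return now_ms - last_join_ms >= DEMO_IDLE_TIMEOUT_MS;
}
```

The `released != joined` guard reflects the rule stated earlier: a buffer cannot be written until every thread that joined it has finished copying.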

It's much, much faster

Measurements of the problematic workload against production code were very exciting:

We had nearly tripled performance of the journal algorithm, and had almost entirely eliminated the negative scaling at high thread counts without harming performance at low thread counts. The WiredTiger team has a standard suite of benchmarks, and some of those benefited more than others, but none were penalized by the changes.

Final thoughts

The impact of code changes at the storage layer often goes undetected, eclipsed by the overhead of the many layers above. Opportunities to make conspicuous, user-visible improvements like this are rare, and offer a particularly novel variety of job satisfaction.

Finally, optimizing code for a particular set of conditions does more than specialize the code -- it also specializes your thoughts. As your thoughts bore deeper and deeper into the problem space, they leave tracks, which become trails, and eventually paths, which your thoughts then naturally continue to follow. So when you encounter a need to make code suit a completely new environment, it helps to have a friend who isn't influenced by your preconceptions.
