Breaking the WiredTiger Logjam: The Write-Ahead Log (1/2)

MongoDB

Code can't be optimized; it can only be optimized for a set of conditions. When conditions change, optimizations can become bottlenecks, and when that happens, a thorough audit of assumptions might well hold the key to the solution.

The WiredTiger write-ahead log exemplifies this principle. It’s a critical codepath within a high-performance storage engine, and I had optimized it heavily to avoid I/O and locking. But some of the conditions I had initially targeted became invalid when WiredTiger became a storage engine in MongoDB. When a colleague of mine investigated a case of negative scaling found during testing, he uncovered a serious bottleneck in the write-ahead log… call it a “logjam”. That investigation ultimately led us to rethink our assumptions and optimize for new conditions. We validated the new approach with a quick prototype, and then covered all the intricacies and edge cases to produce a fully realized solution.

In part one of this two-part series, I’ll dive deep into the innards of the WiredTiger write-ahead log. I’ll show how it orchestrates many threads writing to a single buffer without locking, and I’ll explain how two conflicts between that design and the new conditions produced the logjam. Part two will focus on how we eliminated the bottleneck. I’ll analyze its root causes, describe the key insight that enabled our solution, and detail the new algorithm and how it reflects our current conditions.

The write-ahead log

The write-ahead log is the durability feature that allows WiredTiger to survive a process or system crash. (MongoDB calls this the "journal".) Any thread writing data to WiredTiger first appends a record describing the write operation to the write-ahead log; in the event of a crash, any writes that were not persisted to the storage tables can be replayed from this log. The write-ahead log offers three durability modes. The strictest, “full-sync”, guarantees the record is flushed to disk before returning to the caller, allowing the data to survive a system crash. The second level, “write-only”, guarantees the record is written to the OS file system before returning, allowing the data to survive a process crash. The least durable is “no-sync”: the record is stored in an in-memory buffer, with no guarantee that it is immediately written to the file system.
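
To make the levels concrete, here is a minimal sketch in C of what each mode implies for the OS. The function log_append(), its signature, and the buffer handling are hypothetical, not the WiredTiger API.

```c
/*
 * A minimal sketch of the three durability levels.  The function name,
 * signature, and buffer handling here are hypothetical, not the WiredTiger
 * API.
 */
#include <string.h>
#include <unistd.h>

typedef enum { NO_SYNC, WRITE_ONLY, FULL_SYNC } durability_t;

static int
log_append(int log_fd, char *buf, size_t *buf_lenp,
    const void *rec, size_t rec_len, durability_t level)
{
	/* Every record lands in the in-memory buffer first. */
	memcpy(buf + *buf_lenp, rec, rec_len);
	*buf_lenp += rec_len;

	if (level == NO_SYNC)		/* Buffered only; no crash guarantee yet. */
		return (0);

	/* write-only: hand the buffer to the OS, surviving a process crash. */
	if (write(log_fd, buf, *buf_lenp) < 0)
		return (-1);
	*buf_lenp = 0;

	if (level == FULL_SYNC)		/* full-sync: survive a system crash too. */
		return (fsync(log_fd));
	return (0);
}
```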

Traditional write-ahead logs use mutexes to coordinate their writes. On single-core architectures this works well, but when more cores are available, it wastes a lot of time making threads wait for a mutex. To achieve better parallelization in our write-ahead log, I had adapted a design from a research paper, "Scalability of write-ahead logging on multicore and multisocket hardware", in the Very Large Databases Journal. Instead of writes being synced or written individually, many threads may copy their records into a single in-memory buffer, which can be written to the file system in a single call. The threads still have to wait for their desired level of durability to be achieved before they return, but the overall throughput of the write-ahead log is greater, because each thread isn’t making its own system call.

Instead of mutexes, atomic operations provide the isolation between writes, allowing all the threads to run without locking. This design could not work with a mutex controlling access to the in-memory buffer; that would just re-introduce the contention at a different layer of the call stack.

Traditional write-ahead logs also can’t offer a no-sync mode, but since WiredTiger already consolidates writes in memory, the ability to do so presented itself: no-sync simply lets threads return without waiting for their writes to be written or synced. In this way, a set of logically related writes could run much faster: client code could issue all but the last write using no-sync and finish with a write-only or full-sync call, as sketched below.
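
For example, a group of three related records might reuse the hypothetical log_append() from the sketch above, paying for only one write() and one fsync():

```c
/* Hypothetical usage: only the final record pays for the write and sync. */
log_append(log_fd, buf, &buf_len, &rec1, sizeof(rec1), NO_SYNC);
log_append(log_fd, buf, &buf_len, &rec2, sizeof(rec2), NO_SYNC);
log_append(log_fd, buf, &buf_len, &rec3, sizeof(rec3), FULL_SYNC);
```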

The first hint of trouble

In the summer of 2015, Bruce Lucas, a Senior Technical Service Engineer at MongoDB, was helping make WiredTiger the default storage engine in MongoDB version 3.2. Supporting customers requires a high degree of expertise with the innards of MongoDB, so he was running experiments to study how WiredTiger performed under a wide variety of circumstances. One of those experiments, a workload consisting of many small inserts issued from a high number of threads per core, uncovered a case of negative scaling.

Results from an AWS Linux box with 8 cores running the WiredTiger code from MongoDB 3.0.4 showed exactly that: as client threads were added, throughput fell rather than leveling off.

Using Poor Man's Profiler, which periodically invokes gdb to take full stack traces of all threads, Bruce was able to discern the bottleneck: busy-waits in WiredTiger's write-ahead log. As he dug deeper, he uncovered two mismatches between the design of the write-ahead log and the conditions under which it would run as a storage engine for MongoDB.

Condition mismatch #1: durability level

I first designed the write-ahead log before WiredTiger joined MongoDB, and I assumed that most applications using a write-ahead log would use write-only or possibly full-sync mode. This was the typical use case for customers at that time, who were using the WiredTiger library’s key-value store directly.

MongoDB, however, is exactly the kind of system that benefits from issuing a series of related no-sync writes concluding with a full-sync write. WiredTiger structures data as key/value pairs within tables, and those tables are what all of MongoDB’s data structures are built upon: collections, the oplog, indexes, and various internal MongoDB meta-data are all WiredTiger tables. A single write to a MongoDB document results in many write calls into several WiredTiger tables, and from the client’s perspective, unless all of them persist, none of them matter. So even when a call to MongoDB specifies that the journal must be synced to disk before the write returns, MongoDB can use the highly efficient no-sync mode for all but the last call. That now makes "no-sync" by far the most common case.

Looking deeper: the WiredTiger log slot

The prevalence of no-sync calls now puts WiredTiger’s write consolidation mechanism front and center. At its heart is a data structure called a "slot," using the terminology from the research paper. A slot encapsulates an in-memory buffer, related meta-data, and a special int64_t field called “slot_state”, which counts the number of bytes claimed in the buffer. We reserve the lower integers in slot_state for internal state values, the highest of which is called READY. Threads first join a slot that is READY, claiming a spot in its buffer with an atomic operation on slot_state; when the slot is closed to further joins, they all copy their data and release their claim with another atomic operation on slot_state.
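
A simplified C sketch of the slot might look like the following. Only slot_state and READY come from the description above; the other field names, the sizes, and the use of C11 atomics are assumptions, and the sketch collapses READY to zero so that slot_state is simply the byte count once threads start joining (the real WiredTiger constants and states differ).

```c
#include <stdatomic.h>
#include <stdint.h>

#define SLOT_BUF_SIZE	(256 * 1024)	/* Hypothetical buffer size. */

/*
 * The real code reserves a range of low slot_state values for internal
 * states, READY being the highest; this sketch simplifies READY to zero.
 */
#define SLOT_READY	0

typedef struct {
	_Atomic int64_t slot_state;	/* State value, or bytes claimed so far. */
	int64_t write_size;		/* Total bytes to flush; set at close. */
	int fd;				/* Log file this buffer flushes to. */
	uint8_t buf[SLOT_BUF_SIZE];	/* In-memory consolidation buffer. */
} log_slot_t;
```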

WiredTiger maintains a pool of these slots. To write a record into the log, a thread attempts to join the currently active slot if it is in the READY state. The first step is to determine whether the record fits in the remaining buffer space by adding the record size to the current value of slot_state. If the total fits within the buffer, the thread performs an atomic compare-and-swap of slot_state with the new total. On success, the prior value of slot_state is this thread’s offset into the buffer, and the new value is the offset for the next thread to join (or the final size of the buffer).
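
The join step, continuing the simplified sketch above (C11 atomics, READY collapsed to zero), might look like this; the real code differs in its details.

```c
#include <stdbool.h>

/*
 * Try to claim rec_len bytes in the active slot.  On success, *offsetp is
 * this thread's offset into the buffer; whoever gets offset 0 is the leader.
 */
static bool
slot_join(log_slot_t *slot, int64_t rec_len, int64_t *offsetp)
{
	int64_t old_state, new_state;

	old_state = atomic_load(&slot->slot_state);
	if (old_state < SLOT_READY)		/* Slot is closed to joins. */
		return (false);
	new_state = old_state + rec_len;
	if (new_state > SLOT_BUF_SIZE)		/* Record doesn't fit. */
		return (false);

	/* The prior value of slot_state becomes this thread's offset. */
	if (!atomic_compare_exchange_strong(&slot->slot_state,
	    &old_state, new_state))
		return (false);			/* Lost the race; caller retries. */

	*offsetp = old_state;
	return (true);
}
```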

The thread doesn’t write its payload yet! That write, along with the associated bookkeeping, has to wait until the slot is closed to other joining threads, a task that belongs to the leader thread. So for now, it just waits and watches slot_state for that closed indication.

A joining thread that gets an offset of zero is designated the leader. It doesn’t sit around waiting for the slot to close; closing it is the leader’s job. But first it prepares a new slot from the pool, to ensure one will be ready for the subsequent round of joins and writes. This is the one task in the entire process that requires a lock, and when it’s over, the leader atomically inverts the sign on slot_state. A negative slot_state indicates that the slot is closed to further joins, which is the sign the other writers have been waiting for. They already know where their data belongs in the buffer, so they can copy in parallel. Job done, they release the slot by atomically adding their record size to slot_state, which is now the negative of the number of bytes left to copy into the buffer. When slot_state reaches zero, all copies are complete and the slot is fully prepared for writing. The last thread to release the slot sees this zero and takes charge of writing the buffer to the OS, owing nothing to the rest of the threads, which have already gone on their merry way.
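
In the same simplified sketch, closing and releasing might look like this. The leader’s slot-pool preparation (the one locked step), error handling, and the durability follow-up described earlier are all omitted, and write_size is an assumed bookkeeping field.

```c
#include <string.h>
#include <unistd.h>

/* Leader only: after readying a replacement slot, close this one to joins. */
static int64_t
slot_close(log_slot_t *slot)
{
	int64_t joined;

	/* Flip the sign; a negative state tells waiters they may copy. */
	joined = atomic_load(&slot->slot_state);
	while (!atomic_compare_exchange_weak(&slot->slot_state,
	    &joined, -joined))
		;				/* Retry if more joins slipped in. */
	slot->write_size = joined;		/* Total bytes in this buffer. */
	return (joined);
}

/*
 * Every joined thread, leader included: wait for the close, copy the record
 * at the claimed offset, then release the claim.
 */
static void
slot_copy_and_release(log_slot_t *slot, int64_t offset,
    const void *rec, int64_t rec_len)
{
	/* Busy-wait for the leader's sign flip before copying. */
	while (atomic_load(&slot->slot_state) > 0)
		;

	memcpy(slot->buf + offset, rec, (size_t)rec_len);

	/*
	 * Releasing moves the negative count back toward zero; the thread
	 * that lands on zero knows every copy is done and writes the whole
	 * buffer to the OS.
	 */
	if (atomic_fetch_add(&slot->slot_state, rec_len) + rec_len == 0) {
		if (write(slot->fd, slot->buf, (size_t)slot->write_size) < 0)
			return;			/* Error handling omitted. */
	}
}
```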

Condition mismatch #2: about as many cores as threads

The original algorithm is elegant and it works beautifully… as long as you have at most a few threads per core. This was another assumption that I made in my original design, based on the expectation that WiredTiger would be embedded in specialized applications, not a general-purpose database.

The availability of cores is important to threads that are waiting to copy records to the buffer, because they have to busy-wait until slot_state changes, consuming CPU cycles. Since I expected that the state would be updated very quickly and that there would be plenty of CPU available for the check, a busy-wait was safe when I wrote it. But MongoDB uses a different thread for every client connection. Some workloads, like those that perform a lot of tiny writes, create many more threads than cores.
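
The wait inside the release path above is the trouble spot. With the sched_yield call discussed in the next section, it looks roughly like the loop below; again, this is the simplified sketch, not the actual WiredTiger code.

```c
#include <sched.h>

/*
 * Wait for the leader to close the slot.  With a few threads per core the
 * sign flip arrives almost immediately; with hundreds of client threads,
 * the waiters crowd the run queue and the leader struggles to get CPU time.
 */
static void
slot_wait_closed(log_slot_t *slot)
{
	while (atomic_load(&slot->slot_state) > 0)
		sched_yield();
}
```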

The bottleneck

Individually, neither condition mismatch would have been a disaster. But when there are many threads and few cores, and most of those threads are busy-waiting on a state change, we get a classic case of priority inversion. The thread in charge of managing slot_state and the slot pool ought to be top priority, but all the threads that have joined a slot and are waiting for a change in slot_state have overwhelmed the CPUs. These threads do call sched_yield between checks, but with so many runnable threads, yielding does little on Linux to get the leader scheduled.

Stay tuned for an entirely new approach

We couldn’t just replace yielding with a mutex, because that would kill performance under all loads. We needed to consider new optimizations based on these new conditions. It turned out that Bruce would be the one to crack the nut. While I was optimizing within the framework of the existing code, Bruce went off and experimented. Unencumbered by the concerns of the original design, he tried a completely different approach. He came up with a stand-alone prototype that held the promise of a massive improvement to our "no-sync" case. It was completely bare-bones — it didn’t even use any MongoDB or WiredTiger code — but it was so compelling that I had to try it out for real.

Join us next time for part 2, when we break that logjam!