When Switching Projects, Check your Assumptions or Risk Disaster
On January 10, I released a badly broken version of the MongoDB C Driver, libmongoc 1.5.2. For most users, that version could not connect to a server at all! Luckily, in under 24 hours a developer reported the bug, I reverted the mistake and released a fix. Although it was resolved before it did any damage, this is among the most dramatic mistakes I've made since I switched from the PyMongo team to libmongoc almost two years ago. My error stemmed from three mistaken assumptions I've had ever since I changed projects. What were they?
Inception
Here's how the story began. In December, a libmongoc user named Alexey pointed out a longstanding limitation: it would only resolve hostnames to IPv4 addresses. Even if IPv6 address records existed for a hostname, the driver would not look them up -- when it called
getaddrinfo
on the hostname to do the DNS resolution, it passed
AF_INET
as the address family, precluding anything but IPv4. So if you
passed the URI
mongodb://example.com
, libmongoc resolved "example.com" to an IPv4 address like
93.184.216.34
and tried to connect to it. If the connection timed out, the driver gave up.
The driver did support IPv6 connections, but the only way to use them was to pass in an IPv6 address like this one:
mongodb://[2606:2800:220:1:248:1893:25c8:1946]
...which makes the driver call
getaddrinfo
with the raw address string and
AF_INET6
, resulting in an IPv6 address. To fully support IPv6, libmongoc should call
getaddrinfo
with
AF_UNSPEC
instead. Then
getaddrinfo
would return a list of both IPv6 and IPv4 addresses for "example.com", to which the driver should try to connect, each in turn, until a connection succeeds.
Let's look at some code and see why the driver didn't yet achieve
this standard
.
Here is where the address family, either IPv4 or IPv6, is determined in the code that parses the
MongoDB URI
:
const char *host;
/* extract the host substring of a URI */
host = mongoc_uri_parse_hostname (uri);
if (strchr (host, ':')) {
family = AF_INET6; /* means "IPv6 only" */
} else {
family = AF_INET; /* means "IPv4 only" */
}
You can see that if the driver finds any ":" characters in the host string it sets the family to IPv6, for which ":" is a required component of an address. Otherwise it uses IPv4, where the “:” is prohibited in both hostnames and addresses.
Later, in a different function, in a different file, the address family is used for DNS resolution:
/* host is "example.com", port is 27017,
* address_family came from the code above
*/
int mongoc_connect (const char *host,
uint16_t port,
int address_family)
{
int sock;
struct addrinfo hints = { 0 };
struct addrinfo *result;
hints.ai_family = address_family;
/* hostname resolution:
* get a list of IP addresses for "host"
*/
getaddrinfo (host, port, &hints, &result);
/* connect to first IP address */
sock = socket (result[0].ai_family,
result[0].ai_socktype,
result[0].ai_protocol);
if (mongoc_socket_connect (sock) == -1) {
/* give up */
return -1;
}
return sock;
}
As Alexey pointed out, this logic prevented us from translating a hostname like "example.com" into an IPv6 address. He submitted
a one-line patch
for the URI-parsing code:
if (strchr (host, ':')) {
family = AF_INET6;
} else {
- family = AF_INET; /* means "IPv4 only" */
+ family = AF_UNSPEC; /* means "any family" */
}
This change looked good to me: given a hostname, libmongoc should try to connect over IPv6, IPv4, or whatever
getaddrinfo
gives us. Hannes, libmongoc's other maintainer, also reviewed the patch. It was simple and it passed our tests, so we accepted the change and decided to include it in our upcoming release, version 1.5.2.
This one-line change broke the driver.
Assumption one: Reality matches imagination
Just from looking at Alexey's patch, I might have realized that it had only accomplished half an improvement. It switched the address family from
AF_INET
to
AF_UNSPEC
, so now the driver would resolve a hostname to a list of IPv6 and IPv4 addresses. But, it did not try connecting to each address until one succeeded; it only tried the first. If the first address was IPv6 and the MongoDB server was only listening on IPv4, the whole procedure failed.
I assumed that libmongoc already implemented a loop to try each address in turn, because my understanding of libmongoc was biased from time spent developing PyMongo's code. I knew that, from its early days, PyMongo implemented the following loop:
# Resolve IPv6 and IPv4 addresses, try each until one succeeds.
for result in getaddrinfo(host, port, AF_UNSPEC, SOCK_STREAM):
family, socktype, proto, canonname, sockaddr = result
try:
sock = socket.socket(family, socktype, proto)
sock.settimeout(connect_timeout)
sock.connect(sockaddr)
# No exception was thrown, success.
return sock
except socket.error as e:
pass
raise Exception("couldn't connect")
(I am omitting many details, of course.)
PyMongo had "always" worked this way, as far as I knew. You passed it a hostname and PyMongo resolved it to a list of addresses, then tried to connect to each address until one succeeded. I imagined that libmongoc must, too.
Imagining code is a necessary step when you take over a large project—you will never read all the code. But if you do not distinguish code you imagined from the code you've seen with your own eyes, then you become prone to making decisions without evidence. When I reviewed Alexey's patch, I only saw a few lines surrounding the change. I imagined that a loop similar to PyMongo's was in libmongoc, in a different function in another file, and in my complacency, I never verified that what I imagined was truly so.
Maybe it's because we love technology so much, when many of us imagine a machine we assume each part of it has been fully and beautifully developed. William Gibson, a founding cyberpunk writer, said:
It wasn't until I could finally afford a computer of my own that I found out there's a drive mechanism inside — this little thing that spins around. I'd been expecting an exotic crystalline thing, a cyberspace deck or something, and what I got was a little piece of a Victorian engine that made noises like a scratchy old record player.
I notice this kind of imagination among programmers all the time. An intern asked me last year, "Where are all the MongoDB error codes documented?" It didn't occur to her that we haven't yet documented them all. Or, a colleague asked me the other day, "What philosophy governs which MongoDB drivers implement which features?" He assumed a profound principle where, in many cases, the answer was that not all drivers are completed. When we sketch in our minds how something unfamiliar must have been done, we don't sketch some parts as unfinished—we imagine a completed whole.
I had assumed so confidently that libmongoc worked the way I imagined, that I never verified it. The best defense against this assumption is to remember which code you've imagined and which you've actually read.
Assumption two: Ordinary features don't need special tests
I consider IPv4 very ordinary, and in contrast I think IPv6 is exotic. I've been obsessed with ensuring the latter is tested thoroughly.
Last year, before the events of this story, I had said, "Let's make sure we run MongoDB with IPv6 enabled in
Evergreen
so our libmongoc tests always exercise the IPv6 code." On every one of our menagerie of supported platforms, I made sure we started the MongoDB server with --ipv6: Mac, Windows, Linux, ARM processors, MinGW, Solaris, IBM s390x mainframes, and on and on. Then, for each platform, I added a test that connects to localhost by its IPv6 address, mongodb://[::1].
"That takes care of that," I thought.
Unfortunately I'd just punched a different hole in the test matrix. We could introduce a bug in IPv4 support and never know it. I'd assumed that since IPv4 is so ordinary, it needed no special test. It would be immediately obvious if it stopped working.
Do you remember
when Richard Feynman demonstrated
, with a piece of rubber and a cup of ice water, that the Challenger shuttle had been destroyed because its O-rings were too stiff to ensure a seal when they got cold? (Did you know that
it was astronaut Sally Ride who first uncovered this flaw
? She slipped the evidence in secret to another scientist, to avoid reprisal from NASA.) There are many painful lessons from that accident, but a big one was this: NASA engineers were so obsessed with the extreme heat their O-rings must withstand, they ignored the chill of a typical winter morning on Earth. Like me, they had guarded against the exotic threat, but it was the ordinary one that led to destruction.
When I merged the IPv6 patch I was oblivious to the threat of breaking IPv4 support. Everything seemed fine: On my machine and Hannes's, the driver kept working. All the tests passed. We released libmongoc 1.5.2 and celebrated with some whiskey. (I'm in New York and Hannes is in Palo Alto, so we had to toast each other virtually.)
But all was not well. Remi Collet, a developer at RedHat who maintains the libmongoc package in Fedora, tested our release a few hours after we announced it. On his test rig, Remi does not run MongoDB with IPv6 enabled. He immediately found that libmongoc 1.5.2 cannot connect to
mongodb://localhost
if MongoDB is only listening on IPv4. He filed
CDRIVER-1988
and I diagnosed the problem soon after. Although 1.5.2 tries IPv6 first, if it can't connect, it doesn't fall back to IPv4. It was obvious, but it required someone outside our team to notice it.
I had assumed that if I broke such an ordinary feature the bug would not go undetected. The best defenses against this assumption are working with an outsider with an independent perspective, and thorough testing for even the most ordinary features.
Assumption three: Any bug that can be fixed, should be fixed
When I took over libmongoc more than a year ago, I was accustomed to my previous, luxurious position on the PyMongo team. PyMongo is mature: it's been well-staffed for years and it has few open bugs, if any. Therefore, whenever a bug is reported, it is usually a recent bug. If someone finds a bug in PyMongo 3.2, we probably introduced the bug in 3.2 when we implemented some new feature. We can diagnose it easily and release a fix in PyMongo 3.2.1, returning our bug count to zero.
Since 2015, we have typically resolved PyMongo bugs within weeks:
But libmongoc is not like PyMongo. It's less mature and, until recently, had only a fraction of the programmer-hours devoted to it. Besides, as everyone knows, C is harder than Python. Therefore, there were periods when the average age of the issues we resolved was more than two months.
Hannes and I have made steady headway reducing the libmongoc bug count and resolving its urgent bugs. Some old bugs still linger, however, and I have to be disciplined about when to fix them. We cannot squash every known bug with every release. The benefit of fixing each bug must be assessed against the risk of creating a worse one.
The other adjustment I had to make was in my understanding of
Semantic Versioning
. When I worked on PyMongo, I understood it like this:
This is too ambitious for libmongoc. Rather than fixing all the bugs in every patch release, my new policy is: "Fix new bugs. Leave other bug fixes for the next minor release." Here's my adjusted understanding:
Patch releases should be minimally risky. With few exceptions, they should repair some part of the code that we unwittingly broke while adding features in the last minor version. To put it more bureaucratically, a patch release should fix only "regressions."
When Alexey's pull request came in, I forgot my own policy: I included his change in a patch release. The bug it fixed wasn't new; we should have waited for 1.6.0. We might still have released a broken driver then, but the risk was more appropriate for a minor version bump. Users have the right to expect we'd minimize risk in a patch release.
It's as if I'd made a beautiful stack of pebbles and, just before I finished, my ambition got the better of me and I tried to balance just one more tiny pebble at the apex. This was the moment of failure. This is when I turned my back on my precariously balanced tower and,
seconds later, it collapsed
. This is when I screwed up the libmongoc 1.5.2 release.
I assumed that we should fix every known bug as soon as possible. The best defense against this assumption is to be disciplined: put off most changes for the next minor version.
Redemption
I made a significant mistake in releasing libmongoc 1.5.2, but Remi Collet's prompt bug report saved us. It helped that Remi had an independent perspective. It helped, too, that I've invested time improving our release automation over the last few months, so it only took minutes to revert the buggy patch and publish libmongoc 1.5.3. With the wisdom gained from this painful episode, I'm less likely to fall prey to my mistaken assumptions again.
March 30, 2017