For the last few weeks, we've seen a pretty constant but very small trickle of connection issues in our internal network. Application servers would complain about connections to databases timing out. Not all database servers were affected, but a fair chunk of them were, including some of the more important master databases. Nothing could explain these errors: no errors or packet loss were reported on the network side, and nothing showed up on the MySQL side either. So how do you debug this? This is how we (well, they; I wasn't personally involved) walked through it.
We have a metric buttload of graphs for our system. Even in imperial units it's a huge number. Most of them are stored in Graphite, but we also have our homegrown monitoring system. The latter includes a set of graphs for application errors. We can group these in various ways, but the useful one here was grouping by exact error message. This told us that the problem happened on many machines, but more importantly: there was a pattern! The problem would occur for a few seconds each hour. Not on all machines at the same time (so, not a cronjob), but still: every hour.
Ok, so we can now predict when the problem will recur. This is great if you have no ideas, because to get an idea you need to collect data. Our database servers handle an awful lot of traffic, and even if you manage to capture an error, finding it is ridiculously difficult if you can't narrow down your search window. So, we tcpdumped and caught one of the errors. Now the searching starts. What we found was that during these problems there was some spurious DNS traffic going to unassigned IP addresses. The plot thickens...
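For the curious, the narrowed-down capture boils down to something like this libpcap sketch, once DNS had become the suspect (we used plain tcpdump; the interface name and the port 53 filter here are illustrative, not the exact command we ran):

```c
/* Illustrative only: capture DNS traffic during the predicted problem
 * window so the dump stays small enough to search by hand. */
#include <pcap.h>
#include <stdio.h>

static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                      const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("DNS packet, %u bytes, at %ld.%06ld\n",
           hdr->len, (long)hdr->ts.tv_sec, (long)hdr->ts.tv_usec);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    /* "eth0" is a placeholder for whatever interface carries the traffic */
    pcap_t *handle = pcap_open_live("eth0", 65535, 0, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    struct bpf_program filter;
    /* Only DNS: everything else on a busy database box is noise here */
    if (pcap_compile(handle, &filter, "port 53", 1, 0) == -1 ||
        pcap_setfilter(handle, &filter) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(handle));
        return 1;
    }

    pcap_loop(handle, -1, on_packet, NULL);
    pcap_close(handle);
    return 0;
}
```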
Look at the post title. I gave away the solution, didn't I? Not all of it though, because what follows now is a tale of hysterical raisins, paranoia, high availability and glibc.
It is reasonably well known that glibc's resolver (nss_dns) never reloads resolv.conf. This misfeature means that you need to use a local caching daemon, such as nscd or dnsmasq, to make sure DNS actually works for you if you don't use the same nameserver 100% of the time. Or you patch all your applications...
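To make that concrete, here's a minimal sketch (not code from this incident, and example.com is just a placeholder): a long-running process resolving the same name in a loop. On glibc versions of that era, editing /etc/resolv.conf while it runs changes nothing; the process keeps querying the nameservers it read on its first lookup until it calls res_init() or gets restarted.

```c
/* Minimal sketch: glibc reads resolv.conf once and keeps using it.
 * Run this, then change the nameservers in /etc/resolv.conf: the
 * lookups below keep going to the old servers. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    for (;;) {
        struct addrinfo hints, *res = NULL;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;
        hints.ai_socktype = SOCK_STREAM;

        /* example.com is a placeholder hostname */
        int rc = getaddrinfo("example.com", NULL, &hints, &res);
        if (rc == 0) {
            char ip[INET_ADDRSTRLEN];
            struct sockaddr_in *sa = (struct sockaddr_in *)res->ai_addr;
            inet_ntop(AF_INET, &sa->sin_addr, ip, sizeof(ip));
            printf("resolved to %s\n", ip);
            freeaddrinfo(res);
        } else {
            printf("lookup failed: %s\n", gai_strerror(rc));
        }
        sleep(5);
    }
}
```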
In a datacenter this is usually much less important than on a desktop or laptop, as nameservers rarely change. We still use nscd, though, because MySQL versions prior to 5.6 (not yet GA) have a ridiculously small host cache, which effectively means a PTR lookup for every incoming connection. nscd has its own problems though: on CentOS 4, for instance, nscd was rather buggy, so we had to make it restart itself every hour. Yes, that very same hour as the interval between problems.
So what happens when nscd restarts? You fall back to nss_dns. While that's the case, these PTR queries go to unreachable IP addresses and time out after 5 seconds. But because our application gives up after 2, the DNS timeout itself wasn't really visible: the application only saw that MySQL failed to respond promptly.
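The mismatch is easy to see with a stopwatch. Here's a hedged sketch (not our code) that times a single reverse lookup, the kind mysqld ends up doing for an incoming connection when nscd isn't around; 192.0.2.1 is a documentation address standing in for a client IP. With resolv.conf pointing at an unreachable nameserver, the call blocks for roughly five seconds (glibc's default per-query timeout), while the client has already given up at two.

```c
#define _GNU_SOURCE
/* Sketch: time one PTR lookup, the kind mysqld does per connection
 * when its host cache misses and nscd isn't there to answer. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    /* 192.0.2.1 is a documentation address, standing in for a client IP */
    inet_pton(AF_INET, "192.0.2.1", &sa.sin_addr);

    char host[NI_MAXHOST];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = getnameinfo((struct sockaddr *)&sa, sizeof(sa),
                         host, sizeof(host), NULL, 0, NI_NAMEREQD);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("PTR lookup %s after %.1f seconds\n",
           rc == 0 ? host : gai_strerror(rc), elapsed);
    return 0;
}
```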
So why was falling back to nss_dns bad? And why wasn't this a problem until a few weeks ago? Ironically, this was caused by switching to highly available nameservers. We strongly believe that we need to be able to take any machine out for maintenance, including reboots. So for DNS, we switched to clusters of machines managed by pacemaker/corosync. And since turning a "normal" address into a highly available VIP with no downtime is nigh on impossible, many DNS server IP addresses had to change. A lot of services on each box were restarted, but MySQL didn't seem to need that, which meant mysqld was still holding on to the old, now-unassigned nameserver addresses.
So there's the missing piece of the puzzle made from bugs:
- Use nscd to work around a mysql limitation
- Restart nscd regularly to work around an nscd bug
- Switch to HA clusters for DNS servers, fixing a bug in our infrastructure
- Forget glibc's broken resolver behaviour as it's not really visible
A perfect storm of bugs, and nasty to track down. Long story short: if you change a DNS server's IP, you'd better reboot your datacenter. In other words, don't change DNS server IPs.
It's never lupus
Except when it is. Like House never accepting a diagnosis of lupus, network engineers and sysadmins tend not to believe there are network and DNS problems. This is actually very reasonable, given that most problems that appear to be network or DNS problems are really caused by misbehaving applications. But there's always that one episode where you finally do have a case of lupus. Or do you? Is this really a DNS problem...?