For the last few weeks, we've seen a constant but very small trickle of connection issues in our internal network. Application servers would complain about connections to databases timing out. Not all database servers were affected, but a fair chunk of them were, including some of the more important master databases. Nothing could explain these errors: no errors or packet loss were reported on the network side, and no errors were seen on the MySQL side. So how do you debug this? This is how we (well, they; I wasn't involved personally) walked through it.
Graph everything
We have a metric buttload of graphs for our system. Even in imperial units it's a huge number. Most of them are stored in graphite, but we also have our homegrown monitoring system. The latter includes a set of graphs for application errors. We can group these in various ways, but the useful one here was grouping them by exact error message. This told us that the problem happened on many machines, but more importantly: there was a pattern! The problem would happen for a few seconds each hour. Not on all machines at the same time (so, not a cronjob), but still: every hour.
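To illustrate the kind of grouping that exposes such a pattern, here is a minimal Python sketch. It is not our monitoring system: the `(host, timestamp)` event format, the function names and the 80% threshold are all assumptions made for the example. It buckets errors per host by minute of the hour; if nearly all errors of a host land in the same minute, something hourly is going on, even when every host has a different minute.

```python
from collections import Counter, defaultdict
from datetime import datetime


def minute_histogram(events):
    """Count errors per host, bucketed by minute of the hour.

    `events` is an iterable of (host, datetime) pairs (hypothetical format).
    """
    hist = defaultdict(Counter)
    for host, ts in events:
        hist[host][ts.minute] += 1
    return hist


def dominant_minute(counter, threshold=0.8):
    """Return the minute holding at least `threshold` of all errors, else None."""
    total = sum(counter.values())
    if total == 0:
        return None
    minute, count = counter.most_common(1)[0]
    return minute if count / total >= threshold else None


if __name__ == "__main__":
    # Made-up sample data: db1 fails at minute 17 of every hour.
    sample = [("db1", datetime(2012, 10, 1, hour, 17, 3)) for hour in range(6)]
    for host, counter in minute_histogram(sample).items():
        print(host, dominant_minute(counter))
```

A host whose errors are spread evenly over the hour yields `None`, which is what you would expect from random network trouble rather than a scheduled event.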
tcpdump
Ok, so we can now predict when the problem will reoccur. This is great if you have no ideas, because to get an idea you need to collect data. Our database servers do an awful lot of traffic, and even if you manage to capture an error, finding it is ridiculously difficult if you can't narrow down your search window. So we tcpdumped and caught one of the errors. Then the searching started. What we found was that during these problems there was some spurious DNS traffic going to unassigned IP addresses. The plot thickens...
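Searching a capture for this kind of traffic can be scripted. The sketch below is only an illustration, not what was actually used: it assumes text output resembling `tcpdump -nn` for IPv4 (lines containing `IP src.port > dst.port:`), and the function name and the set of known nameservers are hypothetical. It counts packets sent to port 53 on any address that is not a known nameserver.

```python
import re
from collections import Counter

# Matches the "IP a.b.c.d.sport > e.f.g.h.dport:" part of a tcpdump -nn line.
# Assumption: IPv4 only, numeric addresses and ports.
LINE_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+):"
)


def stray_dns_destinations(lines, known_nameservers):
    """Count packets to port 53 whose destination is not a known nameserver."""
    stray = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or match.group("dport") != "53":
            continue  # not parseable, or not a DNS query
        dst = match.group("dst")
        if dst not in known_nameservers:
            stray[dst] += 1
    return stray


if __name__ == "__main__":
    # Made-up capture lines using documentation address ranges.
    capture = [
        "12:00:01.000001 IP 192.0.2.10.51234 > 192.0.2.53.53: 4660+ PTR? 10.2.0.192.in-addr.arpa. (41)",
        "12:00:01.000002 IP 192.0.2.10.51235 > 198.51.100.7.53: 4661+ PTR? 11.2.0.192.in-addr.arpa. (41)",
    ]
    print(stray_dns_destinations(capture, {"192.0.2.53"}))
```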
DNS
Look at the post title. I gave away the solution didn't I? Not all of it though, because what follows now is a tale of hysterical raisins, paranoia, high availability and glibc.
It is reasonably well known that glibc's resolver (nss_dns) never reloads resolv.conf. This misfeature means that you need to use a local caching daemon, such as nscd or dnsmasq, to make sure DNS actually works for you if you don't use the same nameserver 100% of the time. Or you patch all your applications...
In a datacenter this usually is much less important than on a desktop or laptop, as nameservers rarely change. We still use nscd though, because mysql versions prior to 5.6 (not yet GA) have a ridiculously small host cache, which effectively means that it does a PTR lookup for every incoming connection. nscd has its own problems though: in CentOS 4 for instance, nscd was rather buggy so we had to make it restart itself every hour. Yes, that very same hour as the interval between problems.
So what happens when nscd restarts? While it is down, lookups fall back to nss_dns, which still uses the nameservers it read from resolv.conf when the process started. When that happens, these PTR queries go to unreachable IP addresses, timing out after 5 seconds. But because our application times out after 2 seconds, this wasn't really visible. The application only saw that MySQL failed to respond promptly.
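Since the resolver described above only reads resolv.conf once, any process started before the file last changed may still be holding old nameserver addresses. The following Python sketch shows one way to reason about that. It is an assumption-laden illustration: the function names are made up, and the caller has to supply the process start times and the file's modification time (for example from `os.stat("/etc/resolv.conf").st_mtime`).

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with '#' or ';' and never match "nameserver".
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def stale_processes(processes, resolv_conf_mtime):
    """Return names of processes started before resolv.conf last changed.

    `processes` is an iterable of (name, start_time) pairs, with times in
    seconds since the epoch (hypothetical input format).
    """
    return sorted(
        name for name, started in processes if started < resolv_conf_mtime
    )


if __name__ == "__main__":
    text = "search example.com\nnameserver 192.0.2.53\nnameserver 192.0.2.54\n"
    print(parse_nameservers(text))
    # mysqld started before the file changed at t=200, httpd after.
    print(stale_processes([("mysqld", 100), ("httpd", 300)], 200))
```

Anything this reports is a candidate for a restart after a nameserver change, which is exactly what mysql did not get in our case.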
High availability
So why was falling back to nss_dns bad? And why wasn't this a problem until a few weeks ago? Ironically, this was caused by switching to highly available nameservers. We strongly believe that we need to be able to take any machine out for maintenance, including reboots. So for DNS, we switched to using clusters of machines using pacemaker/corosync. And since turning a "normal" address into a highly available vip with no downtime is nigh on impossible, many DNS server IP addresses had to change. A lot of services on each box were restarted, but mysql didn't seem to need that.
So there's the missing piece of the puzzle made from bugs:
- Use nscd to work around a mysql limitation
- Restart nscd regularly to work around an nscd bug
- Switch to HA clusters for DNS servers, fixing a bug in our infrastructure
- Forget glibc's broken resolver behaviour as it's not really visible
A perfect storm of bugs, and nasty to track down. Long story short: if you change a DNS server's IP, you'd better reboot your datacenter. In other words, don't change DNS server IPs.
It's never lupus
Except when it is. Like House never accepting a diagnosis of lupus, network engineers and sysadmins have a tendency to not believe there are network and dns problems. This is actually very reasonable, given that most problems that appear as network or dns problems are caused by misbehaving applications, causing problems that look like network or dns problems. But there's always the one episode where you finally have a case of lupus. Or do you? Is this really a DNS problem...?
henkjan on 10/16/2012 11:09 a.m. #
maybe add 'skip-name-resolve' to your mysql config files. (no reason to do all those lookups)
mysql> show variables like '%resolve%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| skip_name_resolve | ON    |
+-------------------+-------+
sadig on 10/17/2012 10:15 a.m. #
As Henkjan said,
skip_name_resolve can sometimes be a performance boost.
And regarding the HA cluster:
I would go more for ipvs/ldirectord loadbalancing DNS servers, instead of pacemaker/corosync.
Reason:
Normally you have more requests to your DNS and you want them loadbalanced.
e.g. you run your DNS on VMs (which can be a good idea, because you can spread them across several HVs) so ipvs/ldirectord will balance them (round-robin, for example). Furthermore, you add checks to ldirectord that query the real DNS servers directly, and as long as this check succeeds it will keep that server in the loadbalancing.
Now, when you want to update/upgrade one of the real DNS servers, you just bring down the DNS service. The ldirectord check will fail and take this realserver out of the loadbalancing, so you can do your update/upgrade without service interruption. One of the advantages is that you don't have to deal with failover work via crm.
And another goodie, ipvs/ldirectord setups are more easily maintainable via puppet or chef ;)
Regards,
\sh
Dennis Kaarsemaker on 10/17/2012 9:37 p.m. #
skip_name_resolve only works if you have no grants with (parts of) hostnames in them, so it's not a very good solution.
lvs/ldirectord wouldn't have prevented this as we'd still have to change the IP addresses. As far as manageability goes, crm has been fully puppetized here and all health checks and failovers are automatic.