The Lone Coder: Reflections for the Unsung Linux Saviours
by Ken O. Burtch
My Daily WTF: The MySQL Outage
"Mistakes are part of the dues one pays for a full life."
-- Sophia Loren as quoted at Quote Garden
This month's article was supposed to be on
test-driven development. But it's not ready yet. So here's a story
about architecture.
Being an IT architect is a tough job because you will
always be opposed by the developers. A coworker observed the other
day that "There are two kinds of architects. One kind that's paid and
does nothing. The other kind steps in where he doesn't belong and
blocks projects for no reason."
The perspective from the architect's position is a
little different. He's there to bring his intelligence, experience
and overall vision of a company's systems to bear on
problems that cross multiple teams.
When I was an architect, I spent a lot of
my time trying to get people to slow down. This was intentional.
I would bring representatives of four or five teams into a room and
say, "OK, explain the problem to me." By making the developers slow
down and work through the problem, it soon became apparent to everybody
where there were misunderstandings. In companies, people are often
rushing. Making people take time built respect between
the teams and got to the root of problems.
But sometimes slowing down is the wrong thing to do,
and I made a bad mistake once.
I came in one day to find a database server had failed.
MySQL uses reverse domain lookups when a connection logs in. When
you set up an account on MySQL, you have to specify both a username
and a hostname or IP number. During a login attempt, the database
engine needs to determine where you're logging in from in order to
determine which account you are using. So it does a reverse domain
lookup to get the hostname for the other computer's IP number.
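To make the lookup concrete, here is a minimal Python sketch of the
idea. It is not MySQL's actual code: the function names
reverse_lookup and find_account, the account list and the two
stand-in resolvers are all made up for illustration, and real account
matching has more rules than this.

```python
import socket

def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the hostname registered for an IP, or None if DNS can't answer."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None

def find_account(user, ip, accounts, resolver=socket.gethostbyaddr):
    """Pick the (user, host) account that matches this login, or None."""
    hostname = reverse_lookup(ip, resolver)
    for acct_user, acct_host in accounts:
        # An account's host part may be written as a hostname or an IP number.
        if acct_user == user and acct_host in (ip, hostname):
            return (acct_user, acct_host)
    return None

# Hypothetical accounts, plus fake resolvers so the sketch runs without a network.
ACCOUNTS = [("web", "app1.example.com"), ("admin", "10.0.0.5")]

def dns_working(ip):
    return ("app1.example.com", [], [ip])

def dns_down(ip):
    raise socket.herror(1, "Unknown host")

print(find_account("web", "10.0.0.7", ACCOUNTS, dns_working))  # ('web', 'app1.example.com')
print(find_account("web", "10.0.0.7", ACCOUNTS, dns_down))     # None
```

With the resolver answering, the login from 10.0.0.7 is matched to the
account defined by hostname. With the resolver failing, the same login
matches nothing, which is the kind of failure the story turns on.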
A downside to the MySQL approach is that if the DNS
system fails, the database won't accept any connections. This is
exactly what happened during the previous night. When a system
administrator tried to reboot the database server, the server failed
because of memory problems, so initially we weren't aware of
the DNS problem: the server wouldn't even start.
We had a backup database server for just such an
emergency, but promoting the slave--switching to the backup--had never
been done before. The system administrators were against it: there
was nobody at the data center and what if something even worse went
wrong? And there was a change freeze on because it was the winter
holidays.
This is where I made my mistake. I should have
said, "Switch to the stupid backup! That's what it's there for!
Our priority is to keep the site up for the users. If you're
concerned about your jobs, let's keep that in mind." As the
architect there, people looked to me for leadership. Instead,
full of doubt and indecision and worrying about the wrong things,
I failed to say what I should have said. What I said was, "I guess
we can afford to wait a few minutes for someone to arrive at the
data center."
I sold out, trying to ease the minds of people worried about their
jobs instead of worrying about the customers not being able to
access the site.
I blew it, and afterward I was ashamed of what
I had done.
Minutes can drag into hours during a crisis.
After an hour and a half, there was still no one at the data
center when the other architect
arrived. "Why is the site still down? Switch to the slave!
Heck, I'll do it myself. If anyone complains, send them to me
and I'll take the heat for it."
He was, of course, totally right.
When we switched to the backup, it turned out it
wouldn't work either, and that exposed the DNS problem. Then we
could get to work on that problem instead of waiting for someone
to arrive at the data center to replace the memory on the main
database server.
Architects don't always make the right decision.
People are human. But we can choose whether we learn from our mistakes
or keep repeating them. I was determined to learn from my
mistake and not repeat it in the next crisis situation.