The Lone Coder
Reflections for the Unsung Linux Saviours
by Ken O. Burtch
 
 

  My Daily WTF: The MySQL Outage

"Mistakes are part of the dues one pays for a full life."

 

-- Sophia Loren as quoted at Quote Garden

This month's article was supposed to be on test-driven development. But it's not ready yet. So here's a story about architecture.

Being an IT architect is a tough job because you will always be opposed by the developers. A coworker observed the other day that "There are two kinds of architects. One kind that's paid and does nothing. The other kind steps in where he doesn't belong and blocks projects for no reason."

The perspective from the architect's position is a little different. He's there to bring his intelligence, his experience and his overall vision of a company's systems to bear on problems that cross multiple teams.

When I was an architect, I spent a lot of my time trying to get people to slow down. This was intentional. I would bring representatives of four or five teams into a room and say, "OK, explain the problem to me." By making the developers slow down and work through the problem, it soon became apparent to everybody where there were misunderstandings. In companies, people are often rushing too fast. Making people take their time built respect between the teams and got to the root of problems.

But sometimes slowing down is the wrong thing to do, and I made a bad mistake once.

I came in one day to find a database server had failed. MySQL uses reverse domain lookups when a connection logs in. When you set up an account on MySQL, you have to specify both a username and a hostname or IP address. During a login attempt, the database engine needs to determine where you're logging in from in order to determine which account you are using. The connection only tells it the other computer's IP address, so it does a reverse domain lookup to get the hostname that goes with that address.
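Here is a minimal Python sketch of that kind of account matching. It is an illustration, not MySQL's actual code. The function names, the `accounts` set and the injectable `reverse` and `forward` resolvers are all made up for the example, and it assumes that the hostname found by the reverse lookup is checked with a forward lookup before it is trusted.

```python
import socket


def resolve_client_host(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Return the hostname for ip, or None if DNS cannot confirm it.

    The reverse lookup maps the IP address to a hostname. The forward
    lookup maps that hostname back to its addresses, and the hostname is
    trusted only if the original IP address is among them.
    """
    try:
        hostname, _aliases, _addrs = reverse(ip)
        _name, _aliases, addresses = forward(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses,
        # so a DNS failure of either kind ends up here.
        return None
    return hostname if ip in addresses else None


def match_account(user, ip, accounts, **resolvers):
    """Find the (user, host) account for a login, or None.

    accounts is a hypothetical set of (user, host) pairs, where host is
    either an IP address or a hostname. Real servers also handle
    wildcards and ordering, which this sketch leaves out.
    """
    # An account keyed on the IP address needs no DNS at all.
    if (user, ip) in accounts:
        return (user, ip)
    # An account keyed on a hostname needs a working DNS.
    hostname = resolve_client_host(ip, **resolvers)
    if hostname is not None and (user, hostname) in accounts:
        return (user, hostname)
    return None
```

In this sketch, an account keyed on an IP address is matched without any lookup, while an account keyed on a hostname depends on both lookups succeeding.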

A downside to the MySQL approach is that if the DNS system fails, the database can't work out which hostname a connection is coming from, so it won't accept connections for accounts set up with a hostname. This is exactly what happened during the previous night. When a system administrator tried to reboot the database server, the server failed because of memory problems, so initially we weren't aware of the DNS problem since the server wouldn't even start.

We had a backup database server for just such an emergency, but promoting the slave--switching to the backup--had never been done before. The system administrators were against it: there was nobody at the data center and what if something even worse went wrong? And there was a change freeze on because it was the winter holidays.

This is where I made my mistake. I should have said, "Switch to the stupid backup! That's what it's there for! Our priority is to keep the site up for the users. If you're concerned about your jobs, let's keep that in mind." As the architect there, people looked to me for leadership. Instead, full of doubt and indecision and worrying about the wrong things, I failed to say what I should have said. What I actually said was, "I guess we can afford to wait a few minutes for someone to arrive at the data center." I sold out, trying to ease the minds of people worried about their jobs instead of worrying about the customers not being able to access the site.

I blew it and afterward I was ashamed of what I did.

Minutes can drag into hours during a crisis. After an hour and a half, there was still no one at the data center when the other architect arrived. "Why is the site still down? Switch to the slave! Heck, I'll do it myself. If anyone complains, send them to me and I'll take the heat for it."

He was, of course, totally right.

When we switched to the backup, it turned out it wouldn't work either, and that exposed the DNS problem. Then we could get to work on that problem instead of waiting for someone to arrive at the data center to replace the memory on the main database server.

Architects don't always make the right decision. People are human. But we can choose whether we learn from our mistakes or keep repeating them. I was determined to learn from my mistake and not repeat it in the next crisis.

November 20, 2010 
