Worst Day of a DBA’s Life


Red Gate Software is running a campaign around coping with the worst day of a DBA’s life. We’ve been posting some really fun stories with, I hope, a kernel of useful information inside each. Chances are, if your DBA career has been like mine, your worst days don’t involve explosions and car chases. But now they’re asking people to write up their own stories: what was the worst day in your life as a DBA? I know, I know, first world problems, right? Regardless, I have yet to put a company out of business or kill anyone with any errors I’ve made, although I’ve worked at places where either was possible. But the one day that really stands out, well, it started about three weeks ahead of the bad day.

I was working for an internet startup. I was very much still learning the ropes as a DBA, although I had helped design a system on SQL Server 7.0 that was collecting about 1 GB of data a day. Back in those days, that was big data. But, frankly, I wasn’t doing the monitoring correctly. I was doing a lot of manual checks and manual maintenance, stuff I should have taken the time to automate. Live and learn, right? Anyway, because I was taking a lot of time out of each day to do maintenance and run checks on the systems, I wasn’t spending much time supporting the development teams. One day, one of the managers came in and said, “No more maintenance. Things should be fine. Spend time on development.” And he was serious. I argued and lost. So I started spending a lot of time doing development and let the maintenance slide.

Fast forward three weeks. Things had largely been stable, but I didn’t have monitoring in place, so I wasn’t noticing that we were running a little hot on transactions. The transaction log was bigger than it had been. And then disaster struck. The backup drive filled. I didn’t notice. Transaction log backups started failing. I didn’t have alerts. The log drive filled, all the way, and our 24/7, zero-downtime web site went kablooey.

It was 2 in the afternoon or something, so my fellow DBAs and I were right there. We immediately tried to back up the log, but it wouldn’t back up. We tried adding a log file. Didn’t work. Then we started getting creative. We tried every possible thing we could think of. Some of them failed quickly, so we tried something else. Some of them took hours to fail, making us think we had fixed the problem. It took us 48 hours of failed attempts before we finally admitted defeat, went to the last good backup and restored the database, losing about 12 hours’ worth of transactions. It was a nightmare.

The good news was that the directive to stop doing maintenance was rescinded. We immediately went about setting up alerts and other safeguards so that we wouldn’t get surprised like that ever again. It wasn’t fun, but it was a great learning experience. Plus, all the troubleshooting for 48 hours straight built excellent camaraderie within the team. That said, I’d rather have avoided the whole thing, and could have with proper monitoring and alerting.
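For what it’s worth, even a very small scheduled check would have flagged the backup drive before the log backups started failing. What follows is only a minimal sketch in Python, not what we actually built back then: the function names, the drive paths, and the 20% free-space threshold are all assumptions made up for illustration, and a real setup would hand the result to whatever alerting you already have.

```python
import shutil
from typing import Callable, List, NamedTuple


class DriveCheck(NamedTuple):
    """Result of one free-space check (hypothetical structure)."""
    path: str
    percent_free: float
    ok: bool


def check_drive(path: str,
                min_percent_free: float = 20.0,
                usage_fn: Callable = shutil.disk_usage) -> DriveCheck:
    """Report whether `path` has at least `min_percent_free` percent free.

    The 20% default is an arbitrary assumption; pick a threshold that
    leaves room for your largest backup. `usage_fn` is injectable so the
    check can be exercised without touching a real disk.
    """
    usage = usage_fn(path)
    # Treat a drive that reports zero capacity as having no free space.
    percent = 100.0 * usage.free / usage.total if usage.total else 0.0
    return DriveCheck(path, percent, percent >= min_percent_free)


def failing_drives(paths: List[str],
                   min_percent_free: float = 20.0,
                   usage_fn: Callable = shutil.disk_usage) -> List[DriveCheck]:
    """Return only the drives that fall below the threshold."""
    checks = (check_drive(p, min_percent_free, usage_fn) for p in paths)
    return [c for c in checks if not c.ok]


if __name__ == "__main__":
    # Hypothetical locations; substitute your own backup and log drives.
    for bad in failing_drives(["."]):
        print(f"ALERT: {bad.path} has only {bad.percent_free:.1f}% free")
```

Run from a scheduler every few minutes, something along these lines turns “I didn’t notice” into a message in someone’s inbox. It says nothing about whether the log backups themselves are succeeding, which would need its own check.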

5 Comments

  • I recently went back to the world of consulting. Fixing missing monitoring and incorrect maintenance plans is my daily to-do list for clients. Running out of log space just happened to a client of mine the other day. I guess we (uninformed IT) just keep on making the same mistakes. I call it job security for a consultant!

  • cbgb

    Please tell us what happened afterwards… Were people fired? How pissed was management about the loss of 12 hours of transactions?

    Thanks

    • No one was fired. We, the DBAs, learned some lessons from it, and management certainly learned lessons from it. My main lesson was that I needed to be more proactive about monitoring and automation. I think, but I’m not sure, the main lesson for management was that the DBA team actually did have an important job.

    • Bee Prest

      I do not agree with your nice spin on it, Grant. That idiot, your “boss”, who gave you that foolish instruction SHOULD have been held responsible, reprimanded or terminated. Why wasn’t he reprimanded??? You people DID NOT need to lose 12 hours’ worth of transactions to know that the core purpose of having a Database ADMINISTRATOR (DBA), {not a database developer}, is mainly to MONITOR IT. That’s the core purpose of having a database ADMIN on duty. I know you’re trying to put a positive spin on it. Truth demands you learn your lesson from wisdom. THAT WAS A FOOL YOU OBEYED. It’s plain and simple. No lesson learned here, just a story to share.
