Red Gate Software is running a campaign around coping with the worst day of a DBA’s life. We’ve been posting some really fun stories with, I hope, a kernel of useful information inside each. Chances are, if your DBA career has been like mine, your worst days don’t involve explosions and car chases. But now they’re asking people to write up their own stories: what was the worst day in your life as a DBA? I know, I know, first world problems, right? Regardless, I have yet to put a company out of business or kill anyone with any errors I’ve made, although I’ve worked at places where either was possible. But the one day that really stands out, well, it started about three weeks ahead of the bad day itself.
I was working for an internet startup. I was very much still learning the ropes as a DBA, although I had helped design a system on SQL Server 7.0 that was collecting about 1 GB of data a day. Back in those days, that was big data. But, frankly, I wasn’t doing the monitoring correctly. I was doing a lot of manual checks and manual maintenance, stuff I should have taken the time to automate. Live and learn, right? Anyway, because I was taking a lot of time out of each day to do maintenance and run checks on the systems, I wasn’t spending much time supporting the development teams. One day, one of the managers came in and said, “No more maintenance. Things should be fine. Spend time on development.” And he was serious. I argued and lost. So I started spending a lot of time doing development and let the maintenance slide. Fast forward three weeks. Things had largely been stable, but I didn’t have monitoring in place, so I didn’t notice that we were running a little hot on transactions. The transaction log was bigger than it had been. And then disaster struck. The backup drive filled. I didn’t notice. Transaction log backups started failing. I didn’t have alerts. The log drive filled, all the way, and our 24/7, zero-downtime web site went kablooey.
It was 2 in the afternoon or something, so my fellow DBAs and I were right there. We immediately tried to back up the log, but it wouldn’t back up. We tried adding a log file. Didn’t work. Then we started getting creative. We tried every possible thing we could think of. Some attempts failed quickly, so we moved on to something else. Some took hours to fail, making us think we had fixed the problem. It took us 48 hours of failed attempts before we finally admitted defeat, went to the last good backup and restored the database, losing about 12 hours’ worth of transactions. It was a nightmare.
The good news was that the directive to stop doing maintenance was rescinded. We immediately went about setting up alerts and other safeguards so that we wouldn’t get surprised like that ever again. It wasn’t fun, but it was a great learning experience. Plus, all the troubleshooting for 48 hours straight built excellent camaraderie within the team. That said, I’d rather have avoided the whole thing, and I could have with proper monitoring and alerting.
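For what it’s worth, the kind of check that would have caught this early is tiny. Here is a minimal sketch in Python, not what we actually built back then: it looks at how full a set of drives is and returns the ones over a threshold. The function names, the 80 percent threshold, and the idea of pointing it at the backup and log drives are all assumptions made for illustration; hooking the result up to email or a pager is left out.

```python
import shutil
from typing import Callable, Iterable, List, NamedTuple


class DriveStatus(NamedTuple):
    path: str
    percent_used: float
    alert: bool


def check_drive(path: str,
                threshold_pct: float = 80.0,
                usage_fn: Callable = shutil.disk_usage) -> DriveStatus:
    """Report how full the drive holding `path` is.

    `usage_fn` defaults to shutil.disk_usage; it is a parameter only so the
    check can be exercised without a real disk (a hypothetical convenience).
    """
    usage = usage_fn(path)
    percent_used = 100.0 * usage.used / usage.total
    return DriveStatus(path, percent_used, percent_used >= threshold_pct)


def drives_needing_attention(paths: Iterable[str],
                             threshold_pct: float = 80.0,
                             usage_fn: Callable = shutil.disk_usage) -> List[DriveStatus]:
    """Return only the drives at or above the threshold."""
    statuses = (check_drive(p, threshold_pct, usage_fn) for p in paths)
    return [s for s in statuses if s.alert]
```

Run on a schedule against the backup drive and the log drive, something like `drives_needing_attention(["E:\\", "L:\\"])` (drive letters made up) would have flagged the backup drive well before the log backups started failing.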