Read more about Tim’s challenge here.
It’s very easy to think of SQL Server backups as a technical problem. You have so much stuff going on: BACKUP DATABASE commands, recovery models, BACKUP LOG commands, differential backups. Getting them all into the correct order and automating the processes sure seems like a technical problem. It isn’t. It’s all about the business. If you’re taking on the duties of a DBA, whether you’re an accidental DBA, a reluctant DBA or you were voluntold into the DBA position, you need to plan to sit down with the responsible parties from the business and come to an understanding with them regarding RPO and RTO.
RPO is a TLA for Recovery Point Objective. The easiest way to describe RPO is to ask, “In terms of time, how much data are we willing to lose?” The immediate answer is always going to be zero. Here is where we have to be honest. You won’t be able to guarantee zero data loss (yeah, there are probably ways to do this, but #entrylevel). Talk with the business. Most of the time, you’ll find that they’d actually be OK with 15 minutes, or maybe 5 minutes, or even an hour of lost data. It really varies, not only from business to business, but from database to database within the business (allow for this flexibility). You need to establish this number. RPO is going to help you figure out how to set up your backups, your recovery model, your logs and their backups. All that stuff that seemed so technical, it’s all based on this extremely important number that you’re going to work with the business to arrive at.
Oh, but we’re not done. Once you’ve managed to get the business comfortable (as comfortable as they can be) with the idea that they could lose data, you also have to prepare them for the idea that, in the event of a disaster, restoring the database from backups is not going to be instant. It’s going to take time. This is where we have to define the RTO, or Recovery Time Objective. This is our goal for how quickly we can restore the database. RTO is not so much a negotiation with the business as it is an education for the business. You see, you can only restore so fast on your hardware. Further, the RESTORE DATABASE process is dependent on the size of the backups. Even further, it’s dependent on the types of restore operations we’re running and whether or not we use WITH RECOVERY in the RESTORE operations. You may have to test a few restores to get an idea of how fast things are on your system. Regardless, the RTO has to be arrived at and agreed on. You may also have to readdress the RTO as the size and volume of your data change over time. Be prepared for this as well.
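To make the "test a few restores" advice concrete, here is a minimal sketch of turning one timed test restore into a rough number you can bring to the business. It assumes restore time scales linearly with backup size, which is only an approximation. The function name, the sizes and the overhead figure are all hypothetical, so plug in your own measurements.

```python
def estimate_restore_minutes(backup_sizes_gb, measured_gb_per_minute,
                             overhead_minutes=0.0):
    """Rough restore time: total backup size divided by measured throughput.

    Assumption: restore time is roughly linear in backup size. Real restores
    can be slower or faster, so treat the result as a starting point only.
    """
    if measured_gb_per_minute <= 0:
        raise ValueError("measured_gb_per_minute must be positive")
    return sum(backup_sizes_gb) / measured_gb_per_minute + overhead_minutes


# Hypothetical test restore: 200 GB restored in 50 minutes.
throughput = 200 / 50  # 4 GB per minute

# Hypothetical backups we would have to restore, in GB:
# one full, one differential, and the log backups taken since.
chain = [200, 30, 10]

# overhead_minutes is a made-up allowance for finding files, approvals, etc.
estimate = estimate_restore_minutes(chain, throughput, overhead_minutes=10)
print(f"Estimated restore time: {estimate:.0f} minutes")
```

Rerun the test restore and the estimate as your databases grow, since the number you agreed on last year may no longer hold.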
With the RPO established, you can now decide on the recovery model. Let’s take an example. If the business says that they can afford a day of data loss, depending on the size of your database, you can put this database into SIMPLE recovery, run a full backup once a day and walk away a winner. Another example: the business decides that it could live with up to 15 minutes’ worth of data loss. Now you have to go to FULL recovery and you have to set up log backups in addition to your full backups. Then, you start to factor in the RTO. Let’s say your outage were to occur at 8PM and you run your full backups at midnight. You now have to restore the full backup plus 20 hours’ worth of log backups. That can take a long time. So, in order to make the RTO as short as possible, you toss in a differential backup every day at noon. Now you’ll only ever have to restore the full backup, the latest differential and at most 12 hours’ worth of log backups (8 hours’ worth for that 8PM outage), so you can define a rough RTO for the business.
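The arithmetic in that example is easy to check with a few lines of code. This is only a sketch: it assumes one full backup a day at a fixed hour and optional differentials on the hour, and the function name, parameters and the date are made up for illustration.

```python
from datetime import datetime, timedelta


def log_hours_to_restore(outage, full_backup_hour=0, diff_backup_hours=()):
    """Hours of log backups to restore on top of the latest full or differential.

    Assumes one full backup per day at full_backup_hour and optional
    differentials at diff_backup_hours (a hypothetical, simplified schedule).
    """
    candidates = []
    for hour in (full_backup_hour, *diff_backup_hours):
        taken = outage.replace(hour=hour, minute=0, second=0, microsecond=0)
        if taken > outage:
            taken -= timedelta(days=1)  # that backup last ran yesterday
        candidates.append(taken)
    latest = max(candidates)  # most recent full or differential before the outage
    return (outage - latest).total_seconds() / 3600


outage = datetime(2024, 1, 15, 20, 0)  # 8PM on an arbitrary date

# Full backup at midnight only: 20 hours of log backups to restore.
no_diff = log_hours_to_restore(outage)

# Add a differential at noon: 8 hours of log backups for this outage.
with_diff = log_hours_to_restore(outage, diff_backup_hours=(12,))

# Worst case across the day with the noon differential: just under 12 hours.
worst_with_diff = max(
    log_hours_to_restore(datetime(2024, 1, 15, h, 59), diff_backup_hours=(12,))
    for h in range(24)
)
print(no_diff, with_diff, round(worst_with_diff, 2))
```

Swap in your own schedule to see how much log you would be replaying in the worst case, then time a restore of that size to see whether the RTO still holds.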
These simplified, and somewhat simplistic, examples are just the start of the process of figuring out how best to do your backups. However, that’s the technical part of the problem. The fundamental definitions that you have to have in order to start solving this technical issue are business decisions that you must get your business people involved with. Define the RPO and RTO, then start defining your recovery strategy.