For the first several years that I supported server environments, I spent most of my time working with backup systems. I noticed that almost everyone did their due diligence in performing backups. Most people took adequate care to verify that their scheduled backups ran without error. However, almost no one ever checked that they could actually restore from a backup until disaster struck. I gathered a lot of sorrowful stories during those years. I want to use those experiences to help you avert a similar tragedy.
Successful Backups Do Not Guarantee Successful Restores
Fortunately, a lot of the problems that I dealt with in those days have almost disappeared due to technological advancements. However, that only means that you have better odds of a successful restore, not that you have zero chance of failure. Restore failures typically mean that something unexpected happened to your backup media or to the people responsible for it. Things that I’ve encountered:
- Staff inadvertently overwrote a full backup copy with an incremental or differential backup
- No one retained the necessary decryption information
- Media was lost or damaged
- Media degraded to uselessness
- Staff did not know how to perform a restore — sometimes with disastrous outcomes
I’m sure that some of you have your own horror stories.
These risks apply to all organizations. Sometimes we manage to convince ourselves that we have immunity to some or all of them, but you can’t get there without extra effort. Let’s break down some of these line items.
People Represent the Weakest Link
We would all like to believe that our staff will never make errors and that the people who need to operate the backup system have the ability to do so. However, as a part of your disaster recovery planning, you must assume that you cannot predict the state or availability of any individual. If only a few people know how to use your backup application, then those people become part of your risk profile.
You have a few simple ways to address these concerns:
- Periodically test the restore process
- Document the restore process and keep the documentation updated
- Give non-IT personnel knowledge of, and practice with, backup and restore operations
- Make sure that non-IT personnel know how to get help with the application
It’s reasonable to expect that you would call your backup vendor for help in the event of an emergency that prevented your best people from performing restores. However, in many organizations without a proper disaster recovery plan, no one outside of IT even knows who to call. The knowledge inside any company naturally tends to arrange itself in silos, but you must make sure to spread at least the bare minimum information.
Technology Does Fail
I remember many shock and horror reactions when a company owner learned that we could not read the data from their backup tapes. A few times, these turned into grief and loss counselling sessions as they realized that they were facing a critical — or even complete — data loss situation. Tape has its own particular risk profile, and lots of businesses have stopped using it in favour of on-premises disk-based storage or cloud-based solutions. However, all backup storage technologies present some kind of risk.
In my experience, data degradation occurred most frequently. You might see this called other things, my favourite being “bit rot”. Whatever you call it, it all means the same thing: the data currently on the media is not the same data that you recorded. That can happen simply because magnetic storage media decay over time and are sensitive to conditions such as heat and stray magnetic fields. In those cases, no one made any mistakes; the media just didn’t last. For all media types, we can establish an average failure rate. However, we have absolutely no guarantees on the shelf life of any individual unit. I have seen data pull cleanly off decade-old media; I have seen week-old backups fail miserably.
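One common way to catch this kind of silent change is to record a checksum for every file when the backup is written and to re-check those checksums later. The following is a minimal sketch of that idea, not a feature of any particular backup product. The function names (`sha256_of`, `build_manifest`, `find_damaged`) are made up for illustration, and it assumes the backup is a plain directory tree that you can read directly.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backup files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose content has changed."""
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

You would call `build_manifest` right after the backup completes, store the result somewhere other than the backup media, and run `find_damaged` against the media on a schedule. An empty list means that every recorded file still matches its checksum.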
Unexpectedly, newer technology can make things worse. In our race to cut costs, we frequently employ newer ways to save space and time. In the past, we had only compression and incremental/differential solutions. Now, we have tools that can deduplicate across several backup sets and at multiple levels. As a result, many backups often rely on a single stored copy of a piece of data, so damage to that one copy can affect every backup that references it.
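To make the risk of relying on a single copy concrete, here is a toy model of a deduplicating store. It is only an illustration under simplified assumptions: the `DedupStore` class, the fixed 4-byte block size, and the backup names are all invented for this sketch and do not describe how any real product stores data.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: each unique block is kept exactly once."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}        # block hash -> block data
        self.backups: dict[str, list[str]] = {}   # backup name -> ordered block hashes

    def add_backup(self, name: str, data: bytes, block_size: int = 4) -> None:
        hashes = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            key = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(key, block)  # identical blocks are stored once
            hashes.append(key)
        self.backups[name] = hashes

    def damaged_backups(self) -> list[str]:
        """Names of backups that reference a block whose data no longer matches its hash."""
        bad = {
            key for key, block in self.blocks.items()
            if hashlib.sha256(block).hexdigest() != key
        }
        return sorted(
            name for name, hashes in self.backups.items() if bad & set(hashes)
        )


store = DedupStore()
store.add_backup("monday", b"AAAABBBBCCCC")
store.add_backup("tuesday", b"AAAABBBBDDDD")
store.add_backup("wednesday", b"EEEEFFFF")

# Simulate bit rot in the single stored copy of the shared "BBBB" block.
shared_key = hashlib.sha256(b"BBBB").hexdigest()
store.blocks[shared_key] = b"BBBX"

print(store.damaged_backups())  # prints ['monday', 'tuesday']
```

One damaged block takes out both backups that share it, while the backup that does not reference it survives. Without deduplication, each backup would have carried its own copy of that block.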
How to Test your Backup Strategy
The best way to identify problems is to break-test to find weaknesses. Test restores will help you gauge the reliability of your backups and solve these problems. Simply put, you cannot know that you have a good backup unless you can perform a good restore. You cannot know that your staff can perform a restore unless they perform a restore. For maximum effect, you need to plan tests to occur on a regular basis.
Some tools, like Altaro VM Backup, have built-in features to make tests easy. Altaro VM Backup provides a “Test & Verify Backups” wizard to help you perform on-demand tests and a “Schedule Test Drills” feature to help you automate the process.
If your tool does not have such a feature, you can still use it to make certain that your data will be there when you need it. It should have some way to restore a separate or redirected copy. So, instead of overwriting your live data, you can create a duplicate in another place where you can safely examine and verify it.
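Once you have a redirected copy, you can compare it against the original files. The sketch below shows one way to do that with Python's standard `filecmp` and `os` modules. The function name `compare_trees` and the shape of its report are assumptions made for this example. It also assumes that both locations are ordinary folders that the script can read.

```python
import filecmp
import os


def compare_trees(original: str, restored: str) -> dict[str, list[str]]:
    """Compare every file under original with its counterpart under restored."""
    report: dict[str, list[str]] = {"missing": [], "different": []}
    for folder, _subdirs, files in os.walk(original):
        for name in files:
            source = os.path.join(folder, name)
            rel_path = os.path.relpath(source, original)
            copy = os.path.join(restored, rel_path)
            if not os.path.isfile(copy):
                report["missing"].append(rel_path)
            # shallow=False compares file contents, not just size and timestamps
            elif not filecmp.cmp(source, copy, shallow=False):
                report["different"].append(rel_path)
    report["missing"].sort()
    report["different"].sort()
    return report
```

Keep in mind that live data changes after the backup runs, so files modified since then will legitimately show up as different. The comparison is most useful against data that you know has not changed, or against a snapshot taken at backup time.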
Test Restore Scenario
In the past, we would often simply restore some data files to a shared location and use a simple comparison tool. Now that we use virtual machines for so much, we can do a great deal more. I’ll show one example of a test that I use. In my system, all of these are Hyper-V VMs. You’ll have to adjust accordingly for other technologies.
Using your tool, restore copies of:
- A domain controller
- A SQL server
- A front-end server dependent on the SQL server
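On the host that you restored those VMs to, create a private virtual switch. Connect each virtual machine to it. A private switch only carries traffic between virtual machines on that host, so the copies cannot collide with their live counterparts on your production network. Spin up the copied domain controller, then the copied SQL server, then the copied front-end. Use the Virtual Machine Connection (VMConnect) console to verify that all of them work as expected.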
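The start order in this scenario follows from the dependencies between the machines. For a larger test, you could write the dependencies down and let a script work out the order. This is a small sketch using Python's standard `graphlib` module (Python 3.9 or later). The VM names are placeholders, and the script only computes the order; it does not start anything.

```python
from graphlib import TopologicalSorter


def start_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return VM names ordered so that every VM starts after the VMs it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


# Each VM maps to the set of VMs that must be running before it starts.
# These names are placeholders for the restored copies, not real VM names.
restored_vms = {
    "dc-copy": set(),
    "sql-copy": {"dc-copy"},
    "frontend-copy": {"sql-copy"},
}

print(start_order(restored_vms))  # prints ['dc-copy', 'sql-copy', 'frontend-copy']
```

Keeping the dependency map alongside your restore documentation also gives anyone running the test a record of why the order matters.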
Create test restore scenarios of your own! Make sure that they match a real-world scenario that your organization would rely on after a disaster.
Author: Eric Siron