Perhaps the only thing worse than having a disaster strike your datacenter is the stress of recovering your data and services as quickly as possible. Most businesses need to operate 24 hours a day, and any service outage will upset customers and cost your business money. According to a 2016 study by the Ponemon Institute, the average datacenter outage costs enterprises over $750,000 and lasts about 85 minutes, losing the businesses roughly $9,000 per minute. While your organization may be operating at a smaller scale, any service downtime or data loss is going to hurt your reputation and may even jeopardize your career. This blog post will give you the best practices for recovering your data from a backup and bringing your services online as fast as possible.
Automation is key when it comes to decreasing your Recovery Time Objective (RTO) and minimizing your downtime. Any time you have a manual step in the process, it is going to create a bottleneck. If the outage is caused by a natural disaster, relying on human intervention is particularly risky as the datacenter may be inaccessible or remote connections may not be available. As you learn about the best practices for detection, alerting, recovery, startup, and verification, consider how you could implement each of these steps in a fully automated fashion.
The first way to optimize your recovery speed is to detect the outage as quickly as possible. If you have an enterprise monitoring solution like System Center Operations Manager (SCOM), it will continually check the health of your application and its infrastructure, looking for errors or other problems. Even if you have developed an in-house application and do not have access to enterprise tools, you can use Windows Task Scheduler to set up tasks that automatically check system health by scanning event logs and then trigger recovery actions. There are also many free monitoring tools, such as Uptime Robot, which alerts you any time your website goes offline.
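To make the detection step concrete, here is a minimal Python sketch of a check that declares an outage only after several consecutive failed probes. It is not how SCOM or Uptime Robot work internally; the function name `detect_outage`, the threshold of three, and the idea of feeding it a sequence of probe results are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def detect_outage(probe_results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the probe at which an outage is declared, or None.

    Each item in probe_results is True if the service answered that probe.
    An outage is declared once `threshold` probes in a row have failed,
    which filters out a single transient failure.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        if healthy:
            consecutive_failures = 0  # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return index
    return None
```

In practice the probe results would come from a scheduled task that checks an event log or an HTTP endpoint, and the return value would trigger your alerting and recovery workflow.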
Once the administrators have been alerted, immediately begin the recovery process. Meanwhile, you should run a secondary health check on the system to make sure that you did not receive a false alert. This is a great background task to continually run during the recovery process to make sure that something like a cluster failover or transient network failure does not force your system into restarting if it is actually healthy. If the outage was indeed a false positive, then have a task prepared which will terminate the recovery process so that it does not interfere with the now-healthy system.
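One possible way to wire in that secondary health check is to re-run it before every recovery step and stop as soon as the system reports healthy. The sketch below assumes hypothetical step callables and a hypothetical `is_healthy` check; it does not undo steps that have already run, which you would need to handle separately.

```python
from typing import Callable, List, Sequence, Tuple


def run_recovery(
    steps: Sequence[Tuple[str, Callable[[], None]]],
    is_healthy: Callable[[], bool],
) -> Tuple[bool, List[str]]:
    """Run recovery steps in order, re-checking health before each one.

    Returns (completed, names_of_steps_run). If the secondary health check
    reports the system healthy, the alert is treated as a false positive
    and the remaining steps are skipped.
    """
    executed: List[str] = []
    for name, action in steps:
        if is_healthy():
            return False, executed  # false positive: stop interfering
        action()
        executed.append(name)
    return True, executed
```

Checking between steps rather than only once at the start means a cluster failover that completes midway through the process can still cancel the rest of the recovery.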
If you restore your service and determine that there was data loss, then you will need to decide whether to accept that loss or attempt to recover from the last good backup, which can cause further downtime during the restoration. Make sure you can automatically determine whether you need to restore a full backup, or whether a differential backup is sufficient to give you a faster recovery time. By comparing the timestamp of the outage to the timestamp on your backup(s), you can determine which option will minimize the impact on your business. This can be done with a simple PowerShell script, but make sure that you know how to get this information from your backup provider and pass it into your script.
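The timestamp comparison can be sketched in a few lines. The example below uses Python purely for illustration and assumes a hypothetical `Backup` record with a `kind` and a `taken_at` field; your backup provider will expose this information in its own format. It also assumes that a differential is applied on top of the most recent full backup taken before the outage.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    kind: str  # "full" or "differential" (hypothetical labels)
    taken_at: datetime


def plan_restore(backups: List[Backup], outage_at: datetime) -> List[Backup]:
    """Return the backups to restore, in order, for the given outage time."""
    # Only backups taken before the outage are candidates.
    usable = [b for b in backups if b.taken_at < outage_at]
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        return []
    last_full = max(fulls, key=lambda b: b.taken_at)
    # A differential is only useful if it is newer than the full it builds on.
    diffs = [
        b for b in usable
        if b.kind == "differential" and b.taken_at > last_full.taken_at
    ]
    if not diffs:
        return [last_full]
    return [last_full, max(diffs, key=lambda b: b.taken_at)]
```

An empty result means no usable full backup exists before the outage, which is itself something your script should alert on.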
Once you have identified the best backup, you then need to copy it to your production system as fast as possible. A lot of organizations will deprioritize their backup network since it is only used a few times a day or week. This may be acceptable during the backup process, but these networks need to be optimized during recovery. If you do need to restore a backup, consider running a script that will prioritize this traffic, such as by changing the quality of service (QoS) settings or disabling other traffic which uses that same network.
Next, consider the storage media to which the backup is copied before the restoration happens. Try to use your fastest SSDs to maximize the speed at which the backup is restored. If you decided to back up your data to tape, you will likely have high copy speeds during restoration. However, tapes usually require manual intervention to find and mount, which should generally be avoided if you want a fully automated process. You can learn more about the tradeoffs of using tape drives and other media here.
Once your backup has been restored, then you need to restart the services and applications. If you are restoring to a virtual machine (VM), then you can optimize its startup time by maximizing the memory which is allocated to it during startup and operations. You can also configure VM prioritization to ensure that this critical VM starts first in case it is competing with other VMs to launch on a host which has recently crashed. Enable QoS on your virtual network adapters to ensure that traffic flows through to the guest operating system as quickly as possible, which will speed up the time to restore a backup within the VM, and also help clients reconnect faster. Whether you are running this application within a VM or on bare metal, you can also use Task Manager to enhance the priority of the important processes.
Now verify that your backup was restored correctly and your application is functioning as expected by running some quick test cases. If you feel confident that those tests worked, then you can allow users to reconnect. If those tests fail, then work backward through the workflow to try to determine the bottleneck, or simply roll back to the next “good” backup and try the process again.
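The restore, test, and roll-back loop can be automated along these lines. This is only a sketch: `restore` and `verify` are hypothetical callables standing in for your backup tool and your test cases, and the backups are assumed to be supplied newest first.

```python
from typing import Callable, Optional, Sequence, TypeVar

B = TypeVar("B")


def restore_until_verified(
    backups_newest_first: Sequence[B],
    restore: Callable[[B], None],
    verify: Callable[[], bool],
) -> Optional[B]:
    """Restore backups from newest to oldest until the test cases pass.

    Returns the backup that passed verification, or None if none did.
    """
    for backup in backups_newest_first:
        restore(backup)
        if verify():
            return backup  # safe to let users reconnect
    return None
```

Returning the backup that finally passed tells you how much data was lost relative to the newest copy, which is useful input for the post-incident review.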
Any time you need to restore from a backup, it will be a frustrating experience, which is why testing throughout your application development lifecycle is critical. Any single point of failure can cause your backup or recovery to fail, which is why this testing needs to be part of your regular business operations. Once your systems have been restored, always make sure your IT department does a thorough investigation into what caused the outage, what worked well in the recovery, and what areas could be improved. Review the time each step took to complete and ask yourself whether any of these should be optimized. It is also good practice to write up a formal report which can be saved and referred to in the future, even if you have moved on to a different company.
You will focus a great deal of your disaster recovery planning (and rightly so) on the data that you need to capture. The best way to find out if your current strategy does this properly is to try our acid test. However, backup coverage only accounts for part of a proper overall plan. Your larger design must include a thorough model of recovery goals, specifically Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Ideally, a restore process would contain absolutely everything. Practically, expect that to never happen. This article explains the risks and options of when and how quickly operations can and should resume following systems failure.
If a catastrophe strikes that requires recovery from backup media, most people will first ask: “How long until we can get up and running?” That’s an important question, but not the only time-oriented problem that you face. Additionally, and perhaps more importantly, you must ask the question: “How much already-completed operational time can we afford to lose?” The business-continuity industry represents the answers to those questions in the acronyms RTO and RPO, respectively.
What is Recovery Time Objective?
Your Recovery Time Objective (RTO) sets the expectation for the answer to, “How long until we can get going again?” Just break the words out into a longer sentence: “It is the objective for the amount of time between the data loss event and recovery.”
Of course, we would like to make all of our recovery times instant. But, we also know that will not happen. So, you need to decide in advance how much downtime you can tolerate, and strategize accordingly. Do not wait until the midst of a calamity to declare, “We need to get online NOW!” By that point, it will be too late. Your organization needs to build up those objectives in advance. Budgets and capabilities will define the boundaries of your plan. Before we investigate that further, let’s consider the other time-based recovery metric.
What is Recovery Point Objective?
We don’t just want to minimize the amount of time that we lose; we also want to minimize the amount of data that we lose. Often, we frame that in terms of retention policies — how far back in time we need to be able to access. However, failures usually cause a loss of systems during run time. Unless all of your systems continually duplicate data as it enters the system, you will lose something. Because backups generally operate on a timer of some sort, you can often describe that potential loss in a time unit, just as you can with recovery times. We refer to the maximum total acceptable amount of lost time as a Recovery Point Objective (RPO).
As with RTOs, shorter RPOs are better. The shorter the amount of time since a recovery point, the less overall data lost. Unfortunately, reduced RPOs take a heavier toll on resources. You will need to balance what you can achieve against what your business units want. Allow plenty of time for discussions on this subject.
Challenges Against Short RTOs and RPOs
First, you need to understand what will prevent you from achieving instant RTOs and RPOs. More importantly, you need to ensure that the critical stakeholders in your organization understand it. These objectives mean setting reasonable expectations for your managers and users at least as much as they mean setting goals for your IT staff.
We can define a handful of generic obstacles to quick recovery times:
Time to acquire, configure, and deploy replacement hardware
Effort and time to move into new buildings
Need to retrieve or connect to backup media and sources
You may also face some barriers specific to your organization, such as:
Involvement of key personnel
Make sure to clearly document all known conditions that add time to recovery efforts. They can help you to establish a recovery checklist. When someone requests a progress report during an outage, you can indicate the current point in the documentation. That will save you time and reduce frustration.
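Once the checklist carries a time estimate for each step, you can compare the total against your RTO. The following is a minimal sketch; the step names and durations are invented for illustration and should be replaced with your own documented figures.

```python
from datetime import timedelta
from typing import Dict


def rto_headroom(step_estimates: Dict[str, timedelta], rto: timedelta) -> timedelta:
    """Return the RTO minus the total estimated recovery time.

    A negative result means the documented steps cannot meet the RTO.
    """
    total = sum(step_estimates.values(), timedelta())
    return rto - total


# Hypothetical checklist with estimated durations.
checklist = {
    "acquire and deploy replacement hardware": timedelta(hours=4),
    "retrieve backup media": timedelta(hours=1),
    "restore data": timedelta(hours=2),
}
```

With these made-up numbers, an eight-hour RTO leaves one hour of slack, while a six-hour RTO is missed by an hour, which is exactly the kind of gap you want stakeholders to see before an outage rather than during one.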
We could create a similar list for RPO challenges as we did for RTO challenges. Instead, we will use one sentence to summarize them all: “The backup frequency establishes the minimum RPO”. In order to take more frequent backups, you need a fast backup system with adequate amounts of storage. So, your ability to bring resources to bear on the problem directly impacts RPO length. You have a variety of solutions to choose from that can help.
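That one-sentence rule translates directly into arithmetic. The sketch below assumes evenly spaced backups and ignores the time a backup takes to complete, so treat the results as a lower bound; the function names are illustrative.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """The worst-case loss is one full interval, so it must not exceed the RPO."""
    return backup_interval <= rpo


def backups_per_day_for_rpo(rpo: timedelta) -> int:
    """Minimum number of evenly spaced backups per day to stay within the RPO."""
    if rpo <= timedelta():
        raise ValueError("RPO must be a positive duration")
    day = timedelta(days=1)
    # Ceiling division using integer arithmetic on timedeltas.
    return -(-day // rpo)
```

For example, a five-minute RPO implies 288 backups per day, which shows why short RPOs demand fast backup systems and plenty of storage.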
Outlining Organizational Desires
Before expending much effort figuring out what you can do, find out what you must do. Unless you happen to run everything, you will need input from others. Start broadly with the same type of questions that we asked above: “How long can you tolerate downtime during recovery?” and “How far back from a catastrophic event can you re-enter data?” Explain RTOs and RPOs. Ensure that everyone understands that RPO means a loss of recent data, not long-term historical data.
These discussions may require a fair bit of time and multiple meetings. Suggest that managers work with their staff on what-if scenarios. They can even simulate operations without access to systems. For your part, you might need to discover the costs associated with solutions that can meet different RPO and RTO levels. You do not need to provide exact figures, but you should be ready and able to answer ballpark questions. You should also know the options available at different spend levels.
Considering the Availability and Impact of Solutions
To some degree, the amount that you spend controls the length of your RTOs and RPOs. That has limits; not all vendors provide the same value per dollar spent. But, some institutions set out to spend as close to nothing as possible on backup. While most backup software vendors do offer a free level of their product, none of them makes their best features available at no charge. Organizations that try to spend nothing on their backup software will have high RTOs and RPOs and may encounter unexpected barriers. Even if you find a free solution that does what you need, no one makes storage space and equipment available for free. You need to find a balance between cost and capability that your company can accept.
To help you understand your choices, we will consider different tiers of data protection.
Instant Data Replication
For the lowest RPO, only real-time replication will suffice. In real-time replication, every write to live storage is also written to backup storage. You can achieve this many ways, but the most reliable involve dedicated hardware. You will spend a lot, but you can reduce your RPO to effectively zero. Even a real-time replication system can drop active transactions, so never expect a complete shield against data loss.
Real-time replication systems have a very high associated cost. For the most reliable protection, they will need to span geography as well. If you just replicate to another room down the hall and a fire destroys the entire building, your replication system will not save you. So, you will need multiple locations, very high speed interconnects, and capable storage systems.
Short Interval Data Replication
If you can sustain a few minutes of lost information, then you will usually find much lower price tags for short-interval replication technology. Unlike real-time replication, software can handle the load of delayed replication, so you will find more solutions. As an example, Altaro VM Backup offers Continuous Data Protection (CDP), which cuts your RPO to as low as five minutes.
As with instant replication, you want your short-interval replication to span geographic locations if possible. But, you might not need to spend as much on networking, as the delays in transmission give transfers more time to complete.
Ransomware Considerations for Replication
You always need to worry about data corruption in replication. Ransomware adds a new twist but presents the same basic problem. Something damages your real-time data. None-the-wiser, your replication system makes a faithful copy of that corrupted data. The corruption or ransomware has turned both your live data and your replicated data into useless jumbles of bits.
Anti-malware and safe computing practices present your strongest front-line protection against ransomware. However, you cannot rely on them alone. The upshot: you cannot rely on replication systems alone for backup. A secondary implication: even though replication provides very short RPOs, you cannot guarantee them.
Short Interval Backup
You can use most traditional backup software in short intervals. Sometimes, those intervals can be just, or nearly, as short as short-term replication intervals. The real difference between replication and backup is the number of possible copies of duplicated data. Replication usually provides only one copy of live data — perhaps two or three at the most — and no historical copies. Backup programs differ in how many unique simultaneous copies that they will make, but all will make multiple historical copies. Even better, historical copies can usually exist offline.
You do not need to set a goal of only a few minutes for short interval backups. To balance protection and costs, you might space them out in terms of hours. You can also leverage delta, incremental, and differential backups to reduce total space usage. Sometimes, your technologies have built-in solutions that can help. As an example, SQL administrators commonly use transaction log backups on a short rotation to make short backups to a local disk. They perform a full backup each night that their regular backup system captures. If a failure occurs during the day that does not wipe out storage, they can restore the previous night’s full backup and replay the available transaction log backups.
Long Interval Backup
At the “lowest” tier, we find the oldest solution: the reliable nightly backup. This usually costs the least in terms of software licenses and hardware. Perhaps counter-intuitively, it also provides the most resilient solution. With longer intervals, you also get longer-term storage choices. You get three major benefits from these backups: historical data preservation, protection against data corruption, and offline storage. We will explore each in the upcoming sections.
Ransomware Considerations for Backup
Because we use a backup to create distinct copies, it has some built-in protection against data corruption, including ransomware. As long as the ransomware has no access to a backup copy, it cannot corrupt that copy. First and foremost, that means that you need to maintain offline backups. Replication requires essentially constant connectivity to its replicas, so only backup can work under this restriction. Second, it means that you need to exercise caution when you execute restore procedures. Some ransomware authors have made their malware aware of several common backup applications, and the malware will hijack them to corrupt backups whenever possible. You can only protect your offline data copies by attaching them to known-safe systems.
Using Multiple RTOs and RPOs
You will need to structure your systems into multiple RTO and RPO categories. Some outages will not require much time to recover from. Some will require different solutions. Even though we tend to think primarily in terms of data during disaster recovery planning, you must consider equipment as well. For instance, if your sales division prints its own monthly flyers and you lose a printer, then you need to establish RTOs, RPOs, downtime procedures, and recovery processes just for those print devices.
You also need to establish multiple levels for your data, especially when you have multiple protection systems. For example, if you have both replication and backup technologies in operation, then you will set one RPO/RTO value for times when the replication works, and RTO/RPO values for when you must resort to long-term backup. That could happen due to ransomware or some other data corruption event, but it can also happen if someone accidentally deletes something important.
To start this planning, establish “Best Case” and “Worst Case” plans and processes for your individual systems.
Leveraging Rotation and Retention Policies
For your final exercise in time-based disaster recovery designs, we will look at rotation and retention policies. “Rotation” comes from the days of tape backups, when we would decide how often to overwrite old copies of data. Now that high-capacity external disks have reached a low-cost point, many businesses have moved away from tape. You may not overwrite media anymore, or at least not at the same frequency. Retention policies dictate how long you must retain at least one copy of a given piece of information. These two policies directly relate to each other.
In today’s terms, think of “rotation” more in terms of unique copies of data. Backup systems have used “differential” and “incremental” backups for a very long time. The former is a complete record of changes since the last full backup; the latter is a record of changes since the last backup of any kind. Newer backup applications have “delta” and deduplication capabilities. A “delta” backup operates like a differential or incremental backup, but within files or blocks. Deduplication keeps only one copy of a block of bits, regardless of how many times it appears within an entire backup set. These technologies reduce backup time and storage space needs… at a cost.
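To see what deduplication does in practice, consider this minimal Python sketch. It uses fixed-size blocks and SHA-256 digests as an illustrative assumption; real backup products choose their own block sizes, hashing, and storage layout.

```python
import hashlib
from typing import Dict, List, Tuple


def deduplicate(data: bytes, block_size: int = 4) -> Tuple[Dict[str, bytes], List[str]]:
    """Split data into fixed-size blocks and store each unique block once.

    Returns (store, recipe): store maps a SHA-256 digest to the block bytes,
    and recipe lists the digests in original order so the data can be rebuilt.
    """
    store: Dict[str, bytes] = {}
    recipe: List[str] = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy of a block
        recipe.append(digest)
    return store, recipe


def rebuild(store: Dict[str, bytes], recipe: List[str]) -> bytes:
    """Reassemble the original data from the block store and the recipe."""
    return b"".join(store[digest] for digest in recipe)
```

Because every repeated block points at a single stored copy, damage to that one copy affects every place it appears, which is part of the cost these space-saving techniques carry.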
Minimizing Rotation Risks
All of these speed-enhancing and space-reducing improvements have one major cost: they reduce the total number of available unique backup copies. As long as nothing goes wrong with your media, then this will never cause you a problem. However, if one of the full backups suffers damage, then that invalidates all dependent partial backups. You must balance the number of full backups that you take against the amount of time and bandwidth necessary to capture them.
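You can quantify that risk by modeling which backup each partial backup depends on. The sketch below uses an invented weekly schedule and a hypothetical `depends_on` mapping; it simply walks the dependency chain to list everything that becomes unusable when one backup is damaged.

```python
from typing import Dict, Optional, Set


def invalidated_by(damaged: str, depends_on: Dict[str, Optional[str]]) -> Set[str]:
    """Return every backup that can no longer be restored if `damaged` is lost.

    depends_on maps each backup name to the backup it needs, or None for a
    full backup that stands on its own.
    """
    lost = {damaged}
    changed = True
    while changed:  # repeat until no further dependents are found
        changed = False
        for name, parent in depends_on.items():
            if parent in lost and name not in lost:
                lost.add(name)
                changed = True
    return lost


# Hypothetical schedule: two fulls, each followed by incrementals.
schedule = {
    "full_sun": None,
    "inc_mon": "full_sun",
    "inc_tue": "inc_mon",
    "full_wed": None,
    "inc_thu": "full_wed",
}
```

In this made-up schedule, losing `full_sun` takes `inc_mon` and `inc_tue` with it, while losing `inc_tue` costs only that single day. More frequent full backups shorten the chains and limit the damage.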
As one minimizing strategy, target your full backup operations to occur during your organization’s quietest periods. If you do not operate 24 hours per day, that might allow for nightly full backups. If you have low volume weekends, you might take full backups on Saturdays or Sundays. You can intersperse full backups on holidays.
Coalescing into a Disaster Recovery Plan
As you design your disaster recovery plan, review the sections in this article as necessary. Remember that all operations require time, equipment, and personnel. Faster backup and restore operations always require a trade-off of expense and/or resilience. Modest lengthening of allowable RTOs and RPOs can result in major cost and effort savings. Make certain that the key members of your organization understand how all of these numbers will impact them and their operations during an outage.
If you need some help defining RTO and RPO in your organization, let me know in the comments section below and I will help you out!
Active Directory is the bedrock of most Windows environments, so it’s best to be prepared if disaster strikes.
AD is an essential component in most organizations. You should monitor and maintain AD, such as by clearing out user and computer accounts you no longer need. With routine care, AD will run properly, but unforeseen issues can arise. There are a few common Active Directory recovery procedures you can follow using out-of-the-box technology.
Loss of a domain controller
Many administrators see losing a domain controller as a huge disaster, but the Active Directory recovery effort is relatively simple — unless your AD was not properly designed and configured. You should never rely on a single domain controller in your domain, and large sites should have multiple domain controllers. Correctly configured site links will keep authentication and authorization working even if the site loses its domain controller.
You have two possible approaches to resolve the loss of a domain controller. The first option is to try to recover the domain controller and bring it back into service. The second option is to replace the domain controller. I recommend adopting the second approach, which requires the following actions:
Transfer or seize any flexible single master operation roles to an active domain controller. If you seize the role, then you must ensure that the old role holder is never brought back into service.
Remove the old domain controller’s account from AD. This will also remove any metadata associated with the domain controller.
Build a new server, join it to the domain, install Active Directory Domain Services (AD DS) and promote it to a domain controller.
Allow replication to repopulate the AD data.
How to protect AD data
Protecting data can go a long way to make an Active Directory recovery less of a problem. There are a number of ways to protect AD data. These techniques, by themselves, might not be sufficient. But, when you combine them, they provide a defense in depth that should enable you to overcome most, if not all, disasters.
First, enable accidental deletion protection on all of your organizational units (OUs), as well as user and computer accounts. This won’t stop administrators from removing an account, but they will get a warning, which might prevent an accident.
Recover accounts from the AD recycle bin
Another way to avoid trouble is to enable the AD recycle bin. This is an optional feature used to restore a deleted object.
Enable-ADOptionalFeature -Identity 'Recycle Bin Feature' -Scope ForestOrConfigurationSet -Target sphinx.org -Confirm:$false
After installing the feature, you may need to enable it through AD Administrative Center. Once added, you can’t uninstall the recycle bin.
Let’s run through a scenario where a user, whose properties are shown in the screenshot below, has been deleted.
To check for deleted user accounts, run a search in the recycle bin:
Press Ctrl-C in the console in which you ran Dsamain, and then unmount the snapshot:
ntdsutil snapshot "unmount *" quit quit
Run an authoritative restore from a backup
In the last scenario, imagine you lost a whole OU’s worth of data, including the OU. You could do an Active Directory recovery using data from the recycle bin, but that would mean restoring the OU and any OUs it contained. You would then have to restore each individual user account. This could be a tedious and error-prone process if the data in the user accounts in the OU changes frequently. The solution is to perform an authoritative restore.
Before you can perform a restore, you need a backup. We’ll use Windows Server Backup because it is readily available. Run the following PowerShell command to install:
In this example, let’s say an OU called Test with some critical user accounts got deleted.
Reboot the domain controller on which you’ve performed the backup, and go into Directory Services Recovery Mode. If your domain controller is a VM, you may need to use Msconfig to set the boot option rather than using the F8 key to get to the boot options menu.
Restart the domain controller. Use Msconfig before the reboot to reset to a normal start.
The OU will be restored on your domain controller and will replicate to the other domain controllers in AD.
A complete loss of AD requires intervention
In the unlikely event of losing your entire AD forest, you’ll need to work through the AD forest recovery guide at this link. If you have a support agreement with Microsoft, then this would be the ideal time to use it.
Datrium plans to open its new cloud disaster recovery as a service to any VMware vSphere users in 2020, even if they’re not customers of Datrium’s DVX infrastructure software.
Datrium released disaster recovery as a service with VMware Cloud on AWS in September for DVX customers as an alternative to potentially costly professional services or a secondary physical site. DRaaS enables DVX users to spin up protected virtual machines (VMs) on demand in VMware Cloud on AWS in the event of a disaster. Datrium takes care of all of the ordering, billing and support for the cloud DR.
In the first quarter, Datrium plans to add a new Datrium DRaaS Connect for VMware users who deploy vSphere infrastructure on premises and do not use Datrium storage. Datrium DRaaS Connect software would deduplicate, compress and encrypt vSphere snapshots and replicate them to Amazon S3 object storage for cloud DR. Users could set backup policies and categorize VMs into protection groups, setting different service-level agreements for each one, Datrium CTO Sazzala Reddy said.
A second Datrium DRaaS Connect offering will enable VMware Cloud users to automatically fail over workloads from one AWS Availability Zone (AZ) to another if an Amazon AZ goes down. Datrium stores deduplicated vSphere snapshots on Amazon S3, and the snapshots are replicated to three AZs by default, Datrium chief product officer Brian Biles said.
Speedy cloud DR
Datrium claims system recovery can happen on VMware Cloud within minutes from the snapshots stored in Amazon S3, because it requires no conversion from a different virtual machine or cloud format. Unlike some backup products, Datrium does not convert VMs from VMware’s format to Amazon’s format and can boot VMs directly from the Amazon data store.
“The challenge with a backup-only product is that it takes days if you want to rehydrate the data and copy the data into a primary storage system,” Reddy said.
Although the “instant RTO” that Datrium claims to provide may not be important to all VMware users, reducing recovery time is generally a high priority, especially to combat ransomware attacks. Datrium commissioned a third party to conduct a survey of 395 IT professionals, and about half said they experienced a DR event in the last 24 months. Ransomware was the leading cause, hitting 36% of those who reported a DR event, followed by power outages (26%).
The Orange County Transportation Authority (OCTA) information systems department spent a weekend recovering from a zero-day malware exploit that hit nearly three years ago on a Thursday afternoon. The malware came in through a contractor’s VPN connection and took out more than 85 servers, according to Michael Beerer, a senior section manager for online system and network administration of OCTA’s information systems department.
Beerer said the information systems team restored critical applications by Friday evening and the rest by Sunday afternoon. But OCTA now wants to recover more quickly if a disaster should happen again, he said.
OCTA is now building out a new data center with Datrium DVX storage for its VMware VMs and possibly Red Hat KVM in the future. Beerer said DVX provides an edge in performance and cost over alternatives he considered. Because DVX disaggregates storage and compute nodes, OCTA can increase storage capacity without having to also add compute resources, he said.
Datrium cloud DR advantages
Beerer said the addition of Datrium DRaaS would make sense because OCTA can manage it from the same DVX interface. Datrium’s deduplication, compression and transmission of only changed data blocks would also eliminate the need for a pricy “big, fat pipe” and reduce cloud storage requirements and costs over other options, he said. Plus, Datrium facilitates application consistency by grouping applications into one service and taking backups at similar times before moving data to the cloud, Beerer said.
Datrium’s “Instant RTO” is not critical for OCTA. Beerer said anything that can speed the recovery process is interesting, but users also need to weigh that benefit against any potential additional costs for storage and bandwidth.
“There are customers where a second or two of downtime can mean thousands of dollars. We’re not in that situation. We’re not a financial company,” Beerer said. He noted that OCTA would need to get critical servers up and running in less than 24 hours.
Reddy said Datrium offers two cost models: a low-cost option with a 60-minute window and a “slightly more expensive” option in which at least a few VMware servers are always on standby.
Pricing for Datrium DRaaS starts at $23,000 per year, with support for 100 hours of VMware Cloud on-demand hosts for testing, 5 TB of S3 capacity for deduplicated and encrypted snapshots, and up to 1 TB per year of cloud egress. Pricing was unavailable for the upcoming DRaaS Connect options.
Other cloud DR options
Jeff Kato, a senior storage analyst at Taneja Group, said the new Datrium options would open up to all VMware customers a low-cost DRaaS offering that requires no capital expense. He said most vendors that offer DR from their on-premises systems to the cloud force customers to buy their primary storage.
George Crump, president and founder of Storage Switzerland, said data protection vendors such as Commvault, Druva, Veeam, Veritas and Zerto also can do some form of recovery in the cloud, but it’s “not as seamless as you might want it to be.”
“Datrium has gone so far as to converge primary storage with data protection and backup software,” Crump said. “They have a very good automation engine that allows customers to essentially draw their disaster recovery plan. They use VMware Cloud on Amazon, so the customer doesn’t have to go through any conversion process. And they’ve solved the riddle of: ‘How do you store data in S3 but recover on high-performance storage?’”
Scott Sinclair, a senior analyst at Enterprise Strategy Group, said using cloud resources for backup and DR often means either expensive, high-performance storage or lower cost S3 storage that requires a time-consuming migration to get data out of it.
“The Datrium architecture is really interesting because of how they’re able to essentially still let you use the lower cost tier but make the storage seem very high performance once you start populating it,” Sinclair said.
The rise of ransomware has had a significant effect on modern disaster recovery, shaping the way we protect data and plan a recovery. It does not bring the same physical destruction of a natural disaster, but the effects within an organization — and on its reputation — can be lasting.
It’s no wonder that recovering from ransomware has become such a priority in recent years.
It’s hard to imagine a time when ransomware wasn’t a threat, but while cyberattacks date back as far as the late 1980s, ransomware in particular has had a relatively recent rise in prominence. Ransomware is a type of malware attack that can be carried out in a number of ways, but generally the “ransom” part of the name comes from one of the ways attackers hope to profit from it. The victim’s data is locked, often behind encryption, and held for ransom until the attacker is paid. Assuming the attacker is telling the truth, the data will be decrypted and returned. Again, this assumes that the anonymous person or group that just stole your data is being honest.
“Just pay the ransom” is rarely the first piece of advice an expert will offer. Not only do you not know if payment will actually result in your computer being unlocked, but developments in backup and recovery have made recovering from ransomware without paying the attacker possible. While this method of cyberattack seems specially designed to make victims panic and pay up, doing so does not guarantee you’ll get your data back or won’t be asked for more money.
Disaster recovery has changed significantly in the 20 years TechTarget has been covering technology news, but the rapid rise of ransomware to the top of the potential disaster pyramid is one of the more remarkable changes to occur. According to a U.S. government report, 4,000 ransomware attacks were occurring daily by 2016, a 300% increase over the previous year. Ransomware recovery has changed the disaster recovery model, and it won’t be going away any time soon. In this brief retrospective, take a look back at the major attacks that made headlines, evolving advice and warnings regarding ransomware, and how organizations are fighting back.
In the news
The appropriately named WannaCry ransomware attack began spreading in May 2017, using an exploit leaked from the National Security Agency targeting Windows computers. WannaCry is a worm, which means that it can spread without participation from the victims, unlike phishing attacks, which require action from the recipient to spread widely.
How big was the WannaCry attack? Affecting computers in as many as 150 countries, WannaCry is estimated to have caused hundreds of millions of dollars in damages. According to cyber risk modeling company Cyence, the total costs associated with the attack could be as high as $4 billion.
Rather than the price of the ransom itself, the biggest issue companies face is the cost of being down. Because so many organizations were infected with WannaCry, news spread that those who paid the ransom were never given the decryption key, so most victims did not pay. However, many took a financial hit from the downtime the attack caused. Another major attack in 2017, NotPetya, cost Danish shipping giant A.P. Moller-Maersk hundreds of millions of dollars. And that’s just one victim.
In 2018, the city of Atlanta’s recovery from ransomware ended up costing more than $5 million, and shut down several city departments for five days. In the Matanuska-Susitna borough of Alaska in 2018, 120 of 150 servers were affected by ransomware, and the government workers resorted to using typewriters to stay operational. Whether it is on a global or local scale, the consequences of ransomware are clear.
Taking center stage
Looking back, the massive increase in ransomware attacks between 2015 and 2016 signaled when ransomware really began to take its place at the head of the data threat pack. Experts began emphasizing not only the importance of backup and data protection against attacks, but also the need to plan for future recoveries. Depending on your DR strategy, recovering from ransomware could fit into your current plan, or you might have to start considering an overhaul.
By 2017, the ransomware threat was impossible to ignore. According to a 2018 Verizon Data Breach Report, 39% of malware attacks carried out in 2017 were ransomware, and ransomware had soared from being the fifth most common type of malware to number one.
Ransomware was not only becoming more prominent, but more sophisticated as well. Best practices for DR highlighted preparation for ransomware, and an emphasis on IT resiliency entered backup and recovery discussions. Protecting against ransomware became less about wondering what would happen if your organization was attacked, and more about what you would do when your organization was attacked. Ransomware recovery planning wasn’t just a good idea, it was a priority.
As a result of the recent epidemic, more organizations appear to be considering disaster recovery planning in general. As unthinkable as it may seem, many organizations have been reluctant to invest in disaster recovery, viewing it as something they might need eventually. This mindset is dangerous, and results in many companies not having a recovery plan in place until it’s too late.
While ransomware attacks may feel like an inevitability — which is how companies should prepare — that doesn’t mean the end is nigh. Recovering from ransomware is possible, and with the right amount of preparation and help, it can be done.
The modern backup market is evolving in such a way that downtime is considered practically unacceptable, which bodes well for ransomware recovery. Having frequent backups available is a major element of recovering, and taking advantage of vendor offerings can give you a boost when it comes to frequent, secure backups.
Vendors such as Reduxio, Nasuni and Carbonite have developed tools aimed at ransomware recovery, and can have you back up and running without significant data loss within hours. Whether the trick is backdating, snapshots, cloud-based backup and recovery, or server-level restores, numerous tools out there can help with recovery efforts. Other vendors working in this space include Acronis, Asigra, Barracuda, Commvault, Datto, Infrascale, Quorum, Unitrends and Zerto.
Along with a wider array of tech options, more information about ransomware is available than in the past. This is particularly helpful with ransomware attacks, because the attacks in part rely on the victims unwittingly participating. Whether you’re looking for tips on protecting against attacks or recovering after the fact, a wealth of information is available.
The widespread nature of ransomware is alarming, but also provides first-hand accounts of what happened and what was done to recover after the attack. You may not know when ransomware is going to strike, but recovery is no longer a mystery.
Meeting an organization’s disaster recovery challenges requires addressing problems from several angles based on specific recovery point and recovery time objectives. Today’s tight RTO and RPO expectations mean almost no data can be lost and almost no downtime is tolerated.
To meet those expectations, businesses must move beyond backup and consider a data replication strategy. Modern replication products offer more than just a rapid disaster recovery copy of data, though. They can help with cloud migration, using the cloud as a DR site and even solving copy data challenges.
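As a rough illustration of why backup alone can fall short of tight objectives, here is a minimal Python sketch. It assumes a simplified model in which worst-case data loss equals the interval between copies and downtime equals the time needed to restore service. The function name and all sample figures are hypothetical and do not describe any particular product.

```python
from datetime import timedelta

def meets_objectives(copy_interval, restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) under a simplified model.

    Assumption: worst-case data loss equals the time between copies,
    and downtime equals the time needed to restore service.
    """
    return copy_interval <= rpo, restore_time <= rto

# Hypothetical targets: a 15-minute RPO and a 1-hour RTO.
rpo, rto = timedelta(minutes=15), timedelta(hours=1)

# Hypothetical nightly backup with a 4-hour restore.
nightly = meets_objectives(timedelta(hours=24), timedelta(hours=4), rpo, rto)

# Hypothetical replication every 5 minutes with a 10-minute failover.
replicated = meets_objectives(timedelta(minutes=5), timedelta(minutes=10), rpo, rto)

print(nightly)     # (False, False)
print(replicated)  # (True, True)
```

Real environments are messier than this model suggests, since restore times vary with data volume and bandwidth, but the comparison shows how the copy interval bounds the achievable RPO.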
Replication software comes in two forms. One is integrated into a storage system, and the other is bought separately. Both have their strengths and weaknesses.
An integrated data replication strategy
The integrated form of replication has a few advantages. It’s often bundled at no charge or is relatively inexpensive. Of course, nothing in life is really free. The customer pays extra for the storage hardware in order to get the “free” software. In addition, storage-based replication is relatively easy to manage at scale. Most storage system replication works at a volume level, so one job replicates the entire volume, even if there are a thousand virtual machines on it. And finally, storage system-based replication is often backup-controlled, meaning the replication job can be integrated and managed by backup software.
There are, however, problems with a storage system-based data replication strategy. First, it’s specific to that storage system. Consequently, since most data centers use multiple storage systems from different vendors, they must also manage multiple replication products. Second, the advantage of replicating entire volumes can be a disadvantage, because some data centers may not want to replicate every application on a volume. Third, most storage system replication inadequately supports the cloud.
IT typically installs stand-alone replication software on each host it’s protecting or implements it into the cluster in a hypervisor environment. Flexibility is among software-based replication’s advantages. The same software can replicate from any hardware platform to any other hardware platform, letting IT mix and match source and target storage devices. The second advantage is that software-based replication can be more granular about what’s replicated and how frequently replication occurs. And the third advantage is that most software-based replication offers excellent cloud support.
At a minimum, the cloud is used as a DR target for data, but it’s also used as an entire disaster recovery site, not just a copy. This means IT can instantiate virtual machines, using cloud compute in addition to cloud storage. Some approaches go further with cloud support, allowing replication across multiple clouds or from the cloud back to the original data center.
The primary downside of a stand-alone data replication strategy is it must be purchased, because it isn’t bundled with storage hardware. Its granularity also means dozens, if not hundreds, of jobs must be managed, although several stand-alone data replication products have added the ability to group jobs by type. Finally, there isn’t wide support from backup software vendors for these products, so any integration is a manual process, requiring custom scripts.
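The tradeoff between volume-level and granular replication can be sketched in a few lines of Python. The inventory, VM names and function names below are all hypothetical, and the sketch models only how jobs are counted, not how any vendor's product schedules them.

```python
from collections import defaultdict

# Hypothetical inventory: (vm_name, volume, application_type, wanted_for_dr)
inventory = [
    ("vm-sql-01", "vol-a", "database", True),
    ("vm-sql-02", "vol-a", "database", True),
    ("vm-web-01", "vol-a", "web", True),
    ("vm-test-01", "vol-a", "test", False),
    ("vm-web-02", "vol-b", "web", True),
    ("vm-test-02", "vol-b", "test", False),
]

def volume_level_jobs(vms):
    """One job per volume; every VM on the volume is copied, wanted or not."""
    jobs = defaultdict(list)
    for name, volume, _app_type, _wanted in vms:
        jobs[volume].append(name)
    return dict(jobs)

def per_vm_jobs(vms):
    """One job per VM, limited to the VMs flagged for replication."""
    return {name: [name] for name, _vol, _app_type, wanted in vms if wanted}

def grouped_jobs(vms):
    """Flagged VMs grouped by application type to reduce the job count."""
    jobs = defaultdict(list)
    for name, _vol, app_type, wanted in vms:
        if wanted:
            jobs[app_type].append(name)
    return dict(jobs)

print(len(volume_level_jobs(inventory)))  # 2 jobs, but all 6 VMs are copied
print(len(per_vm_jobs(inventory)))        # 4 jobs, only the wanted VMs
print(len(grouped_jobs(inventory)))       # 2 jobs, only the wanted VMs
```

In this toy inventory, volume-level replication needs the fewest jobs but copies the two test VMs nobody asked for, while grouping by type recovers most of that simplicity without giving up selectivity.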
Modern replication features
Modern replication software should support the cloud and support it well. This requirement draws a line of suspicion around storage systems with built-in replication, because cloud support is generally so weak. Replication software should have the ability to replicate data to any cloud and use that cloud to keep a DR copy of that data. It should also let IT start up application instances in the cloud, potentially completely replacing an organization’s DR site. Last, the software should support multi-cloud replication to ensure both on-premises and cloud-based applications are protected.
Another feature to look for in modern replication is integration into data protection software. This capability can take two forms: The software can manage the replication process on the storage system, or the data protection software could provide replication. Several leading data protection products can manage snapshots and replication functions on other vendors’ storage systems. Doing so eliminates some of the concern around running several different storage system replication products.
Data protection software that integrates replication can either be traditional backup software with an added replication function or traditional replication software with a file history capability, potentially eliminating the need for backup software. It’s important for IT to make sure the capabilities of any combined product meet all backup and replication needs.
How to make the replication decision
The increased expectation of rapid recovery with almost no data loss is something everyone in IT will have to address. While backup software has improved significantly, tight RPOs and RTOs mean most organizations will need replication as well. The pros and cons of both an integrated and stand-alone data replication strategy hinge on the environment in which they’re deployed.
Each IT shop must decide which type of replication best meets its current needs. At the same time, IT planners must figure out how that new data replication product will integrate with existing storage hardware and future initiatives like the cloud.
When you open a large public facility right on the water in Miami, a good disaster recovery setup is an essential task for an IT team. Hurricane Irma’s assault on Florida in September 2017 made that clear to the Phillip and Patricia Frost Museum of Science team.
The expected Category 5 hurricane moving in on Florida had the new Frost Science Museum square in its sights. Irma turned out to be less threatening to Miami than feared, and the then-4-month-old building suffered no major damage. Still, the museum’s vice president of technology said he felt prepared for the worst with his IT DR planning.
When preparing to open the museum on a 250,000-square-foot location on the Miami waterfront, technology chief Brooks Weisblat installed a new Dell EMC SAN in a fully redundant data center and set up a colocation site in Atlanta as part of its disaster recovery plan. The downgraded Category 4 hurricane dumped water into the building, but did no serious damage and caused no downtime.
The new Frost Science Museum building features three diesel generators and redundant power, including 20 minutes of backup power in the battery room that should provide enough juice until the backup generators come online. While much of southern Florida lost power during Irma, the museum did not.
“We’re sitting right on the water. It was supposed to be a major hurricane coming straight through Miami. But six hours before hitting, it veered off, so it wasn’t a direct hit,” Weisblat said. “We have two weather stations on the building, and we recorded force winds of 90 to 95 miles per hour. It could have been 190 mile-per-hour winds, and that would have been a different story.”
Advance warning of the hurricane prompted the museum’s team to bolster its IT DR planning.
“The hurricane moved us to get all of our backups in order,” Weisblat said. “Opening the building was intensive. We had backups internally, but we didn’t have off-site backups yet. It pushed us to get a colocated data center in Atlanta when the hurricane warnings came about a week before. At least we had a lot of advance notice for this one. Except for some water here and there, the museum did well.”
The Frost Science Museum raised $330 million in funding to build the new center in downtown Miami, closing its Coconut Grove site in August 2015. Museum organizers said they hoped to attract 750,000 visitors in the first year at the new site. From its May opening through Oct. 31, more than 525,000 people visited the museum.
Shifting to SAN, all-flash
When moving, Frost Science installed a dual-controller Dell EMC SC9000 — formerly Compellent — all-flash array, with 112 TB of capacity connected to 10 Dell EMC PowerEdge servers virtualized with VMware. As part of its IT DR planning, the museum uses Veeam Software to back up virtual machines to a Dell PowerEdge R530 server, with 40 TB of hard disk drive storage on site, and it replicates those backups to another PowerEdge server in the Atlanta location.
“If something happens at this site, we’re able to launch a limited number of VMs to power finance, ticketing and reporting,” Weisblat said. “We can control those servers out of Atlanta if we’re unable to get into the building.”
Before opening the new building, Weisblat’s team migrated all VMs between the old and new sites. The process took three weeks. “We had to take down services, copy them to drives a few miles away, then bring those into the new environment and do an import into a new VM cluster,” he said.
The data center sits on the third floor of the new building, 60 feet above sea level. It takes up 16 full cabinets, plus eight racks for networking, Weisblat said.
Frost Science Museum had no SAN in the old building. Its IT ran on 23 servers. Weisblat said he migrated the stand-alone servers into the VMware cluster on the Compellent array before moving. “That way, when the new system came online, it would be easy to move those servers over as files, and we would not have to do migrations into VMware in the new building during the crush time for our opening,” he said.
The Dell EMC SAN runs all critical applications, including the customer relationship management system, exhibit content management, property management system software, the museum website, online ticketing and building security management systems. The security system controls electricity, lights, solar power, centralized antivirus deployments and network access control. “Everything is powered off this one system,” Weisblat said.
The SAN has two Brocade — now Broadcom — Fibre Channel switches for redundancy. “We can unplug hosts; everything keeps running,” Weisblat said. “We can unplug one of the storage arrays, and everything keeps running. The top-of-rack 10-gig [Extreme Avaya Ethernet] switches are also fully redundant. We can lose one of those.”
He said since installing the new array, one solid-state drive went out. “The SSD sent us an alert, and Dell had parts to us in two hours. Before I knew something was wrong, they contacted me.”
Whether it’s a failed SSD or an impending hurricane, early alerts and IT DR planning certainly help when dealing with disasters.
Say the word “disaster” and what comes to mind? An earthquake, a drought, a flood, a tsunami, a hurricane? These are big and brutish events. They grab headlines, inspire people to donate, and trigger international relief efforts.
But what about the many micro-disasters that can, at any time, befall poor families across the developing world? For those who live on a perpetual economic knife edge, even a small misfortune or an unexpected turn of events can devastate their hopes and dreams.
Let’s turn to Thimi, a tiny village in the ancient valley of Bhaktapur in Nepal – a nation that sits in the shadow of the Himalayas and is among the world’s poorest. An overwhelming majority of its 30 million people rely on farming to subsist – often on fragmented, hilly and marginal land where weather and other conditions are subject to extremes. In this rural society, a family typically measures its wealth in the number of animals it keeps.
For years, Rajesh Ghimire and his wife, Sharadha, worked hard to build up a modest herd of 45 cows, goats, and buffaloes. The farm was generating enough income to raise their two children, support four other relatives, and even pay six workers to help out. The Ghimires had their eyes fixed on better times ahead, and were saving to send their daughter, Ekta, to medical school.
Then, their own micro-disaster struck. A series of heatwaves triggered an outbreak of anthrax. Almost half of their animals were wiped out and, with that, most of their dreams. The money that had been put away for Ekta’s studies had to be used to save the farm. Seven years later, the family is still trying to claw back what it lost.
Whether it’s responding to a natural disaster or helping a developing country improve its education system or water quality, international development company Chemonics needs to build out specialized business processes on the fly. That’s how it keeps more than 60 humanitarian projects around the world moving, despite each one having its own technological needs that are dependent on size, scope and location.
Roughly three years ago, the Washington, D.C.-based company began looking at business applications that could simplify the HR process of finding and hiring the talent needed for its distinct projects, ultimately settling on Microsoft’s Dynamics 365 last October. But Chemonics was still longing for more HR capabilities, like onboarding and contract management, and it was looking at third-party tools to help fill the holes when Microsoft told it about a new feature coming down the line: Dynamics 365 for Talent.
The new Dynamics 365 feature, which was made generally available on Aug. 1, helps streamline routine tasks and automates staffing processes.
“Essentially, we build a brand-new company of anywhere from 15 to 20 people, to 400 to 500 people,” said Eric Reading, executive vice president at Chemonics. “Our business process and the way we organize ourselves needs to be very flexible and oriented around the rapidly changing nature of the geographic and organizational layouts of our company.”
‘We can work in real time’
Founded in 1975, Chemonics has done humanitarian work all over the globe, including current projects in Afghanistan helping with sustainable agriculture and literacy, policy reform in Jordan and health services in Angola, as well as dozens more. The process calls for a local office to be set up in the corresponding region, with recruiting and hiring of talent both worldwide and local to that region.
“We have roughly 4,500 staff around the world, with the smallest office being a half dozen staff and the largest around 400 people,” Reading said. “It’s a pretty dramatic range of scale we have to work in. A lot of those systems and processes we used were designed during a time when we used telex machines. Things were manual or with little automation due to the geographic separation.”
The growth of cloud hosting allowed Chemonics to think more modernly about its technology, as internet infrastructure can be spotty in some of the developing nations in which it works.
“It took us to a place where it was possible to have our whole global organization operating on a single framework for IT and business process,” Reading said. “We can work in real time and collaborate.”
Chemonics researched roughly a dozen different software providers, ultimately narrowing the list to four, then to two — Oracle and Dynamics 365 — before settling on Dynamics for its UI consistency, simplicity and licensing structure.
“The consistency of experience across different parts of the interface was valuable,” Reading said. “There are a lot of elements of business that have to be done a certain way because we’re a government contractor and work on programs that need to comply in a lot of different legal departments. It allowed us to do more at a deeper level without having to completely customize everything.”
And while Chemonics’ first iteration of Dynamics helped with collaboration and consistency among its global projects, it still left some features to be desired in the HR department.
“At the time, there was an incompleteness of the HR offering, and it didn’t satisfy our needs in that area,” Reading said. “We were evaluating options on what do we append in to get that resource functionality. We talked with Microsoft about it, and they asked us to give them a little bit of time to see what was coming down the road.”
Reading said Chemonics was one of the first Microsoft customers to set up Dynamics 365 for Talent for a project in the Dominican Republic.
“After [implementing Dynamics 365 for Talent], we stood up the Dominican Republic office in a 21-day period,” Reading said, adding that the typical goal is 60 days.
Integrating with LinkedIn
Dynamics 365 for Talent was one of two major upgrades that Microsoft brought to its business application earlier this year. The other was the integration of LinkedIn Sales Navigator with Dynamics 365 for Sales, which allows Dynamics customers to mine LinkedIn’s 500 million members for additional sales leads.
Integrating LinkedIn’s vast amount of professional data into Dynamics also helps with the hiring process that Chemonics needed.
“The new offerings focus on the hiring process, the employee onboarding process and the underlying core needs of HR,” said Mike Ehrenberg, chief strategist for Microsoft. “We’ve had these abilities before, but it’s much more modern and richer now.”
Reading said Chemonics uses LinkedIn as one of the first places to find specialized and specific talent.
“We may need to find an expert in methodology of literacy that can work in a particular language,” Reading said. “Finding that specialized skill set and being able to link it from LinkedIn to the Talent offering is exciting.”
Prior to Dynamics 365 for Talent, the hiring process for Chemonics’ different projects was manual — and the results varied.
“We often had lots of one-page Word documents that may or may not get reused,” Reading said. “We’d have checklists and other manual management work that had a fair level of inconsistency with it.”
Licensing easy to work with
The final aspect that drew Chemonics toward Dynamics 365 was the malleable licensing Microsoft offered, with both an overarching license for management and administrators and a team member license for employees with a simpler routine.
“Our organization doesn’t break down neatly among traditional roles,” Reading said. “The licensing made it easier to manage the process and much more competitive on a pricing standpoint.”
The full use of Dynamics 365 costs $210 per user, per month, with team members’ licenses costing $8 per user, per month to execute basic processes and shared knowledge. There’s also an operations activity license for $50 per user, per month and an operations devices license for $75 per user, per month. Microsoft also offers other cheaper, stripped-down licenses of Dynamics 365, some of which don’t include Dynamics 365 for Talent.
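To see how much a mixed license model can matter, consider the short Python sketch below. It uses the four per-user monthly prices quoted above; the seat counts are a hypothetical 400-person office, not Chemonics' actual license mix, and the function name is illustrative.

```python
# Per-user monthly list prices quoted in the article (USD).
PRICES = {
    "full": 210,
    "team_member": 8,
    "operations_activity": 50,
    "operations_device": 75,
}

def monthly_cost(seats):
    """Sum price times seat count; raises KeyError for an unknown license type."""
    return sum(PRICES[license_type] * count for license_type, count in seats.items())

# Hypothetical 400-person office: 20 full users, 330 team members
# and 50 operations activity users.
mixed_office = {"full": 20, "team_member": 330, "operations_activity": 50}

print(monthly_cost(mixed_office))   # 9340
print(monthly_cost({"full": 400}))  # 84000
```

Under these assumed seat counts, the mixed model comes to $9,340 per month against $84,000 if every employee held a full license.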