Application-consistent recovery points with Windows Server 2008/2003 guest OS

I recently had a conversation with a customer about an interesting problem, and the insights gained are worth sharing. VSS errors were appearing in the guest event viewer while Hyper-V Replica reported the successful creation of application-consistent (VSS-based) recovery points.

Deployment details

The customer had the following setup that was throwing errors:

  1. Primary site:   Hyper-V Cluster with Windows Server 2012 R2
  2. Replica site:   Hyper-V Cluster with Windows Server 2012 R2
  3. Virtual machines:   SQL Server instances running SQL Server 2012 SP1, SQL Server 2005, and SQL Server 2008

When enabling replication, the customer selected the option to create additional recovery points and set the “Volume Shadow Copy Service (VSS) snapshot frequency” to 1 hour. This means that every hour the VSS writer of the guest OS would be invoked to take an application-consistent snapshot.
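
For reference, the same settings can be applied from PowerShell. This is a minimal sketch, not the customer's actual configuration: the VM name SQLVM and the recovery history value of 4 are assumptions chosen for illustration, and only the 1-hour VSS frequency comes from the deployment described above.

```powershell
# Hypothetical VM name; substitute your own.
$vmName = 'SQLVM'

# Keep additional recovery points (4 is an assumed value) and request
# an application-consistent (VSS) snapshot every hour.
Set-VMReplication -VMName $vmName -RecoveryHistory 4 -VSSSnapshotFrequencyHour 1

# Review the resulting replication settings for the VM.
Get-VMReplication -VMName $vmName | Format-List *
```

Run this on the primary server. Set-VMReplication only changes a VM that already has replication enabled.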

Symptoms

With this configuration, the output was contradictory: the guest event viewer showed errors and failures during the VSS process, while the Replica VM showed application-consistent points in its recovery history.

Here is an example of the error registered in the guest:

SQLVM: Loc=SignalAbort. Desc=Client initiates abort. ErrorCode=(0). Process=2644. Thread=7212. Client. Instance=. VD=Global*******

BACKUP failed to complete the command BACKUP DATABASE model. Check the backup application log for detailed messages.

BackupVirtualDeviceFile::SendFileInfoBegin:  failure on backup device '{********-63**-49**-BA**-5DB6********}1'. Operating system error 995(error not found).

Root cause and dealing with the errors

The big question was:  Why was Hyper-V Replica showing application-consistent recovery points if there were failures?

The errors seen by the customer are benign. They are caused by the interaction between Hyper-V and VSS, especially with older versions of the guest OS. Details can be found in the KB article here: http://support.microsoft.com/kb/2952783

The Hyper-V requestor explicitly stops the VSS operation right after the OnThaw phase. While this ensures application consistency of the writes going to the disk, it also results in the VSS errors being logged. Meanwhile, Hyper-V correctly reports the consistency to Hyper-V Replica, which in turn makes sure that the recovery side shows application-consistent points.

A great way to validate whether a recovery point is application-consistent is to do a test failover on that recovery point. After the VM has booted up, check the event viewer logs in the guest. If they contain events pertaining to a rollback, the point is not application-consistent. If there are no such events, the point is application-consistent.
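
The test failover can also be started from PowerShell on the Replica server. The following is a rough sketch under stated assumptions: SQLVM is a hypothetical VM name, and picking the most recent application-consistent point is just one reasonable choice. Checking the guest event logs for rollback events is still done manually inside the test VM.

```powershell
# Run on the Replica server. 'SQLVM' is a hypothetical VM name.
$vmName = 'SQLVM'

# Find the most recent recovery point marked as application-consistent.
$latest = Get-VMSnapshot -VMName $vmName -SnapshotType AppConsistentReplica |
    Sort-Object CreationTime |
    Select-Object -Last 1

if ($null -eq $latest) {
    Write-Warning "No application-consistent recovery points found for $vmName."
}
else {
    # Create a test VM from that recovery point.
    Start-VMFailover -VMRecoverySnapshot $latest -AsTest

    # Next: start the test VM, log on, and inspect the guest event logs
    # for rollback events. When finished, clean up the test failover with:
    #   Stop-VMFailover -VMName $vmName
}
```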

Key Takeaways

  1. For VMs with older operating systems, you can rest assured that Hyper-V Replica is correctly taking an application-consistent snapshot of the virtual machine.
  2. Although errors are seen in the guest, they are benign, and a recovery history with application-consistent points is expected behavior.

Quickly Recovering Replication on Hyper-V

Two weeks ago, I had to recover from a sizable power outage. When this happened, my first priority was to make sure that all of my virtual machines were running well. Once I had done this, my next goal was to get Hyper-V Replica back up and running – so that I would be protected against any future problems.

Now, Hyper-V Replica would have eventually sorted itself out – but I did not want to wait for this to happen organically. I wanted things fixed immediately.

Hyper-V Replica had correctly detected that there was a problem, and had scheduled resynchronization for all of my virtual machines. To speed up the process, I shut down all non-critical virtual machines and then used PowerShell to run the following command:

Get-VM -ComputerName Hyper-V-1, Hyper-V-2 | Where-Object { $_.ReplicationMode -eq "Primary" -and $_.ReplicationHealth -eq "Critical" } | Resume-VMReplication -Resynchronize

This caused resynchronization to start immediately for all virtual machines that were reporting replication in a critical state. At this stage I must give a word of caution. You may be wondering why I shut down non-critical virtual machines before doing this. The reason is that initiating a mass resynchronization like this generates a huge amount of disk activity, as Hyper-V goes through and rechecks all of the data on disk. I shut down non-critical systems to minimize the amount of data churn during this process. Even with this precautionary step, I could feel the system slow down overall while resynchronization was happening.
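
If you want to keep an eye on things while this runs, something like the following should do. It is a small sketch rather than the exact command I used, and it reuses the same two host names as above.

```powershell
# List replication state and health for every replicating VM on both hosts.
Get-VMReplication -ComputerName Hyper-V-1, Hyper-V-2 |
    Sort-Object Health |
    Format-Table VMName, State, Health, Mode -AutoSize
```

Rerun it periodically and watch the State and Health columns change as each virtual machine finishes resynchronizing.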

But after a relatively short period of time, resynchronization was complete and my computers were (almost) back to normal.

Cheers,
Ben