Hosting Providers and HRM

If you are a hosting provider interested in offering DR as a service, you should go over this great post by Gaurav on how Hyper-V Recovery Manager (HRM) helps you build this capability: http://blogs.technet.com/b/scvmm/archive/2014/02/18/disaster-recovery-as-a-service-by-hosting-service-providers-using-windows-azure-hyper-v-recovery-manager.aspx

The post provides a high-level overview of the capability and a detailed FAQ covering the common queries we have heard from our customers. If you have any further questions, leave a comment on the blog post.

Hyper-V Replica Certificate based authentication and Proxy servers

Continuing from where we left off, I have a small lab deployment which consists of an AD, DNS, a proxy server (Forefront TMG 2010 on WS 2008 R2 SP1), primary servers and replica servers. With the primary server behind the proxy (forward proxy), when I tried to enable replication using certificate-based authentication I got the following error message: The handle is in the wrong state for the requested operation (0x00002EF3)

[Screenshot: Enable Replication error 0x00002EF3]

That didn’t convey too much, did it? Fortunately I had netmon running in the background, and the only network traffic seen was between the primary server and the proxy. A particular HTTP response caught my eye:

[Screenshot: Network Monitor capture of the HTTP response]

The highlighted text indicated that the proxy was terminating the connection and returning a ‘Bad gateway’ error. A closer look at the TMG error log indicated that the error was encountered during the HTTPS-inspect state.

After some Bing’ing of the errors, the pieces began to emerge. When HTTPS inspection is enabled, the TMG server terminates the connection and establishes a new connection (in our case to the replica server), acting as a trusted man-in-the-middle. This doesn’t work for Hyper-V Replica because we mutually authenticate the primary and replica server endpoints. To work around the situation, I disabled HTTPS inspection on the proxy server

[Screenshot: TMG HTTPS inspection settings]

and things worked as expected. The primary server was able to establish the connection and replication was on track.
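
For completeness, here is a minimal sketch of what enabling replication with certificate-based authentication looks like in PowerShell; the VM name, replica server FQDN, port and thumbprint below are placeholders from my lab and will differ in your environment:

# Enable replication over HTTPS using certificate-based authentication (port 443 is a common choice)
Enable-VMReplication -VMName NewVM5 -ReplicaServerName prb2.hvrlab.com -ReplicaServerPort 443 -AuthenticationType Certificate -CertificateThumbprint "<client certificate thumbprint>"

# Kick off the initial replication
Start-VMInitialReplication -VMName NewVM5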

Hyper-V Replica & Proxy Servers on primary site

I was tinkering around with my lab setup, which consists of a domain, a proxy server, and primary and replica servers. There are some gotchas when it comes to Hyper-V Replica and proxy servers, and I realized that we did not have any posts around this. So here goes.

If the primary server is behind a proxy server (forward proxy) and Kerberos-based authentication is used to establish a connection between the primary and replica server, you might encounter the following error: Hyper-V cannot connect to the specified Replica server due to connection timed out. Verify if a network connection exists to the Replica server or if the proxy settings have been configured appropriately to allow replication traffic.

[Screenshot: Enable Replication error dialog]

I have a Forefront TMG 2010 server acting as the proxy server, and the logs on the proxy server show the following:

[Screenshot: proxy server logs in Forefront TMG]

I also had netmon running on my primary server, and the logs didn’t indicate much other than the fact that the connection never made it to the replica server; something between the primary and replica server caused the connection to be terminated. The primary server name in this deployment is prb8.hvrlab.com and the proxy server is w2k8r2proxy1.hvrlab.com.

[Screenshot: Network Monitor capture on the primary server]

If a successful connection goes through, you will see a spew of messages in netmon.
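
As a quick sanity check, you can also verify that the replica server’s listener is reachable from the primary server. This is a minimal sketch assuming the replica listener is on port 25000, as configured later in this post:

# Check basic TCP reachability from the primary server to the replica listener
Test-NetConnection -ComputerName prb2.hvrlab.com -Port 25000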

When I first observed this issue while building the product, I reached out to the Forefront folks at Microsoft to understand the behavior. It turns out that the Forefront TMG proxy server terminates any outbound (or upload) connection whose content length (request header) is > 4 GB.

Hyper-V Replica sets a high content length because we expect to transfer large files (VHDs), and it saves us the effort of re-establishing the connection each time. A closer inspection of a POST request shows the content length being set by Hyper-V Replica (ahem, ~500 GB):

[Screenshot: POST request header showing the content length]

The proxy server balks and returns a ‘Bad Request’ response:

[Screenshot: the proxy server’s Bad Request response]

That isn’t super helpful by any means, and the error message unfortunately isn’t too specific either. But now you know the reason for the failure: the proxy server terminates the connection request and it never reaches the replica server.

So how do we work around it? There are two ways: (1) bypass the proxy server, or (2) use certificate-based authentication (another blog for some other day).

The ability to bypass the proxy server is provided only in PowerShell, through the BypassProxyServer parameter of the Enable-VMReplication cmdlet (http://technet.microsoft.com/en-us/library/jj136049.aspx). When the flag is enabled, the request (for lack of a better word) bypasses the proxy server. E.g.:

Enable-VMReplication -VMName NewVM5 -AuthenticationType Kerberos -ReplicaServerName prb2 -ReplicaServerPort 25000 -BypassProxyServer $true

 

Start-VMInitialReplication -VMName NewVM5

This is not available in the Hyper-V Manager or Failover Cluster Manager UI. It’s supported only in PowerShell (and WMI). Running the above cmdlets will create the replication request and start the initial replication.
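
Once initial replication kicks off, a quick way to confirm that the request actually went through (and did not get swallowed by the proxy) is to check the replication state and health. A minimal sketch, using the same placeholder VM name as above:

# Check the replication configuration and state for the VM
Get-VMReplication -VMName NewVM5

# Check replication health and statistics (pending replication size, last sync time, etc.)
Measure-VMReplication -VMName NewVM5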

Error 0x80090303 when enabling replication

When trying to enable replication on one of my VMs in my lab setup, I encountered the following error – Hyper-V failed to authenticate the Replica server using Kerberos authentication. Error: The specified target is unknown or unreachable (0x80090303).

[Screenshot: Enable Replication error 0x80090303]

Needless to say, I was able to reach the replica server (prb2.hvrlab.com in my case), the firewall settings on the replica server looked OK, and I was able to TS (remote desktop) into the replica server as well. As the error message indicated that the failure was encountered when authenticating the replica server, I decided to check the Event Viewer logs on the replica server. A couple of errors caught my eye:

(1) SPN registration failures

[Screenshot: SPN registration failure event]

(2) This was followed by an error message which indicated that the authentication had failed

[Screenshot: authentication failure event]

I was getting somewhere, so I ran the “setspn -L” command to list the currently registered SPNs for the computer, and the Hyper-V Replica entry was conspicuously absent.

I restarted the vmms service, and when I re-ran the command I could see the following (correct) set of entries:

[Screenshot: setspn output showing the registered SPNs]

I have seen these SPN registration failures occur due to intermittent network blips (b.t.w., the following TechNet wiki gives more info on SPN registration: http://social.technet.microsoft.com/wiki/contents/articles/1340.hyper-v-troubleshooting-event-id-14050-vmms.aspx). There are retry semantics to ensure that the SPN registration succeeds, but there could be corner cases (like my messed-up lab setup) where manual intervention may be required to make quicker progress. I also stumbled upon an SPN wiki article, http://social.technet.microsoft.com/wiki/contents/articles/717.service-principal-names-spns-setspn-syntax-setspn-exe.aspx, which gives more info on how to manually register the SPN. I didn’t require the info today, but it’s a good read nevertheless.
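
If you run into the same symptom, here is a minimal sketch of the checks I did on the replica server; run it from an elevated prompt, and note that the exact SPN strings VMMS registers are best compared against the setspn output of a healthy Hyper-V host rather than typed in by hand:

# List the SPNs currently registered for the replica server's computer account
setspn -L prb2

# Restarting the Hyper-V Virtual Machine Management service re-triggers the SPN registration
Restart-Service vmms

# Re-run the listing and confirm the Hyper-V Replica entries are now present
setspn -L prb2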

After fixing the replica server, the enable replication call went through as expected. Back to work…

Hyper-V Replica debugging: Why are very large log files generated?

Quite a few customers have reached out to us with this question, and you can even see a few posts around this on the TechNet Forums. The query comes in various forms:

  • “My log file size was in the MBs and sometime at night it went into the GBs – what happened?”
  • “I have huge amounts of data to sync across once a day when no data is being changed in the guest”
  • “The size of the log file (the .hrl) is growing 10X…”

The problem here is not just the sharp increase in the .hrl file size, but also the fact that the network impact of this churn was not accounted for when planning the datacenter fabric. As a result, there isn’t adequate network bandwidth between the primary and replica sites to transfer the huge log files being generated.

As a first step, the question that customers want answered is: What is causing this churn inside the guest?

Step 1:  Isolate the high-churning processes

Download the script from here: http://gallery.technet.microsoft.com/Hyper-V-Replica-Identify-f09763b6, and copy the script into the virtual machine. The script collects information about the writes done by various processes and writes log files with this data.

I started the debugging process by running the script on a SQL Server virtual machine of my own. I copied the script into the VM and ran it in an elevated PowerShell window. You might run into PowerShell script execution policy restrictions and need to set the execution policy to Unrestricted (http://technet.microsoft.com/en-us/library/ee176961.aspx).
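
For example, a minimal sketch of running the script in the guest; the script file name below is a placeholder for whatever you saved the downloaded script as:

# Relax the execution policy for this PowerShell session only
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Unrestricted

# Run the downloaded churn-collection script (placeholder file name)
.\HVR-IdentifyChurn.ps1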

[Screenshot: the script running in the guest]

At the same time, I was monitoring the VM using Perfmon from the host, checking to see whether any burst of disk activity showed up. The blue line in the Perfmon graph is something I was not expecting to see, and it is significantly higher than the rest of the data – note that the scale for the blue line is 10X that of the red and green lines. (Side note: I was also monitoring the writes from within the guest using Perfmon, to see if there was any mismatch. As you can see from the screenshot below, the two performance monitors are rather in sync :))

[Screenshot: Perfmon on the host and in the guest]
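
If you prefer to script the host-side monitoring instead of using the Perfmon UI, here is a minimal sketch; I’m assuming the Hyper-V Virtual Storage Device counter set is available on the host (its instance names map to the VM’s virtual disk files):

# Sample per-virtual-disk write throughput on the Hyper-V host every 5 seconds
Get-Counter -Counter '\Hyper-V Virtual Storage Device(*)\Write Bytes/sec' -SampleInterval 5 -Continuous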

At this point, I have no clue what in the guest is causing this sort of churn to show up. Fortunately I have the script collecting data inside the guest that I will use for further analysis.

Pull out the two files from the guest VM for analysis in Excel – ProcStats-2.csv and HVRStats-2.csv. Before starting the analysis, one additional bit of Excel manipulation I added was a column called Hour-Minute: it pulls out only the hour and minute from the timestamp (ignoring the seconds) and is used as a field in the PivotTable analysis. I use the following formula in the cell: =TIME(HOUR(A2), MINUTE(A2), 0) where A2 is the timestamp cell for that row. Copy it down the column and Excel will adjust the cell references appropriately.

[Screenshot: the Hour-Minute column in Excel]

 

Overall write statistics (HVR Stats)

Let’s first look at the file HVRStats-2.csv in Excel. Use the data to create a PivotTable and a PivotChart – this gives a summarized view of the writes happening. What we see is that an excessive amount of data gets written at 4:57 AM and 4:58 AM – more than 30X the data written otherwise.

[Screenshot: PivotChart of overall writes from HVRStats-2.csv]

Per process write statistics

Now let’s look at ProcStats-2.csv in Excel. Use the data to create a PivotTable and PivotChart – this gives us a per-process view of what is happening. With the per-process information, we can easily plot the data written by each process and identify the culprit. In this case, SQL Server itself caused the spike in the data written (highlighted in red):

[Screenshot: PivotChart of per-process writes, with the SQL Server spike highlighted]
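
If you’d rather not do the pivoting in Excel, the same per-process rollup can be sketched in PowerShell. The column names used here (ProcessName, BytesWritten) are assumptions on my part and should be adjusted to match the headers in the CSV the script actually produces:

# Load the per-process statistics produced by the script (column names are assumptions)
$stats = Import-Csv .\ProcStats-2.csv

# Total bytes written per process, sorted with the heaviest writers first
$stats | Group-Object ProcessName | ForEach-Object {
    [pscustomobject]@{
        Process      = $_.Name
        BytesWritten = ($_.Group | Measure-Object BytesWritten -Sum).Sum
    }
} | Sort-Object BytesWritten -Descending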

 

This is what the graph looks like for a large data copy operation (~1.5 GB). There is a burst of writes between 1:52 PM and 1:53 PM from Explorer.exe, which corresponds to the copy operation that was initiated.

[Screenshot: PivotChart of writes during the file copy operation]

What next?

At this point, you should be able to differentiate between the following process classes using the process name and PID:

  1. Primary guest workload (eg: SQL Server)
  2. Windows inbox processes (eg: page file, file copy, defragment, search indexer…)
  3. Other/3rd party processes (eg: backup agent, anti-virus…)

Step 2:  Which files are being modified?

Isolating the file sometimes helps in identifying the underlying operation. Once you know which process is causing the churn and at approximately what time, you can use the inbox tool Resource Monitor (resmon.exe) to track the disk activity, filtering to show the details of just the processes you are interested in.

From the previous step you will have the details of the process causing the churn – for example, System (PID 4). Using Resource Monitor you would then find the file being modified – for example, C:\pagefile.sys – which would lead you to the conclusion that it is the page file that is being churned.

[Screenshot: Resource Monitor disk activity]
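
If you want to capture this from the command line instead of (or alongside) Resource Monitor, here is a minimal sketch using the built-in process counters; note that these numbers include all I/O for each process, not just writes to the disks being replicated:

# Sample per-process write throughput inside the guest every 5 seconds, showing the top 5 writers per sample
Get-Counter -Counter '\Process(*)\IO Write Bytes/sec' -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Sort-Object CookedValue -Descending | Select-Object -First 5 }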

 

Alternative tools:

  1. Process Monitor:   http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
  2. Windows Performance Recorder and Windows Performance Analyzer: