March 23rd update. Microsoft eventually acknowledged the issue: http://support.microsoft.com/kb/2673129/en-us?sd=rss&spid=14134
If you have one or both of these issues, you are probably in the same situation we stumbled into. At a customer site that relies heavily on Hyper-V failover clusters, we started to see “strange” issues:
- The VMs started to shut down unexpectedly and we got timeout errors from rhs.exe (the failover cluster Resource Hosting Subsystem). In the cluster log we could read “Found a RHS.exe failure cause a shared failure of all VM configuration that runs in same RHS process”
- Shutting down a single cluster node took a lifetime (about 1 hour)
Before going forward, a word of warning: we opened a CSS ticket but were not able to get Microsoft to acknowledge the bug. This means that we have a solution, but no clear picture of why all this happened, only hypotheses.
The affected cluster was initially installed as Windows Server 2008 R2 and then upgraded to Service Pack 1. It is a 4-node cluster with VMs configured to have a preferred node and a failback policy scheduled during the night. We use System Center DPM for host-based backup (i.e. VM snapshots), Operations Manager for monitoring, Configuration Manager for inventory and patching, and Virtual Machine Manager to manage the cluster. Forefront Endpoint Protection is the antivirus solution.
The issues I reported are actually two distinct problems, even though we think the first one is somewhat related to the second.
To check if you’re in our situation, try the following steps (beware: this could cause an unexpected shutdown of all or part of your VMs):
- Set a preferred owner list for your VMs and set a failback policy. Let’s say you now have VM1 with a preferred owner list such as Node A, Node D, where Node A and Node D are two different cluster nodes.
- Move VM1 to Node D. Now stop the cluster service on Node A. It should take about 49 minutes; be patient, it will eventually time out and stop.
- Now restart the cluster service on Node A and remove the failback policy for VM1. Move VM1 to Node A and repeat the second step. This time the cluster service will stop immediately.
In our observations, failback policies cause a deadlock in the cluster service when VMs are not on their primary preferred node: any non-forced shutdown will take approximately 49 minutes, or whatever is specified in the cluster private property ShutDownTimeOutInMinutes (http://msdn.microsoft.com/en-us/library/windows/desktop/ee342507(v=vs.85).aspx).
This simple and easy-to-reproduce situation has not been acknowledged by Microsoft support.
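If you want a number instead of a stopwatch, the stop can be timed with a few lines of script. This is only a minimal sketch: the helper names `time_command` and `classify_stop` and the one-minute tolerance are our own assumptions, not anything the cluster provides.

```python
import subprocess
import time


def time_command(command):
    """Run a command, wait for it to finish, return (exit_code, elapsed_seconds)."""
    started = time.monotonic()
    completed = subprocess.run(command, check=False)
    elapsed = time.monotonic() - started
    return completed.returncode, elapsed


def classify_stop(elapsed_seconds, timeout_minutes, tolerance_seconds=60.0):
    """Label a measured stop time as 'immediate', 'timeout' or 'slow'.

    'timeout' means the stop took roughly the configured shutdown timeout,
    which is the symptom described above. The tolerance is an arbitrary
    assumption used only for this illustration.
    """
    timeout_seconds = timeout_minutes * 60.0
    if elapsed_seconds >= timeout_seconds - tolerance_seconds:
        return "timeout"
    if elapsed_seconds <= tolerance_seconds:
        return "immediate"
    return "slow"
```

On a node you could then call something like `time_command(["net", "stop", "clussvc"])`, assuming the cluster service is registered under the short name `clussvc` on your hosts, and pass the elapsed time to `classify_stop` together with your own timeout value. Keep in mind that this really stops the cluster service on that node, so the earlier warning applies.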
The nasty RHS timeout issue was a lot harder to troubleshoot. It was not easy to reproduce, nor did we find any evident cause, but since both the failback problem and the RHS crash are timeout issues, we started to work on any dependencies we could find. Reviewing the cluster configuration, we found about 10,000 ghost devices caused by snapshot backups (see KB 982210: The startup time increases or hangs at the logon “Welcome” screen if you frequently backup Hyper-V virtual machines on a Windows Server 2008 R2 system). This is a known issue fixed in SP1, and we mistakenly thought SP1 would also clean up the bogus devices. This is not the case: SP1 does fix the issue, but any ghost device created before SP1 remains in place.
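To get a feeling for how many ghost devices a host carries, one option is to compare a listing of all known devices with a listing of the devices currently present. The sketch below assumes you already have both listings as text, for example from a tool such as devcon (`findall` versus `find`), and that each device line looks like `INSTANCE_ID : description`. That line format and the function names are our assumptions, so adapt the parsing to whatever your tool prints.

```python
def parse_device_ids(listing):
    """Extract device instance IDs from a text listing.

    Assumed format (not taken from the KB): one device per line as
    'INSTANCE_ID : description'. Lines without a colon, such as a
    trailing summary line, are skipped.
    """
    ids = set()
    for line in listing.splitlines():
        instance_id, separator, _description = line.partition(":")
        if not separator:
            continue  # no colon on this line, so it is not a device line
        instance_id = instance_id.strip().upper()
        if instance_id:
            ids.add(instance_id)
    return ids


def find_ghost_devices(all_devices_listing, present_devices_listing):
    """Return the sorted IDs that are known to the system but not present."""
    all_ids = parse_device_ids(all_devices_listing)
    present_ids = parse_device_ids(present_devices_listing)
    return sorted(all_ids - present_ids)
```

This only counts candidates. The actual cleanup should still follow the procedure in KB 982210.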
Our working hypothesis was set. Let’s imagine VSS steps in and starts to enumerate devices and their properties (this could be triggered by VSS itself, by the DPM agent, or by the monitoring tools); with that many devices it will surely take a while. Let’s also imagine this adds to the failback policy issue: the two together could easily exceed the 20-minute timeout RHS applies during an IsAlive check on resources. When RHS times out, the cluster restarts the VMs as a high-availability measure, even though they are in fact working properly, as our monitoring figures show.
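The arithmetic behind this hypothesis is simple enough to write down. In the sketch below only the 20-minute figure comes from our observations; the function names are ours and the individual delays in the usage example are made-up numbers, since we never measured them separately.

```python
RHS_ISALIVE_TIMEOUT_MINUTES = 20  # timeout figure quoted in this post


def total_delay(delays_minutes):
    """Sum the individual delays (in minutes) that stack up during one check."""
    return sum(delays_minutes.values())


def would_time_out(delays_minutes, timeout_minutes=RHS_ISALIVE_TIMEOUT_MINUTES):
    """True when the combined delay exceeds the timeout.

    This is a deliberately naive model of the hypothesis: delays simply add up.
    """
    return total_delay(delays_minutes) > timeout_minutes
```

For example, `would_time_out({"device_enumeration": 12, "failback_stall": 11})` returns `True`, while either delay alone stays under the limit. The values 12 and 11 are purely illustrative.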
So we removed every failback policy and, following KB 982210 (see the previous reference), deleted every ghost device we had. Since that day we haven’t had a single RHS crash.
Our advice: avoid failback policies on any cluster configuration and check for KB 982210 on your Hyper-V hosts, clusters included. You can live without failback with a good monitoring solution such as Operations Manager, and you can automate failback in a much more controlled way using System Center Orchestrator.
Hope this saves you the headaches we had.
This posting is provided "AS IS" with no warranties, and confers no rights.