Failover cluster monitoring – quick insight
While trying to understand why my cluster nodes could not be dismissed, I dug a little deeper into cluster monitoring with OpsMgr. As usual, I have no access to the source code, so some of my assumptions may be wrong.
The core logic behind cluster discovery and management is coded (natively) in mommodules.dll.
The dll exports:
- ClusterGroupStateChange: the name gives us some clues
- ClusterDiscovery: this one is used by the discovery workflow
Basically, every cluster node discovers every resource group (Virtual Server, VS) in the cluster and establishes a relationship of type HealthServiceShouldManageEntity. This tells the OpsMgr infrastructure to route the workflows for any Virtual Server to every cluster node. In this scenario every cluster node receives all the workflows, even for the VSs it does not own at the moment. Obviously we want only the owning node to monitor a given VS. Without some custom logic this would be an issue, and in fact the agent has built-in logic to work out which VSs it is supposed to manage (i.e. the ones it owns). On the passive nodes the workflows get unloaded.
The second issue is the management of VS failover (i.e. when a resource group changes owning node). From what I understand, the agent uses ClusterGroupStateChange to detect when a VS changes ownership. I measured a maximum delay of 60 seconds from resource group failover to the workflows being reloaded on the proper node. So far so good: the agent (as we expect) is able to manage the VS wherever it is. I had a couple of cases on SP1 where this did not work properly, and they were resolved by restarting the health service.
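The behaviour described above can be pictured with a toy model. This is only a sketch of my understanding, not the agent's code: the NodeAgent class, the notify_all helper and the node and VS names are all made up for illustration, and the real logic lives inside mommodules.dll.

```python
from dataclasses import dataclass, field


@dataclass
class NodeAgent:
    """Toy stand-in for the health service on one cluster node (hypothetical)."""
    name: str
    # Virtual servers whose workflows are currently loaded on this node.
    loaded: set = field(default_factory=set)

    def on_group_state_change(self, owners: dict) -> None:
        # Every node is told about every virtual server, but it keeps
        # workflows loaded only for the ones it currently owns.
        self.loaded = {vs for vs, owner in owners.items() if owner == self.name}


def notify_all(agents: list, owners: dict) -> None:
    """Simulate the state-change notification reaching every node."""
    for agent in agents:
        agent.on_group_state_change(owners)


# Hypothetical two-node cluster with two resource groups.
owners = {"VS-SQL": "NODE1", "VS-FILE": "NODE2"}
agents = [NodeAgent("NODE1"), NodeAgent("NODE2")]
notify_all(agents, owners)

# Fail VS-SQL over to NODE2: NODE1 unloads its workflows, NODE2 loads them.
owners["VS-SQL"] = "NODE2"
notify_all(agents, owners)
for agent in agents:
    print(agent.name, sorted(agent.loaded))
```

In the real agent the reload is not instantaneous, which is where the delay I measured comes from.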
One more thing to add: the VSs are managed by the health service as proxied systems. This has an important implication if you're an MP author: every workflow you want to run against a VS must be tagged as Remotable="true".
If you miss this requirement, you'll get event ID 1207 after the agent reloads the workflows:
Event Type: Warning
Event Source: HealthService
Event Category: Health Service
Event ID: 1207
Date: 10/17/2009
Time: 3:01:23 PM
User: N/A
Computer: ARES1
Description:
Rule/Monitor "QND.Test.Cluster.LogEvent" running for remote instance "XXXX.progel.org" with id:"{A16D2CDA-378D-E9AC-7913-404A9999BEEE}" will be disabled as it is not remotable. Management group "Progel Labs".
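To catch this before the agent does, an MP author could scan the unsealed management pack XML for workflows that are not remotable. The following is a minimal sketch with several assumptions on my part: that the attribute is spelled Remotable and sits on the Rule, UnitMonitor and Discovery elements, that the XML has no default namespace, and that the sample fragment (including the second rule ID) is purely illustrative. How a missing attribute should be treated is left as a parameter.

```python
import xml.etree.ElementTree as ET

# Element names assumed to carry the Remotable attribute.
WORKFLOW_TAGS = {"Rule", "UnitMonitor", "Discovery"}


def find_non_remotable(mp_xml: str, missing_is_remotable: bool = True) -> list:
    """Return the IDs of workflows that would not run against a proxied VS."""
    root = ET.fromstring(mp_xml)
    flagged = []
    for elem in root.iter():
        if elem.tag not in WORKFLOW_TAGS:
            continue
        value = elem.get("Remotable")
        if value is None:
            # Attribute absent: fall back to the caller's choice.
            remotable = missing_is_remotable
        else:
            remotable = value.strip().lower() in ("true", "1")
        if not remotable:
            flagged.append(elem.get("ID"))
    return flagged


# Illustrative fragment only; a real MP has many more elements and attributes.
SAMPLE = """<ManagementPack>
  <Monitoring>
    <Rules>
      <Rule ID="QND.Test.Cluster.LogEvent" Enabled="true" Remotable="false" />
      <Rule ID="QND.Test.Cluster.Other" Enabled="true" Remotable="true" />
    </Rules>
  </Monitoring>
</ManagementPack>"""

print(find_non_remotable(SAMPLE))  # ['QND.Test.Cluster.LogEvent']
```

Any ID this returns is a candidate for event 1207 once the workflow is routed to a clustered agent.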
That's it: straightforward, but useful if you need to debug issues on your clustered agents.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
Your reverse engineering studies and the insights you give (yes without source code) are useful to all people in the community. You know what? They are useful even to field people in Microsoft who don’t have source code access either ;-)
Daniele Muscetta
November 8, 2009 at 9:48 am