Failover cluster monitoring – quick insight

While I was trying to understand why my cluster nodes wouldn't dismiss, I dug a little deeper into cluster monitoring with OpsMgr. As usual, I have no access to the source code, so some of my assumptions may be wrong.

The core logic behind cluster discovery and management is coded (natively) in mommodules.dll.


The DLL exports two relevant functions:

  • ClusterGroupStateChange – the name gives us some clues
  • ClusterDiscovery – this one is used by the discovery workflow

Basically, every cluster node discovers every resource group (Virtual Server) in the cluster and establishes a relationship of type HealthServiceShouldManageEntity. This tells the OpsMgr infrastructure to route the workflows for any Virtual Server to every cluster node, so every node receives all the workflows, even for Virtual Servers it doesn't currently own. Obviously we want only the owning node to monitor each Virtual Server, so without some custom logic here we would have an issue. In fact, the agent has built-in logic to understand which Virtual Servers it is supposed to manage (i.e. the ones it owns): on the passive nodes the workflows get unloaded.

The second issue to face is the management of Virtual Server failover (i.e. when a resource group changes owning node). From what I understand, the agent uses ClusterGroupStateChange to detect when a Virtual Server changes ownership; I measured a maximum delay of 60 seconds from resource group failover to workflow reload on the new owning node. So far so good: the agent, as we expect, is able to manage the Virtual Server wherever it is. I had a couple of cases on SP1 where this was not working properly, and restarting the health service resolved it.
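The behavior described above can be modeled with a small sketch. This is a toy illustration of the observed routing and load/unload logic, not the real agent code; all class and function names are hypothetical.

```python
class ClusterNode:
    """Toy model of a cluster node running the OpsMgr agent."""

    def __init__(self, name):
        self.name = name
        self.loaded_workflows = set()

    def receive_workflows(self, virtual_server, owned):
        # Every node receives the workflows for every Virtual Server
        # (HealthServiceShouldManageEntity routes them everywhere),
        # but only the owning node keeps them loaded; passive nodes
        # unload them.
        if owned:
            self.loaded_workflows.add(virtual_server)
        else:
            self.loaded_workflows.discard(virtual_server)


class Cluster:
    """Toy model of workflow routing across cluster nodes."""

    def __init__(self, node_names):
        self.nodes = {n: ClusterNode(n) for n in node_names}
        self.owner = {}  # virtual server -> owning node name

    def route(self, virtual_server):
        # Route the workflows for a VS to every node; each node
        # decides whether to load or unload them.
        for name, node in self.nodes.items():
            node.receive_workflows(
                virtual_server, self.owner.get(virtual_server) == name
            )

    def discover(self, virtual_server, owner):
        self.owner[virtual_server] = owner
        self.route(virtual_server)

    def failover(self, virtual_server, new_owner):
        # Models a ClusterGroupStateChange notification: ownership
        # changes and (within ~60 seconds in my tests) the workflows
        # reload on the new owning node.
        self.owner[virtual_server] = new_owner
        self.route(virtual_server)


cluster = Cluster(["ARES1", "ARES2"])
cluster.discover("SQLVS1", owner="ARES1")
cluster.failover("SQLVS1", new_owner="ARES2")
print(cluster.nodes["ARES1"].loaded_workflows)  # set()
print(cluster.nodes["ARES2"].loaded_workflows)  # {'SQLVS1'}
```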

One more thing to add: the Virtual Servers are managed by the health service as proxied systems. This has an important implication if you're an MP author: all the workflows you want to execute against a Virtual Server must be tagged as remotable="true".
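In management pack XML, this flag is the Remotable attribute on the rule (or monitor) definition. A minimal sketch of what such a rule might look like, reusing the rule name from the event log entry in this post; the target class name is hypothetical:

```xml
<!-- Sketch of a rule definition inside a management pack, not a
     complete MP. QND.Test.Cluster.VirtualServer is a hypothetical
     target class. -->
<Rule ID="QND.Test.Cluster.LogEvent" Enabled="true"
      Target="QND.Test.Cluster.VirtualServer"
      ConfirmDelivery="false" Remotable="true"
      Priority="Normal" DiscardLevel="100">
  <Category>Custom</Category>
  <DataSources>
    <!-- data source module(s) here -->
  </DataSources>
  <WriteActions>
    <!-- write action module(s) here -->
  </WriteActions>
</Rule>
```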


If you miss this important requirement, you'll get event ID 1207 after the agent reloads the workflows:

Event Type: Warning
Event Source: HealthService
Event Category: Health Service
Event ID: 1207
Date: 10/17/2009
Time: 3:01:23 PM
User: N/A
Computer: ARES1

Rule/Monitor "QND.Test.Cluster.LogEvent" running for remote instance "" with id:"{A16D2CDA-378D-E9AC-7913-404A9999BEEE}" will be disabled as it is not remotable. Management group "Progel Labs".

This is it: straightforward, but useful if you need to debug issues on your clustered agents.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  1. #1 by Daniele Muscetta on November 8, 2009 - 9:48 am

    Your reverse engineering studies and the insights you give (yes without source code) are useful to all people in the community. You know what? They are useful even to field people in Microsoft who don’t have source code access either ;-)
