Failover cluster monitoring – quick insight


While I was trying to understand why my cluster nodes wouldn't dismiss, I dug a little deeper into cluster monitoring with OpsMgr. As usual, I have no access to the source code, so I may be wrong on some assumptions.

The core logic behind cluster discovery and management is implemented natively in mommodules.dll.


The DLL exports:

  • ClusterGroupStateChange – the name gives us some clues
  • ClusterDiscovery – this one is used by the discovery workflow

Basically, every cluster node discovers every resource group (Virtual Server) in the cluster and establishes a relationship of type HealthServiceShouldManageEntity. This tells the OpsMgr infrastructure to route the workflows for every Virtual Server to every cluster node. In this scenario, each node receives all the workflows, even for Virtual Servers it does not currently own. Obviously we want only the owning node to monitor a given Virtual Server, so without some custom logic here we would have an issue. In fact, the agent has built-in logic to understand which Virtual Servers it is supposed to manage (i.e. the ones it owns); on the passive nodes, the workflows get unloaded.

The second issue to face is the management of Virtual Server failover (i.e. when a resource group changes owning node). From what I understand, the agent uses ClusterGroupStateChange to detect when a Virtual Server changes ownership; I measured a maximum delay of about 60 seconds from resource group failover to workflow reload on the proper node. So far so good: the agent, as we expect, is able to manage the Virtual Server wherever it is. I had a couple of cases on SP1 where this was not working properly, and it was resolved by restarting the health service.
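For reference, the relationship type the discovery creates lives in the Microsoft.SystemCenter.Library management pack. A sketch of its definition follows (reproduced from memory, so attribute details and the exact base type may differ slightly from the shipped MP):

```xml
<!-- Sketch of the relationship type from Microsoft.SystemCenter.Library
     (from memory; details may differ slightly in the actual MP) -->
<RelationshipType ID="Microsoft.SystemCenter.HealthServiceShouldManageEntity"
                  Accessibility="Public" Abstract="false"
                  Base="System!System.Reference">
  <Source>Microsoft.SystemCenter.HealthService</Source> <!-- the node's agent -->
  <Target>System!System.Entity</Target>                 <!-- e.g. a Virtual Server -->
</RelationshipType>
```

Since the target is the generic System.Entity, the same mechanism is used for any proxied object an agent should manage, not just cluster Virtual Servers.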

One more thing to add: the Virtual Servers are managed by the health service as proxied systems. This has an important implication if you're an MP author: all the workflows you want to execute against a Virtual Server must be tagged as Remotable="true".
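In MP XML this is just an attribute on the workflow element. A minimal sketch of a rule carrying it is below; the rule ID is taken from the event later in this post, while the target class and the module contents are hypothetical placeholders, not real MP elements:

```xml
<!-- Illustrative sketch: the point is the Remotable="true" attribute.
     The Target class and module bodies are placeholders, not a real MP. -->
<Rule ID="QND.Test.Cluster.LogEvent" Enabled="true"
      Target="QND.Test.Cluster.VirtualServer" Remotable="true">
  <Category>Custom</Category>
  <DataSources>
    <!-- data source module(s) for the workflow -->
  </DataSources>
  <WriteActions>
    <!-- write action module(s) -->
  </WriteActions>
</Rule>
```

The same attribute applies to monitors and other workflows you target at the Virtual Server class.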


If you miss this important requirement, you'll get event ID 1207 after the agent reloads the workflows:

Event Type: Warning
Event Source: HealthService
Event Category: Health Service
Event ID: 1207
Date: 10/17/2009
Time: 3:01:23 PM
User: N/A
Computer: ARES1
Description:
Rule/Monitor "QND.Test.Cluster.LogEvent" running for remote instance "XXXX.progel.org" with id:"{A16D2CDA-378D-E9AC-7913-404A9999BEEE}" will be disabled as it is not remotable. Management group "Progel Labs".

That's it: straightforward, but useful if you need to debug issues on your clustered agents.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  1. #1 by Daniele Muscetta on November 8, 2009 - 9:48 am

    Your reverse engineering studies and the insights you give (yes without source code) are useful to all people in the community. You know what? They are useful even to field people in Microsoft who don’t have source code access either ;-)
