Quae Nocent Docent

What hurts, teaches – Ordinary tales from management trenches

Failover cluster monitoring – quick insight

While trying to understand why my cluster nodes wouldn't dismiss, I dug a little deeper into cluster monitoring with OpsMgr. As usual, I have no access to the source code, so some of my assumptions may be wrong.

The core logic behind cluster discovery and management is coded (natively) in mommodules.dll.

The dll exports:

  • ClusterGroupStateChange: the name gives us some clues
  • ClusterDiscovery: this one is used by the discovery workflow

Basically, every cluster node discovers every resource group (Virtual Server, VS) in the cluster and establishes a relationship of type HealthServiceShouldManageEntity. This tells the OpsMgr infrastructure to route the workflows for any Virtual Server to every cluster node, so every node receives all the workflows, even for the VSs it does not own at the moment.

Obviously, we only want the owning node to monitor a given VS. Without some custom logic this would be a problem, but the agent has built-in logic to work out which VSs it is supposed to manage (i.e. the ones it owns). On the passive nodes the workflows get unloaded.

The second issue is handling VS failover (i.e. when a resource group changes its owning node). From what I understand, the agent uses ClusterGroupStateChange to detect when a VS changes ownership. I measured a maximum delay of 60 seconds between the resource group failover and the workflows being reloaded on the proper node. So far so good: the agent (as we expect) is able to manage the VS wherever it is. I had a couple of cases where this did not work properly on SP1, and restarting the health service resolved it.
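To make the behaviour described above concrete, here is a toy model in Python. It is only a sketch of the idea (every node knows about every VS, only the owner keeps the workflows loaded) and not the agent's actual implementation, which I cannot see. The Node class, apply_ownership function, and the node and VS names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a cluster node's agent (hypothetical, not an OpsMgr API)."""
    name: str
    loaded: set = field(default_factory=set)  # VSs whose workflows are loaded

    def on_ownership(self, owners: dict) -> None:
        # Every node is told about every virtual server, but it keeps
        # workflows loaded only for the ones it currently owns.
        self.loaded = {vs for vs, owner in owners.items() if owner == self.name}


def apply_ownership(nodes, owners):
    """Push the current 'resource group -> owning node' map to every node."""
    for node in nodes:
        node.on_ownership(owners)


# Hypothetical two-node cluster with two virtual servers.
nodes = [Node("NODE1"), Node("NODE2")]
owners = {"VS-SQL": "NODE1", "VS-FILE": "NODE2"}
apply_ownership(nodes, owners)
print({n.name: sorted(n.loaded) for n in nodes})

# Simulate a failover: VS-SQL moves to NODE2, so NODE1 unloads its workflows.
owners["VS-SQL"] = "NODE2"
apply_ownership(nodes, owners)
print({n.name: sorted(n.loaded) for n in nodes})
```

In the real agent the ownership change is presumably signalled through ClusterGroupStateChange rather than pushed by a helper function; the model only shows the resulting load/unload pattern.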

One more thing: the health service manages the VSs as proxied systems. This has an important implication if you're an MP author: every workflow you want to run against a VS must be tagged Remotable="true".
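If you want to check a management pack for this before deploying it, something like the following sketch can help. It assumes the attribute is spelled Remotable on Rule and UnitMonitor elements in the MP XML; the embedded fragment is a heavily trimmed, hypothetical example (only the first rule ID comes from my test MP, the targets and other IDs are invented), and remotable_report is just a helper name I picked. Workflows without the attribute are reported as "unspecified" rather than guessed at.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed management pack fragment, for illustration only.
SAMPLE_MP = """
<ManagementPack>
  <Monitoring>
    <Rules>
      <Rule ID="QND.Test.Cluster.LogEvent" Enabled="true"
            Target="Example.VirtualServer" Remotable="false"/>
      <Rule ID="QND.Test.Cluster.Other" Enabled="true"
            Target="Example.VirtualServer" Remotable="true"/>
      <Rule ID="QND.Test.Cluster.Unset" Enabled="true"
            Target="Example.VirtualServer"/>
    </Rules>
  </Monitoring>
</ManagementPack>
"""

# Element names assumed to carry the Remotable attribute.
WORKFLOW_TAGS = {"Rule", "UnitMonitor"}


def remotable_report(mp_xml: str) -> dict:
    """Map each workflow ID to its Remotable value, or 'unspecified' if absent."""
    root = ET.fromstring(mp_xml.strip())
    report = {}
    for elem in root.iter():
        if elem.tag in WORKFLOW_TAGS:
            report[elem.get("ID")] = elem.get("Remotable", "unspecified")
    return report


report = remotable_report(SAMPLE_MP)
for workflow_id, value in report.items():
    print(f"{workflow_id}: Remotable={value}")
```

Anything that comes back as "false" and targets a Virtual Server class is a candidate for the warning event described in this post.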

If you miss this requirement, you'll get event ID 1207 after the agent reloads the workflows:

Event Type: Warning
Event Source: HealthService
Event Category: Health Service
Event ID: 1207
Date: 10/17/2009
Time: 3:01:23 PM
User: N/A
Computer: ARES1
Description:
Rule/Monitor "QND.Test.Cluster.LogEvent" running for remote instance "XXXX.progel.org" with id:"{A16D2CDA-378D-E9AC-7913-404A9999BEEE}" will be disabled as it is not remotable. Management group "Progel Labs".
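When there are many of these events, it can be handy to pull the workflow name and instance out of the description text. The sketch below does that with a regular expression built from the single event shown above; I am assuming other 1207 descriptions follow the same wording, which I have not verified. The parse_1207 function and the field names are my own.

```python
import re

# Pattern derived from the one 1207 description quoted in this post.
PATTERN = re.compile(
    r'Rule/Monitor "(?P<workflow>[^"]+)" running for remote instance '
    r'"(?P<instance>[^"]+)" with id:"(?P<instance_id>\{[0-9A-Fa-f-]+\})" '
    r'will be disabled as it is not remotable\. '
    r'Management group "(?P<management_group>[^"]+)"\.'
)


def parse_1207(description: str):
    """Return the fields of a 1207 description as a dict, or None if no match."""
    match = PATTERN.search(description)
    return match.groupdict() if match else None


sample = (
    'Rule/Monitor "QND.Test.Cluster.LogEvent" running for remote instance '
    '"XXXX.progel.org" with id:"{A16D2CDA-378D-E9AC-7913-404A9999BEEE}" '
    'will be disabled as it is not remotable. Management group "Progel Labs".'
)

parsed = parse_1207(sample)
print(parsed)
```

The workflow field is the rule or monitor to fix in the MP; the instance field tells you which Virtual Server it was running against.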

That's it: straightforward, but useful if you need to debug issues on your clustered agents.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

November 7, 2009 at 11:57 am

Posted in Failover cluster, SCOM

One Response

  1. Your reverse engineering studies and the insights you give (yes without source code) are useful to all people in the community. You know what? They are useful even to field people in Microsoft who don’t have source code access either ;-)

    Daniele Muscetta

    November 8, 2009 at 9:48 am

