R2 and cluster monitoring – issues and pitfalls

I think it’s time to wrap up the various answers we can find on cluster monitoring and decommissioning in R2.

Failover clustering is a high availability solution, and my motto is “if you want to keep it HA you need to monitor it”. So we all expect OpsMgr to do an excellent job of monitoring failover clusters. In some respects it does, but you must be aware of glitches that are still around.

I observed at least 4 issues on failover cluster monitoring:

  1. abnormal CPU usage on the cluster node and on the controlling management server
  2. simply put, you cannot decommission a cluster in a supported way
  3. the agentless managed view in the administration space of the monitoring console is just useless
  4. cluster discovery is noisy

Abnormal CPU usage is caused by internal tasks taking the wrong route. In some cases the config service on the RMS thinks resources are on the wrong node (i.e. a node that doesn’t own the resources) and sends health recalculation tasks to that node. In these cases you’ll see the following:

  • healthservice edb grows (we had examples of > 3 GB); this growth is caused by tasks queuing up
  • healthservice.exe on the affected node consumes up to one cpu core
  • healthservice.exe on the primary management server registers an increased cpu usage

We’re waiting for a fix from PSS. At the moment the only workaround is a group failover, but hey, I’m supposed to run critical applications on my clusters, so a failover cannot be considered a workaround. Btw, things can turn bad again after a while, so I would call this a temporary patch that will break again sooner or later.

Decommissioning is another issue. The short story is that you cannot uninstall the agents from the cluster nodes, and you cannot remove them from your console either. In other words, you have zombies in the console.

There’s an *unsupported* workaround here http://ops-mgr.spaces.live.com/blog/cns!3D3B8489FCAA9B51!163.entry and a discussion thread here http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/10ffee08-b875-47af-b788-db07dbfa1b56.

See my own unsupported way later in this post; I’m not comfortable with the workaround cited above.

I want to say that the following rapid publishing KB just doesn’t work in my case (and from what I found it should never work): OpsMgr 2007: How to decommission a cluster monitored by System Center Operations Manager 2007

I want to add that the release notes are not clear either:

“You cannot uninstall or delete an agent from a node in a cluster

When you try to uninstall or delete an agent from a node in cluster, the following error message is displayed:

Agent is managing other devices and cannot be uninstalled. Please resolve this issue via Agentless managed view in Administration prior to attempting uninstall again.

Notice that the agent can be uninstalled from the node that is agentlessly managing the virtual servers. However, the agent cannot be uninstalled from the node that is managing the virtual servers.

Workaround: None at this time.” http://technet.microsoft.com/en-us/library/dd827187.aspx

The agentless managed view is useless in that the results you’re seeing are just unpredictable and the change proxy action won’t work. This is all related to how agentless management works for the UI.

An agentless managed Windows Computer (this is the class) is defined as a Windows Computer managed by a healthservice on another Windows Computer (using the HealthServiceShouldManageEntity relationship). This is the query the SDK runs against your live db, where the guid is the id for Microsoft.Windows.Computer:

exec sp_executesql N'-- AgentlessManagedDevicesByType <ManagedTypeId>
SELECT [T].[Id], [T].[Name], [T].[Path], [T].[FullName], [T].[DisplayName], [T].[IsManaged], [T].[IsDeleted], [T].[LastModified], [T].[TypedManagedEntityId], [T].[MonitoringClassId], [T].[TypedMonitoringObjectIsDeleted], [T].[HealthState], [T].[StateLastModified], [T].[IsAvailable], [T].[AvailabilityLastModified], [T].[InMaintenanceMode], [T].[MaintenanceModeLastModified], [PXH].[BaseManagedEntityId] AS [HealthServiceId], [PXH].[DisplayName] AS [ProxyAgentPrincipalName]
FROM dbo.ManagedEntityGenericView AS T
INNER JOIN dbo.BaseManagedEntity AS BME
    ON BME.[BaseManagedEntityId] = T.[Id]
    AND BME.[BaseManagedTypeId] = @ManagedTypeId
INNER JOIN dbo.Relationship AS R
    ON R.[TargetEntityId] = T.[Id]
INNER JOIN dbo.BaseManagedEntity AS PXH
    ON PXH.[BaseManagedEntityId] = R.[SourceEntityId]
WHERE ((
    T.[IsDeleted] = 0 AND T.[TypedMonitoringObjectIsDeleted] = 0 AND R.[IsDeleted] = 0 AND
    R.[RelationshipTypeId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
))',N'@ManagedTypeId uniqueidentifier',@ManagedTypeId='EA99500D-8D52-FC52-B5A5-10DCD1E9D2BD'

What happens for clusters is that we have two HealthServiceShouldManageEntity relationships discovered. One comes from an internal discovery source (in my case with guid “85AB926D-6E0F-4B36-A951-77CCD4399681”); I would call it the standard one, i.e. the discovery source the SDK call SetProxyAgent works with. The other is discovered by the cluster management pack through the discovery rule “Microsoft.Windows.Cluster.Classes.Discovery”. This discovery is implemented as a native COM module in MOMModules.dll.

On top of that, every cluster node discovers the entire hierarchy of cluster resources (a huge load on every node for complex cluster implementations), so every cluster virtual server is managed by every single node. Let’s take a simple example: a basic two node cluster (nodeA and nodeB) with just the cluster virtual server (CLUSTER). The Microsoft.Windows.Computer CLUSTER is discovered by both nodes, so nodeA and nodeB each have a HealthServiceShouldManageEntity relationship with CLUSTER with source Microsoft.Windows.Cluster.Classes.Discovery and another one with the standard discovery source.

So what happens here is that every cluster node ShouldManage every cluster virtual server, and for this reason the proxy assignment you see in the console is just unpredictable. The selection is based on the HealthServiceShouldManageEntity relationship, which returns one row for each cluster node for each cluster virtual server.

Using SetProxyAgent from the SDK (or change proxy in the UI) is useless because it resets just the standard discovery source and not the Microsoft.Windows.Cluster.Classes.Discovery one.

The root cause of your inability to delete first the virtual servers and then the cluster nodes is this discovered relationship. So if you delete the relationship you’ll find your way home.
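To make the reasoning above easier to follow, here is a minimal toy model of the two-node example in Python. It is only a sketch and does not touch the OpsMgr database: the dictionary layout, the function names (discover, set_proxy_agent, managing_nodes, remove_cluster_source) and the label used for the standard source are all made up for illustration, and the behaviour of the proxy change is modelled purely from the description in this post.

```python
# Toy model of HealthServiceShouldManageEntity relationships.
# Hypothetical names throughout; this is not the OpsMgr schema or SDK.

STANDARD = "standard discovery source"  # stand-in for the internal source
CLUSTER_MP = "Microsoft.Windows.Cluster.Classes.Discovery"


def discover(nodes, virtual_servers):
    """Every node discovers every virtual server through both sources."""
    return {(n, vs): {STANDARD, CLUSTER_MP}
            for n in nodes for vs in virtual_servers}


def set_proxy_agent(rels, vs, node):
    """Assumed behaviour of a proxy change: only the standard source moves."""
    for (_, target), sources in rels.items():
        if target == vs:
            sources.discard(STANDARD)
    rels.setdefault((node, vs), set()).add(STANDARD)


def managing_nodes(rels, vs):
    """Nodes that still have a live relationship to the virtual server."""
    return sorted(n for (n, target), sources in rels.items()
                  if target == vs and sources)


def remove_cluster_source(rels, node):
    """Drop the cluster MP source from every relationship of one node."""
    for (n, _), sources in rels.items():
        if n == node:
            sources.discard(CLUSTER_MP)


rels = discover(["nodeA", "nodeB"], ["CLUSTER"])
print(managing_nodes(rels, "CLUSTER"))   # both nodes manage CLUSTER

set_proxy_agent(rels, "CLUSTER", "nodeA")
print(managing_nodes(rels, "CLUSTER"))   # nodeB is still there

remove_cluster_source(rels, "nodeB")
print(managing_nodes(rels, "CLUSTER"))   # only nodeA is left
```

In this model a relationship survives as long as at least one discovery source still contributes to it, which is why changing the proxy alone never detaches nodeB from CLUSTER.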

Here is a SQL snippet that given the cluster node name (FQDN) will delete all the relationships to cluster virtual servers:

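declare @nodeHS nvarchar(255)
-- replace FQDN with the fully qualified name of the cluster node
set @nodeHS = N'Microsoft.SystemCenter.HealthService:FQDN'

-- every live HealthServiceShouldManageEntity relationship that starts from this node's health service
declare Rel_Cursor cursor for
(SELECT [RelationshipGenericView].[Id]
FROM dbo.RelationshipGenericView
WHERE ((RelationshipGenericView.[MonitoringRelationshipClassId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()) AND (((dbo.[RelationshipGenericView].[IsDeleted] = 0))))
AND ([RelationshipGenericView].[SourceMonitoringObjectId]
    IN (select BaseManagedEntityId from BaseManagedEntity where FullName = @nodeHS)))

OPEN Rel_Cursor;

declare @relId uniqueidentifier
declare @discoSource uniqueidentifier
declare @now datetime
set @now = GETUTCDATE()

FETCH NEXT FROM Rel_Cursor INTO @relId;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- reset so a relationship without the cluster MP source does not reuse the previous value
    SET @discoSource = NULL

    SELECT @discoSource = DSTR.[DiscoverySourceId]
    --SELECT *
    FROM dbo.[DiscoverySourceToRelationship] DSTR
        inner join dbo.[DiscoverySource] DS on DS.DiscoverySourceId = DSTR.DiscoverySourceId
        inner join Discovery on Discovery.DiscoveryId = DS.DiscoveryRuleId
    WHERE [DiscoveryName] = 'Microsoft.Windows.Cluster.Classes.Discovery'
        AND [RelationshipId] = @relId
        AND DSTR.[IsDeleted] = 0

    -- the argument list was lost from this listing; passing the three variables declared above
    -- is an assumption, so check the procedure's parameters in your database before running
    IF @discoSource IS NOT NULL
        exec dbo.p_RemoveRelationshipFromDiscoverySourceScope @relId, @discoSource, @now

    FETCH NEXT FROM Rel_Cursor INTO @relId;
END

CLOSE Rel_Cursor;
DEALLOCATE Rel_Cursor;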


After executing this SQL statement the cluster node and virtual servers can be deleted from the Operations Console. Obviously this is totally unsupported. So do this at your own risk, but if you followed my analysis it is clear it should be safe.

Cluster discovery is noisy: this means too many 21025 events (i.e. agent reloads), and the way it is performed on every node can have a serious impact on cluster nodes. While I was about to post the details of the noisy discovery (starting with the MAC address mismatch with OS discovery), a new cluster MP was released (6.0.6720.0), which promises to solve a lot of the issues I detected in previous versions. So this one is strongly recommended. Btw, it adds support for Windows Server 2008 R2 failover clusters (yes, CSV included).

– Daniele

This posting is provided “AS IS” with no warranties, and confers no rights.

  1. #1 by Jerry Ko on November 2, 2012 - 4:21 am

    Thanks so much. It is workable for me.
