R2 and cluster monitoring – issues and pitfalls


I think it’s time to wrap up the various answers we can find on cluster monitoring and decommissioning in R2.

Failover clustering is a high availability solution and my motto is “if you want to keep it HA you need to monitor it”. So we all expect OpsMgr to do an excellent job in monitoring failover clusters. In some respects it does, but you must be aware of glitches that are still around.

I observed at least four issues with failover cluster monitoring:

  1. abnormal CPU usage on the cluster node and on the controlling management server
  2. simply put, you cannot decommission a cluster in a supported way
  3. the Agentless Managed view in the Administration space of the console is just useless
  4. cluster discovery is noisy

Abnormal CPU usage is caused by internal tasks taking the wrong route. In some cases the config service on the RMS thinks resources are on the wrong node (i.e. a node that doesn’t own the resources) and sends health recalculation tasks to that node. In these cases you’ll see the following:

  • the healthservice edb grows (we had examples of > 3 GB); this growth is caused by tasks queuing up
  • healthservice.exe on the affected node consumes up to one CPU core
  • healthservice.exe on the primary management server shows increased CPU usage

We’re waiting for a fix from PSS; for the time being the only workaround is a group failover. But hey, I’m supposed to run critical applications on my clusters, so a failover cannot be considered a workaround. Btw, things can turn bad again after a while, so I would call this a temporary patch that will break again sooner or later.

Decommissioning is another issue. The short story is that you can remove the agents from the cluster nodes, but you cannot remove them from your console. In other words, you have zombies in the console.

There’s an *unsupported* workaround here http://ops-mgr.spaces.live.com/blog/cns!3D3B8489FCAA9B51!163.entry and a discussion thread here http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/10ffee08-b875-47af-b788-db07dbfa1b56.

See my own unsupported way later in this post; I’m not comfy with the workaround cited above.

I want to say that the following rapid publishing KB just doesn’t work in my case (and from what I found it should never work): OpsMgr 2007: How to decommission a cluster monitored by System Center Operations Manager 2007

I want to add that the release notes are not clear either:

“You cannot uninstall or delete an agent from a node in a cluster

When you try to uninstall or delete an agent from a node in cluster, the following error message is displayed:

Agent is managing other devices and cannot be uninstalled. Please resolve this issue via Agentless managed view in Administration prior to attempting uninstall again.

Notice that the agent can be uninstalled from the node that is agentlessly managing the virtual servers. However, the agent cannot be uninstalled from the node that is managing the virtual servers.

Workaround: None at this time.” http://technet.microsoft.com/en-us/library/dd827187.aspx

The Agentless Managed view is useless in the sense that the results you see are unpredictable and the change proxy action doesn’t work. This is all related to how agentless management works for the UI.

An agentless managed Windows Computer (this is the class) is defined as a Windows Computer managed by a HealthService on another Windows Computer (using the HealthServiceShouldManageEntity relationship). This is the query the SDK runs against your live DB, where the GUID is the ID for Microsoft.Windows.Computer:

exec sp_executesql N'-- AgentlessManagedDevicesByType <ManagedTypeId>
SELECT [T].[Id], [T].[Name], [T].[Path], [T].[FullName], [T].[DisplayName],
    [T].[IsManaged], [T].[IsDeleted], [T].[LastModified], [T].[TypedManagedEntityId],
    [T].[MonitoringClassId], [T].[TypedMonitoringObjectIsDeleted], [T].[HealthState],
    [T].[StateLastModified], [T].[IsAvailable], [T].[AvailabilityLastModified],
    [T].[InMaintenanceMode], [T].[MaintenanceModeLastModified],
    [PXH].[BaseManagedEntityId] AS [HealthServiceId],
    [PXH].[DisplayName] AS [ProxyAgentPrincipalName]
FROM dbo.ManagedEntityGenericView AS T
INNER JOIN dbo.BaseManagedEntity AS BME
    ON BME.[BaseManagedEntityId] = T.[Id]
    AND BME.[BaseManagedTypeId] = @ManagedTypeId
INNER JOIN dbo.Relationship AS R
    ON R.[TargetEntityId] = T.[Id]
INNER JOIN dbo.BaseManagedEntity AS PXH
    ON PXH.[BaseManagedEntityId] = R.[SourceEntityId]
WHERE ((
    T.[IsDeleted] = 0 AND T.[TypedMonitoringObjectIsDeleted] = 0 AND R.[IsDeleted] = 0 AND
    R.[RelationshipTypeId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
))',N'@ManagedTypeId uniqueidentifier',@ManagedTypeId='EA99500D-8D52-FC52-B5A5-10DCD1E9D2BD'
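If you prefer not to copy the GUID, it can be looked up by class name. This is a minimal sketch that assumes the dbo.ManagedType table exposes ManagedTypeId and TypeName columns; check the column names in your own database before relying on it:

```sql
-- Look up the value passed as @ManagedTypeId in the query above.
-- Assumption: dbo.ManagedType has ManagedTypeId and TypeName columns.
SELECT [ManagedTypeId], [TypeName]
FROM dbo.ManagedType
WHERE [TypeName] = N'Microsoft.Windows.Computer'
```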

What happens for clusters is that two HealthServiceShouldManageEntity relationships are discovered. One comes from an internal discovery source (in my case with GUID “85AB926D-6E0F-4B36-A951-77CCD4399681”); I would call it the standard one, i.e. the discovery source the SDK call SetProxyAgent works with. The other is discovered by the cluster management pack through the discovery rule “Microsoft.Windows.Cluster.Classes.Discovery”. This discovery is implemented as a native COM module in MOMModules.dll.

On top of that, every cluster node discovers the entire hierarchy of cluster resources (a huge load on every node for complex cluster implementations), so every cluster virtual server is managed by every single node. Let’s take a simple example: a basic two-node cluster (nodeA and nodeB) with just the cluster virtual server (CLUSTER). The Microsoft.Windows.Computer CLUSTER is discovered by both nodes, so nodeA and nodeB each have a HealthServiceShouldManageEntity relationship with CLUSTER from the source Microsoft.Windows.Cluster.Classes.Discovery and another one from the standard discovery source.

So what happens here is that every cluster node ShouldManage every cluster virtual server; for this reason the proxy assignment you see in the console is just unpredictable. The selection is based on the HealthServiceShouldManageEntity relationship, which returns one row for each cluster node for each cluster virtual server.

Using SetAgentProxy from the SDK (or change proxy in the UI) is useless because it resets just the standard discovery source and not the Microsoft.Windows.Cluster.Classes.Discovery one.

The root cause of your inability to delete first the virtual servers and then the cluster nodes is this discovered relationship. So if you delete the relationship you’ll find your way home.
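To see this in your own management group, the relationship rows can be listed directly. The sketch below reuses only the tables, columns and function that appear in the SDK query shown earlier; it is read-only and should be treated as a starting point rather than a verified diagnostic:

```sql
-- One row per (managed Windows Computer, managing HealthService) pair.
-- Read-only: nothing is modified. Deleted entities are not filtered out here.
SELECT T.[DisplayName]   AS ManagedComputer,
       PXH.[DisplayName] AS ManagingHealthService
FROM dbo.Relationship AS R
INNER JOIN dbo.BaseManagedEntity AS T
    ON T.[BaseManagedEntityId] = R.[TargetEntityId]
    AND T.[BaseManagedTypeId] = 'EA99500D-8D52-FC52-B5A5-10DCD1E9D2BD' -- Microsoft.Windows.Computer
INNER JOIN dbo.BaseManagedEntity AS PXH
    ON PXH.[BaseManagedEntityId] = R.[SourceEntityId]
WHERE R.[IsDeleted] = 0
    AND R.[RelationshipTypeId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
ORDER BY T.[DisplayName], PXH.[DisplayName]
```

A virtual server that shows up with more than one managing HealthService is one where the proxy shown in the console cannot be trusted.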

Here is a SQL snippet that given the cluster node name (FQDN) will delete all the relationships to cluster virtual servers:

declare @nodeHS nvarchar(255)
-- Replace FQDN with the fully qualified name of the cluster node
set @nodeHS = N'Microsoft.SystemCenter.HealthService:FQDN'

declare @relId uniqueidentifier
declare @discoSource uniqueidentifier
declare @now datetime
set @now = GETUTCDATE()

-- All live HealthServiceShouldManageEntity relationships whose source is the node's HealthService
DECLARE Rel_Cursor CURSOR FOR
SELECT [RelationshipGenericView].[Id]
FROM dbo.RelationshipGenericView
WHERE [RelationshipGenericView].[MonitoringRelationshipClassId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
    AND [RelationshipGenericView].[IsDeleted] = 0
    AND [RelationshipGenericView].[SourceMonitoringObjectId]
        IN (select BaseManagedEntityId from BaseManagedEntity where FullName = @nodeHS)

OPEN Rel_Cursor;

FETCH NEXT FROM Rel_Cursor INTO @relId;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Reset, so a relationship without a cluster discovery source does not reuse the previous value
    SET @discoSource = NULL

    SELECT @discoSource = DSTR.[DiscoverySourceId]
    --SELECT *
    FROM dbo.[DiscoverySourceToRelationship] DSTR
        inner join dbo.[DiscoverySource] DS on DS.DiscoverySourceId = DSTR.DiscoverySourceId
        inner join Discovery on Discovery.DiscoveryId = DS.DiscoveryRuleId
    WHERE [DiscoveryName] = 'Microsoft.Windows.Cluster.Classes.Discovery'
        AND [RelationshipId] = @relId
        AND DSTR.[IsDeleted] = 0

    -- Only remove the relationship from the cluster discovery source when one was found
    IF @discoSource IS NOT NULL
        exec dbo.p_RemoveRelationshipFromDiscoverySourceScope
            @RelationshipId = @relId,
            @DiscoverySourceId = @discoSource,
            @TimeGenerated = @now

    FETCH NEXT FROM Rel_Cursor INTO @relId;
END;

CLOSE Rel_Cursor;

DEALLOCATE Rel_Cursor;

After executing this SQL statement the cluster node and virtual servers can be deleted from the Operations Console. Obviously this is totally unsupported. So do this at your own risk, but if you followed my analysis it is clear it should be safe.
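Before running the script, it may be worth checking what it would touch. The following read-only sketch uses the same tables and filters as the cursor above but only lists the matching relationship and discovery source IDs. The FQDN placeholder is the same one used in the script, and the assumption that RelationshipId belongs to DiscoverySourceToRelationship is mine:

```sql
DECLARE @nodeHS nvarchar(255)
SET @nodeHS = N'Microsoft.SystemCenter.HealthService:FQDN' -- replace FQDN with the node name

-- Dry run: list what the cursor script would remove, without changing anything.
SELECT RGV.[Id] AS RelationshipId,
       DSTR.[DiscoverySourceId]
FROM dbo.RelationshipGenericView AS RGV
    INNER JOIN dbo.[DiscoverySourceToRelationship] AS DSTR
        ON DSTR.[RelationshipId] = RGV.[Id] -- assumption: RelationshipId is a DSTR column
    INNER JOIN dbo.[DiscoverySource] AS DS
        ON DS.DiscoverySourceId = DSTR.DiscoverySourceId
    INNER JOIN Discovery
        ON Discovery.DiscoveryId = DS.DiscoveryRuleId
WHERE RGV.[MonitoringRelationshipClassId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
    AND RGV.[IsDeleted] = 0
    AND DSTR.[IsDeleted] = 0
    AND [DiscoveryName] = 'Microsoft.Windows.Cluster.Classes.Discovery'
    AND RGV.[SourceMonitoringObjectId]
        IN (SELECT BaseManagedEntityId FROM BaseManagedEntity WHERE FullName = @nodeHS)
```

If this returns no rows for a node, the script has nothing to remove for it and the problem is probably elsewhere.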

Cluster discovery is noisy: this means too many 21025 events (i.e. agent reloads), and the way it is performed on every node can have a serious impact on cluster nodes. While I was about to post the details of the noisy discovery (starting with the MAC address mismatch with OS discovery), a new cluster MP was released (6.0.6720.0), which promises to solve a lot of the issues I detected in previous versions. So this one is strongly recommended. Btw, it adds support for Windows Server 2008 R2 failover clusters (yes, CSV included).

- Daniele

This posting is provided “AS IS” with no warranties, and confers no rights.

  1. #1 by Jerry Ko on November 2, 2012 - 4:21 am

    Thanks so much. It is workable for me.

  2. #2 by GI on March 5, 2012 - 6:37 am

    I wish I could take credit for this but a guy from my team figured out that you can get rid of the clustered agents if you evict servers from the cluster and/or destroy the cluster. Then you’ll be able to remove the agent managed clients, and agentless managed will disappear straight away. Once you destroy the cluster it takes a little while until AD is replicated and ops manager picks up on the change. If you’re just evicting one member of a cluster you will need to uninstall the agent manually and, once evicted from the cluster, re-install the agent manually so that ops manager notices the change.

    obviously, if the cluster is already dead you might have a problem so might need to revert to the above DB hack.

  3. #3 by Michael on December 22, 2010 - 9:57 am

    Hi

    Two questions:
    - What is the purpose of the first sql query? Also, where do i find the ManagedTypeId?

    - I ran the second query, which executed successfully. When should the Agentless Managed be removed from the console?

    Thanks!
    Michael

    • #4 by Daniele Grandini on December 27, 2010 - 10:52 am

      Hi Michael,
      the first query just returns all the agentless managed computer just like the OpsMgr Console does. The ManagedTypeId can be queried in dbo.ManagedType table.
      After you execute the query *and* remove the Agent Proxy check from the cluster nodes, the virtual nodes should disappear from console and you’ll be able to delete your cluster nodes.

      Regards
      Daniele

      • #5 by Michael on December 28, 2010 - 11:05 am

        I ran the query, removed the Agent Proxy check, but the node is still not disappearing, and i cannot delete it.

        Any suggestions?

        Thanks
        Michael

      • #6 by Daniele Grandini on December 29, 2010 - 10:28 am

        Very strange, I must assume you have more relationships left for your virtual cluster servers. How many rows are returned from the first query for your virtual servers?

      • #7 by Michael on December 30, 2010 - 1:54 pm

        (cant reply your latest reply, dunno why)

        To be honest with you, i cant really run that query. I am getting:

        Msg 102, Level 15, State 1, Line 1
        Incorrect syntax near ‘’’.
        Msg 137, Level 15, State 2, Line 11
        Must declare the scalar variable “@ManagedTypeId”.

        I have to use the Windows Computer class right? Thank you!

      • #8 by Daniele Grandini on December 31, 2010 - 5:24 pm

        Hi Michael,
        it’s an issue related to Live Writer (the tool I use to edit my posts). Just replace the curly quotes with straight single quotes (') and run the query. In the returned rows is the solution to your problem.
        Happy New Year
        Daniele

  4. #9 by Peter on July 19, 2010 - 9:36 pm

    This worked for me after converting the HTML apostrophes and removing the “-SELECT *” line. Thanks a ton!

  5. #10 by philip on May 12, 2010 - 3:06 pm

    Hi,

    I seem to stuck in the same situation, when i run your script against the operations manager db i receive an error

    “The multi-part identifier “DSTR.DiscoverySourceId” could not be bound.”

    Am i missing something?

    Cheers

    • #11 by Daniele Grandini on May 15, 2010 - 5:10 pm

      Hi Philip, this error is related to a syntax error in TSQL. DSTR is the alias associated with the dbo.[DiscoverySourceToRelationship] table and it definitely contains a field called DiscoverySourceId. Please check the query syntax and if the error persists try to post the query you’re using.

      ” SELECT @discoSource=DSTR.[DiscoverySourceId]
      FROM dbo.[DiscoverySourceToRelationship] DSTR …”

  6. #12 by Maz on March 26, 2010 - 2:33 am

    Just letting you know I ran into this same issue, however found a way just now to remove them easily enough from the console.

    Ensure that both the clustered nodes no longer have the ‘Allow this agent to act as a proxy and discover managed objects on other computers’ checked on the ‘security’ tab under Agent managed.

    You should be able to then remove each of the nodes, and along with that the ‘agentless’ virtual monitoring should be removed :)

    I also did select ‘delete’ under agentless managed on the objects i needed to remove, just before unchecking the proxy setting, however that seemed to do nothing, but maybe it was part of the solution. never do know …

    • #13 by Daniele Grandini on May 15, 2010 - 5:14 pm

      Maz, when I read your comment I thought I was wrong with the entire analysis; in fact your method was aligned with the one cited in the release notes. It took a little while to retry the same scenario, but, at least in my environment, this doesn’t work. I need to use my query to delete the cluster nodes.

  7. #14 by Dominique on December 15, 2009 - 9:42 pm

    Hello,

    Just for information:
    declare @nodeHS nvarchar(255) is duplicated at the beginning of the snippet

    Thanks,
    Dom

  8. #16 by Dominique on December 15, 2009 - 9:35 pm

    Hello,

    I am trying to run this process but as I have several Cluster Servers within the RMS console I wonder where do I specify the one(s) i would like to delete or remove?
    Or does this script run on all Clusters?

    Thanks,
    Dom

    • #17 by Dominique on December 15, 2009 - 9:36 pm

      Set @nodeHS=N’Microsoft.SystemCenter.HealthService:FQDN’
      got it thanks

