R2 and cluster monitoring – issues and pitfalls
I think it’s time to wrap up the various answers we can find on cluster monitoring and decommissioning in R2.
Failover clustering is a high availability solution and my motto is “if you want to keep it HA you need to monitor it”. So we all expect OpsMgr to do an excellent job of monitoring failover clusters. In some respects it does, but you must be aware of glitches that are still around.
I observed at least four issues with failover cluster monitoring:
- abnormal CPU usage on the cluster node and on the controlling management server
- simply put you cannot decommission a cluster in a supported way
- the Agentless Managed view in the Administration space of the monitoring console is just useless
- cluster discovery is noisy
Abnormal CPU usage is caused by internal tasks taking the wrong route. In some cases the config service on the RMS thinks resources are on the wrong node (i.e. a node that does not own the resources) and sends health recalculation tasks to that node. In these cases you’ll see the following:
- the healthservice edb grows (we had examples of > 3 GB); this growth is caused by tasks queuing up
- healthservice.exe on the affected node consumes up to one CPU core
- healthservice.exe on the primary management server shows increased CPU usage
We’re waiting for a fix from PSS. At the moment the only workaround is a group failover, but hey, I’m supposed to run critical applications on my clusters, so a failover cannot be considered a workaround. Btw things can go bad again after a while, so I would call this a temporary patch that will break again sooner or later.
Decommissioning is another issue. The short story is that you can remove the agents from the cluster nodes, but you cannot remove them from your console. In other words you have zombies in the console.
There’s an *unsupported* workaround here http://ops-mgr.spaces.live.com/blog/cns!3D3B8489FCAA9B51!163.entry and a discussion thread here http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/10ffee08-b875-47af-b788-db07dbfa1b56.
See my own unsupported way later in this post; I’m not comfortable with the workaround cited above.
I want to say that the following rapid publishing KB just doesn’t work in my case (and from what I found it should never work): OpsMgr 2007: How to decommission a cluster monitored by System Center Operations Manager 2007
I want to add that the release notes are not clear either:
“You cannot uninstall or delete an agent from a node in a cluster
When you try to uninstall or delete an agent from a node in cluster, the following error message is displayed:
Agent is managing other devices and cannot be uninstalled. Please resolve this issue via Agentless managed view in Administration prior to attempting uninstall again.
Notice that the agent can be uninstalled from the node that is agentlessly managing the virtual servers. However, the agent cannot be uninstalled from the node that is managing the virtual servers.
Workaround: None at this time.” http://technet.microsoft.com/en-us/library/dd827187.aspx
The Agentless Managed view is useless in the sense that the results you see are unpredictable and the change proxy action won’t work. This is all related to how agentless management works for the UI.
An agentless managed Windows Computer (this is the class) is defined as a Windows Computer managed by a health service on another Windows Computer (using the HealthServiceShouldManageEntity relationship). This is the query the SDK runs against your live database, where the GUID is the ID of Microsoft.Windows.Computer:
exec sp_executesql N'-- AgentlessManagedDevicesByType <ManagedTypeId>
SELECT [T].[Id], [T].[Name], [T].[Path], [T].[FullName], [T].[DisplayName], [T].[IsManaged], [T].[IsDeleted], [T].[LastModified], [T].[TypedManagedEntityId], [T].[MonitoringClassId], [T].[TypedMonitoringObjectIsDeleted], [T].[HealthState], [T].[StateLastModified], [T].[IsAvailable], [T].[AvailabilityLastModified], [T].[InMaintenanceMode], [T].[MaintenanceModeLastModified], [PXH].[BaseManagedEntityId] AS [HealthServiceId], [PXH].[DisplayName] AS [ProxyAgentPrincipalName] FROM dbo.ManagedEntityGenericView AS T
INNER JOIN dbo.BaseManagedEntity AS BME
ON
BME.[BaseManagedEntityId] = T.[Id]
AND BME.[BaseManagedTypeId] = @ManagedTypeId
INNER JOIN dbo.Relationship AS R
ON R.[TargetEntityId] = T.[Id]
INNER JOIN dbo.BaseManagedEntity AS PXH
ON PXH.[BaseManagedEntityId] = R.[SourceEntityId]
WHERE ((
T.[IsDeleted] = 0 AND T.[TypedMonitoringObjectIsDeleted] = 0 AND R.[IsDeleted] = 0 AND
R.[RelationshipTypeId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
))',N'@ManagedTypeId uniqueidentifier',@ManagedTypeId='EA99500D-8D52-FC52-B5A5-10DCD1E9D2BD'
What happens for clusters is that two HealthServiceShouldManageEntity relationships are discovered. One comes from an internal discovery source (in my case with GUID “85AB926D-6E0F-4B36-A951-77CCD4399681”); I would call it the standard one, or the discovery source the SDK call SetProxyAgent works with. The other is discovered by the cluster management pack through the discovery rule “Microsoft.Windows.Cluster.Classes.Discovery”. This discovery is implemented as a native COM module in MOMModules.dll.
On top of that, every cluster node discovers the entire hierarchy of cluster resources (a huge load on every node for complex cluster implementations), so every cluster virtual server is managed by every single node. Let’s take a simple example: a basic two-node cluster (nodeA and nodeB) with just the cluster virtual server (CLUSTER). The Microsoft.Windows.Computer CLUSTER is discovered by both nodes, so nodeA and nodeB each have a HealthServiceShouldManageEntity relationship with CLUSTER with source Microsoft.Windows.Cluster.Classes.Discovery, plus another one with the standard discovery source.
So every cluster node ShouldManage every cluster virtual server, and for this reason the proxy assignment you see in the console is unpredictable. The selection is based on the HealthServiceShouldManageEntity relationship, which returns one row per cluster node for each cluster virtual server.
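The row multiplication can be pictured with a small Python model. This is only an illustrative sketch, not the SDK or database code: the function names are hypothetical and the rows stand in for the result of the relationship join described above.

```python
from itertools import product

def should_manage_rows(nodes, virtual_servers):
    """Model the cluster discovery: one (node, virtual server) row per pair."""
    return [(node, vs) for node, vs in product(nodes, virtual_servers)]

def proxy_candidates(rows, virtual_server):
    """All nodes that have a ShouldManage row for the given virtual server."""
    return sorted(node for node, vs in rows if vs == virtual_server)

# The two-node example from the text: nodeA and nodeB both discover CLUSTER.
rows = should_manage_rows(["nodeA", "nodeB"], ["CLUSTER"])
candidates = proxy_candidates(rows, "CLUSTER")

print(len(rows))    # 2 rows for a single virtual server
print(candidates)   # ['nodeA', 'nodeB']
```

With more than one candidate row per virtual server, whichever row the console happens to display as the proxy is arbitrary, which matches the unpredictable behaviour described above.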
Using SetAgentProxy from the SDK (or change proxy in the UI) is useless because it resets only the standard discovery source and not the Microsoft.Windows.Cluster.Classes.Discovery one.
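A minimal Python sketch of why changing the proxy has no visible effect, under stated assumptions: each relationship is modelled as a (node, virtual server) key holding the set of discovery sources that assert it, "standard" is a placeholder label for the internal discovery source, the choice of nodeA as the initial standard proxy is arbitrary, and set_proxy is a hypothetical stand-in for the real call.

```python
STANDARD = "standard"  # placeholder label for the internal discovery source
CLUSTER_MP = "Microsoft.Windows.Cluster.Classes.Discovery"

def initial_state():
    """(node, virtual server) -> discovery sources asserting the relationship."""
    return {
        ("nodeA", "CLUSTER"): {STANDARD, CLUSTER_MP},  # assumed initial proxy
        ("nodeB", "CLUSTER"): {CLUSTER_MP},
    }

def set_proxy(state, virtual_server, new_node):
    """Hypothetical model of a proxy change: only the standard source moves."""
    for (node, vs), sources in state.items():
        if vs == virtual_server:
            sources.discard(STANDARD)
    state.setdefault((new_node, virtual_server), set()).add(STANDARD)

def active_managers(state, virtual_server):
    """Nodes whose relationship still has at least one discovery source."""
    return sorted(n for (n, vs), src in state.items() if vs == virtual_server and src)

state = initial_state()
before = active_managers(state, "CLUSTER")
set_proxy(state, "CLUSTER", "nodeB")
after = active_managers(state, "CLUSTER")

print(before)  # ['nodeA', 'nodeB']
print(after)   # ['nodeA', 'nodeB']
```

In this model nodeA keeps its relationship to CLUSTER after the proxy change because the cluster management pack source still asserts it, so the set of managing nodes is the same before and after.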
The root cause of your inability to delete first the virtual servers and then the cluster nodes is this discovered relationship. So if you delete the relationship you’ll find your way home.
Here is a SQL snippet that given the cluster node name (FQDN) will delete all the relationships to cluster virtual servers:
-- Full name of the node's health service: replace FQDN with the cluster node's fully qualified domain name
declare @nodeHS nvarchar(255)
set @nodeHS = N'Microsoft.SystemCenter.HealthService:FQDN'

-- All live HealthServiceShouldManageEntity relationships whose source is that health service
DECLARE Rel_Cursor CURSOR FOR
(SELECT [RelationshipGenericView].[Id]
FROM dbo.RelationshipGenericView
WHERE ((RelationshipGenericView.[MonitoringRelationshipClassId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()) AND (((dbo.[RelationshipGenericView].[IsDeleted] = 0))))
AND ([RelationshipGenericView].[SourceMonitoringObjectId]
IN (select BaseManagedEntityId from BaseManagedEntity where FullName = @nodeHS)))
OPEN Rel_Cursor;

declare @relId uniqueidentifier
declare @discoSource uniqueidentifier
declare @now datetime
set @now = GETUTCDATE()

FETCH NEXT FROM Rel_Cursor INTO @relId;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Reset so a relationship without a cluster MP discovery source does not reuse the previous value
    SET @discoSource = NULL

    -- Find the discovery source created by the cluster MP discovery rule for this relationship
    SELECT @discoSource = DSTR.[DiscoverySourceId]
    --SELECT *
    FROM dbo.[DiscoverySourceToRelationship] DSTR
    inner join dbo.[DiscoverySource] DS on DS.DiscoverySourceId = DSTR.DiscoverySourceId
    inner join Discovery on Discovery.DiscoveryId = DS.DiscoveryRuleId
    WHERE [DiscoveryName] = 'Microsoft.Windows.Cluster.Classes.Discovery'
    AND [RelationshipId] = @relId
    AND DSTR.[IsDeleted] = 0

    -- Remove the relationship from that discovery source only
    IF @discoSource IS NOT NULL
        exec dbo.p_RemoveRelationshipFromDiscoverySourceScope
            @RelationshipId=@relId,
            @DiscoverySourceId=@discoSource,@TimeGenerated=@now

    FETCH NEXT FROM Rel_Cursor INTO @relId;
END;
CLOSE Rel_Cursor;
DEALLOCATE Rel_Cursor;
After executing this SQL statement the cluster node and virtual servers can be deleted from the Operations Console. Obviously this is totally unsupported. So do this at your own risk, but if you followed my analysis it is clear it should be safe.
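To picture what the script does, here is a rough Python model. It assumes (my assumption, not something verified against the product) that a relationship disappears once no discovery source asserts it any more, and it starts from a state where only the cluster management pack source remains on both nodes. The helper names are hypothetical.

```python
CLUSTER_MP = "Microsoft.Windows.Cluster.Classes.Discovery"

def remove_cluster_source(state, node):
    """Drop the cluster MP source from every relationship whose source is node."""
    for (n, _vs), sources in state.items():
        if n == node:
            sources.discard(CLUSTER_MP)

def remaining_relationships(state):
    """Relationships that still have at least one discovery source."""
    return sorted(key for key, sources in state.items() if sources)

# (node, virtual server) -> discovery sources asserting the relationship
state = {
    ("nodeA", "CLUSTER"): {CLUSTER_MP},
    ("nodeB", "CLUSTER"): {CLUSTER_MP},
}

remove_cluster_source(state, "nodeA")
remaining = remaining_relationships(state)
print(remaining)  # [('nodeB', 'CLUSTER')]
```

Once nodeA no longer has any relationship to a virtual server in this model, nothing blocks its removal, which is the effect the SQL above aims for on the real database.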
Cluster discovery is noisy: it generates too many 21025 events (i.e. agent reloads), and the way it is performed on every node can have a serious impact on cluster nodes. While I was about to post the details of the noisy discovery (starting with the MAC address mismatch with OS discovery), a new cluster MP was released (6.0.6720.0). It promises to solve a lot of the issues I detected in previous versions, so it is strongly recommended. Btw it adds support for Windows Server 2008 R2 failover clusters (yes, CSV included).
- Daniele
This posting is provided “AS IS” with no warranties, and confers no rights.
Hello,
Just for information:
declare @nodeHS nvarchar(255) is duplicated at the beginning of the snippet
Thanks,
Dom
Dominique
December 15, 2009 at 9:42 pm
thx just corrected
Daniele Grandini
December 16, 2009 at 9:10 am
Hello,
I am trying to run this process but as I have several Cluster Servers within the RMS console I wonder where do I specify the one(s) i would like to delete or remove?
Or does this script run on all Clusters?
Thanks,
Dom
Dominique
December 15, 2009 at 9:35 pm
Set @nodeHS=N'Microsoft.SystemCenter.HealthService:FQDN'
got it thanks
Dominique
December 15, 2009 at 9:36 pm