In my previous post, “OpsMgr 2007 R2 – lessons learned”, I reported our shortlist after the R2 upgrade. I’ve been asked to share more info on agent health and the common issues we faced.
First of all, I must stress that heartbeating agents are not necessarily healthy agents. We still have agents that send heartbeats but do not upload data to the OpsMgr infrastructure. This can have several causes, and in every case we faced a simple restart of the agent did the trick. I post the two queries we’re using below.
On the agent CPU usage issue I must refer you to two posts:
On runaway agents on cluster nodes, I can anticipate that a QFE is on the way. PSS didn’t give me an ETA, but if you’re affected by this issue, raise your voice so that we can shorten the release cycle. The issue is related to the internal task that recalculates the health of rollup monitors being routed to the wrong cluster node (i.e. the node that doesn’t own the affected resource).
1 – All agents and their last collected event, marked KO if older than 4 hours
-- #T: one row per HealthService (agent), with its FQDN and NetBIOS name
-- #E: last event time per logging computer in the past 8 hours
select #T.Path, #E.LoggingComputer, CAST(MAX(LastTime) as nvarchar(50)) As 'LastEvent',
CASE WHEN Isnull(MAX(LastTime),'01-01-80') < DateAdd(hh,-4,getutcdate()) Then 'KO' Else 'OK' END As 'Status' from
(
select CAST(ME.Path as nvarchar(255)) As [Path], CASE
WHEN IsNull([Path], '')='' THEN ''
WHEN CHARINDEX('.', [Path]) = 0 Then [Path]
ELSE SUBSTRING(Path,1,CHARINDEX('.', Path)-1)
END As 'Netbios'
from dbo.ManagedEntityGenericView ME (NOLOCK)
inner join dbo.ManagedTypeView MT (NOLOCK) on ME.MonitoringClassId=MT.Id
AND MT.Name = 'Microsoft.SystemCenter.HealthService'
) #T
left join
( select distinct LoggingComputer, MAX(TimeGenerated) As 'LastTime'
from dbo.EventView (NOLOCK) where TimeGenerated > dateadd(hh,-8,getutcdate()) group by LoggingComputer) #E
on #E.LoggingComputer = #T.Path or #E.LoggingComputer=#T.[Netbios]
group by Path, NetBios, LoggingComputer
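The Netbios CASE expression in query 1 is there because events may be logged under either the FQDN or the short name of the agent. A minimal sketch of what that expression returns, run against hypothetical sample names (srv01.contoso.local and srv02 are made up for illustration; no OpsMgr objects are touched):

```sql
-- Hypothetical sample names, only to show the name-shortening logic of query 1
select S.Path,
       CASE
         WHEN IsNull(S.Path, '') = '' THEN ''           -- NULL or empty path
         WHEN CHARINDEX('.', S.Path) = 0 THEN S.Path    -- already a short name
         ELSE SUBSTRING(S.Path, 1, CHARINDEX('.', S.Path) - 1)  -- text before the first dot
       END As Netbios
from ( select CAST('srv01.contoso.local' as nvarchar(255)) As Path
       union all select 'srv02'
       union all select NULL ) S
-- srv01.contoso.local -> srv01, srv02 -> srv02, NULL -> empty string
```

Joining on either form, as query 1 does, avoids marking an agent KO just because its events carry the other name.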
2 – All agents and their last collected performance point, marked KO if older than 4 hours
select CAST(ME.Path as nvarchar(255)), CAST(Max(TimeSampled) As nvarchar(50)) As 'LastSample',
CASE WHEN Isnull(MAX(TimeSampled),'01-01-80') < DateAdd(hh,-4,getutcdate()) Then 'KO' Else 'OK' END
from dbo.ManagedEntityGenericView ME WITH(NOLOCK)
inner join dbo.ManagedTypeView MT WITH(NOLOCK) on ME.MonitoringClassId=MT.Id AND MT.Name = 'Microsoft.SystemCenter.HealthService'
left join dbo.PerformanceCounterView C WITH(NOLOCK) on ME.Id = C.ManagedEntityId
left join dbo.PerformanceDataAllView P WITH(NOLOCK) on C.PerformanceSourceInternalId=P.PerformanceSourceInternalId and P.TimeSampled > dateadd(hh,-8,getutcdate())
group by ME.Path
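Both queries use the same test: data is looked up over the last 8 hours, and an agent is KO when its newest row is older than 4 hours or when no row was found at all (the NULL is replaced by a date in 1980). A small self-contained sketch of that test, using three hypothetical agents with made-up timestamps:

```sql
-- Hypothetical rows, only to illustrate the KO/OK test used in both queries
select S.Agent, S.LastTime,
       CASE WHEN IsNull(S.LastTime, '01-01-80') < DateAdd(hh, -4, getutcdate())
            THEN 'KO' ELSE 'OK' END As Status
from ( select 'recent' As Agent, DateAdd(hh, -1, getutcdate()) As LastTime  -- data 1 hour ago
       union all select 'stale', DateAdd(hh, -6, getutcdate())              -- data 6 hours ago
       union all select 'silent', NULL ) S                                  -- nothing in the window
-- recent -> OK, stale -> KO, silent -> KO
```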
Obviously, for the two queries to work you must make sure the relevant collection rules are in place. In our case we enabled collection of event ID 6022 (OpsMgr log, source HealthService Script); it is logged every 15 minutes and won’t place a significant load on your OpsMgr infrastructure. On the performance side we’re collecting HealthService and MonitoringHost CPU usage.
As anticipated, a restart is usually enough to get things going. This is where I discovered that the product has no feature for running a recovery from the proper Management Server, but that is another story that deserves a different post.
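If you only want the restart candidates, one possible variant of query 2 moves the KO test into a HAVING clause so that healthy agents are filtered out. This is a sketch of mine rather than one of the queries we run in production; it assumes the same views and collection rules as above, and the column aliases are added for readability:

```sql
-- Variant of query 2: return only agents with no performance sample in the last 4 hours
select CAST(ME.Path as nvarchar(255)) As [Path],
       CAST(MAX(P.TimeSampled) as nvarchar(50)) As 'LastSample'
from dbo.ManagedEntityGenericView ME WITH(NOLOCK)
inner join dbo.ManagedTypeView MT WITH(NOLOCK)
        on ME.MonitoringClassId = MT.Id AND MT.Name = 'Microsoft.SystemCenter.HealthService'
left join dbo.PerformanceCounterView C WITH(NOLOCK) on ME.Id = C.ManagedEntityId
left join dbo.PerformanceDataAllView P WITH(NOLOCK)
       on C.PerformanceSourceInternalId = P.PerformanceSourceInternalId
      and P.TimeSampled > dateadd(hh, -8, getutcdate())
group by ME.Path
-- same KO condition as query 2, applied as a filter instead of a column
having IsNull(MAX(P.TimeSampled), '01-01-80') < DateAdd(hh, -4, getutcdate())
```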
This posting is provided "AS IS" with no warranties, and confers no rights.