OpsMgr 2007 R2 – lessons learned reprise

In my previous post, “OpsMgr 2007 R2 – lessons learned”, I reported our shortlist after the R2 upgrade. Since then I’ve been asked to share more info on agent health and the common issues we faced.

First of all, I must confirm that a heartbeating agent is not necessarily a healthy agent. We still have agents that send heartbeats but do not upload data to the OpsMgr infrastructure. This can have several causes, and in every case we faced a simple restart of the agent did the magic. I post the two queries we’re using below.

On the agent CPU usage issue, I refer you to two posts:

  1. Troubleshooting 21025 events – wrap up
  2. KB 933061 and some interesting effects

On runaway agents on cluster nodes, I can tell you in advance that a QFE is in the works. PSS didn’t give me an ETA, but if you’re affected by this issue, raise your voice so that we can shorten the release cycle. The issue is related to the internal task that recalculates health for rollup monitors being routed to the wrong cluster node (i.e. the node that doesn’t own the affected resource).

SQL Queries

1 – All agents and their last collected event, marked KO if older than 4 hours


select #T.Path, #E.LoggingComputer, CAST(MAX(LastTime) as nvarchar(50)) As 'LastEvent',
    CASE WHEN IsNull(MAX(LastTime), '19800101') < DateAdd(hh, -4, getutcdate()) Then 'KO' Else 'OK' END As 'Status'
from
(
    -- one row per agent (HealthService instance), with its path and its short (NetBIOS) name
    select CAST(ME.Path as nvarchar(255)) As [Path],
        CASE
            WHEN IsNull(ME.[Path], '') = '' THEN ''
            WHEN CHARINDEX('.', ME.[Path]) = 0 Then ME.[Path]
            -- otherwise take the part of the path before the first dot
            ELSE SUBSTRING(ME.[Path], 1, CHARINDEX('.', ME.[Path]) - 1)
        END As 'Netbios'
    from dbo.ManagedEntityGenericView ME WITH (NOLOCK)
    inner join dbo.ManagedTypeView MT WITH (NOLOCK) on ME.MonitoringClassId = MT.Id
        AND MT.Name = 'Microsoft.SystemCenter.HealthService'
    where ME.IsDeleted = 0
) #T
left join
(
    -- last event collected from each computer in the past 8 hours
    select LoggingComputer, MAX(TimeGenerated) As 'LastTime'
    from dbo.EventView WITH (NOLOCK)
    where TimeGenerated > dateadd(hh, -8, getutcdate())
    group by LoggingComputer
) #E
on #E.LoggingComputer = #T.Path or #E.LoggingComputer = #T.[Netbios]
group by #T.Path, #T.Netbios, #E.LoggingComputer
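To see how the name matching and the 4-hour threshold behave without touching the OpsMgr database, here is a minimal sketch that runs on any SQL Server instance. The table variables, the server names and the sample timestamps are all made up for illustration, and it assumes the NetBIOS name is simply the part of the path before the first dot.

```sql
-- Hypothetical sample data: three agents, two of which have logged events.
declare @Agents table ([Path] nvarchar(255));
declare @Events table (LoggingComputer nvarchar(255), TimeGenerated datetime);

insert into @Agents values ('srv01.contoso.local');
insert into @Agents values ('srv02.contoso.local');
insert into @Agents values ('srv03');

-- srv01 logged 10 minutes ago under its FQDN,
-- srv02 logged 6 hours ago under its short name,
-- srv03 has no events at all.
insert into @Events values ('srv01.contoso.local', dateadd(mi, -10, getutcdate()));
insert into @Events values ('srv02', dateadd(hh, -6, getutcdate()));

select A.[Path], E.LoggingComputer, MAX(E.TimeGenerated) As 'LastEvent',
    -- a missing event is treated as very old, so it falls on the KO side
    CASE WHEN IsNull(MAX(E.TimeGenerated), '19800101') < dateadd(hh, -4, getutcdate())
         Then 'KO' Else 'OK' END As 'Status'
from
(
    -- derive the short name: everything before the first dot, or the whole path
    select [Path],
        CASE WHEN CHARINDEX('.', [Path]) = 0 Then [Path]
             ELSE SUBSTRING([Path], 1, CHARINDEX('.', [Path]) - 1) END As Netbios
    from @Agents
) A
left join @Events E
    on E.LoggingComputer = A.[Path] or E.LoggingComputer = A.Netbios
group by A.[Path], E.LoggingComputer
order by A.[Path];
```

With this sample data srv01 comes back as OK, srv02 as KO (its last event is older than 4 hours) and srv03 as KO with a NULL LastEvent, which is the case of an agent that is not uploading anything.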

2 – All agents and their last collected performance point, marked KO if older than 4 hours

select CAST(ME.Path as nvarchar(255)) As 'Path', CAST(MAX(TimeSampled) As nvarchar(50)) As 'LastSample',
    CASE WHEN IsNull(MAX(TimeSampled), '19800101') < DateAdd(hh, -4, getutcdate()) Then 'KO' Else 'OK' END As 'Status'
from dbo.ManagedEntityGenericView ME WITH (NOLOCK)
inner join dbo.ManagedTypeView MT WITH (NOLOCK) on ME.MonitoringClassId = MT.Id
    AND MT.Name = 'Microsoft.SystemCenter.HealthService'
left join dbo.PerformanceCounterView C WITH (NOLOCK) on ME.Id = C.ManagedEntityId
left join dbo.PerformanceDataAllView P WITH (NOLOCK) on C.PerformanceSourceInternalId = P.PerformanceSourceInternalId
    and P.TimeSampled > dateadd(hh, -8, getutcdate())
where ME.IsDeleted = 0
group by ME.Path

Obviously, for the two queries to work you must have collection rules in place. In our case we enabled collection of event ID 6022 (OpsMgr log, source HealthService Script); this event is logged every 15 minutes, so collecting it won’t place a significant load on your OpsMgr infrastructure. On the performance side we’re collecting HealthService and MonitoringHost CPU usage.
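If you want to check that the 6022 collection is actually producing data, a query along these lines can help. Take it as a sketch: it assumes dbo.EventView exposes the event ID in a column named Number, which you should verify against your own database before relying on it.

```sql
-- Assumption: dbo.EventView exposes the event ID in a column named Number.
-- Counts the 6022 events collected from each computer in the last 8 hours.
select LoggingComputer, COUNT(*) As 'Events6022', MAX(TimeGenerated) As 'LastTime'
from dbo.EventView WITH (NOLOCK)
where Number = 6022
    and TimeGenerated > dateadd(hh, -8, getutcdate())
group by LoggingComputer
order by LoggingComputer
```

Since the event is logged every 15 minutes, an agent that is uploading regularly should show roughly 32 events over the 8-hour window; a much lower count, or no row at all, is worth a closer look.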

As mentioned above, a restart is usually enough to get things going. This is where I discovered that the product lacks the ability to run a recovery from the proper Management Server, but that is another story that deserves a different post.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  1. #1 by marcus oh on July 30, 2009 - 2:16 pm


    do you have a version of this sans formatting? the quotes are all jacked up. thanks!
