Are your #scom availability reports gray?


After quite a long time, I had to assess IT service availability using some custom Key Health Indicators. Operations Manager, with its service-based model, is the right tool for this kind of assessment, and the customer involved has a System Center 2012 Operations Manager (OpsMgr) deployment. I confidently started drilling through the standard Availability reports when, to my surprise, I found that for many services and entities they reported a “Monitoring Unavailable” state. The problem was that I knew the health services had been up and running during the selected time window.

To be clear, this was the net effect:

 

[Screenshot: Availability report showing the gray “monitoring unavailable” state]

I immediately cross-checked other OpsMgr management groups and got the same results, albeit with a different mix or percentage of entities with monitoring unavailable.

As usual, the troubleshooting took some reverse engineering. My conclusion is that there is a bug in the process that synchronizes and writes health service outages to the Data Warehouse. In particular, in certain cases I have not been able to identify, the health service’s transition from unavailable back to available is lost. As a result, the health service and all the entities it monitors are marked as “monitoring unavailable” in the aggregated state tables. This is a serious issue: once the data has been aggregated, the actual health state is irretrievably lost.

I opened a CSS case, but after a few weeks I still had no answer (sigh). So I am sharing a couple of T-SQL scripts with the community.

The first one simply assesses whether you are affected by the problem, by cross-checking availability data between the live (operational) database and the data warehouse. The query assumes that both databases are hosted on the same SQL Server instance:

USE OperationsManagerDW;

-- Open health service outages in the DW (EndDateTime is null) whose health service
-- is reported as available in the live database
select *
from vHealthServiceOutage hso
inner join dbo.vManagedEntity me on me.ManagedEntityRowId = hso.ManagedEntityRowId
inner join dbo.vManagedEntityManagementGroup memg on memg.ManagedEntityRowId = me.ManagedEntityRowId and memg.ToDateTime is null
inner join OperationsManager.dbo.Availability Live on Live.BaseManagedEntityId = me.ManagedEntityGuid and Live.IsAvailable = 1
where hso.EndDateTime is null
  and hso.ReasonCode in (select ReasonCode from StateHealthServiceOutage)
order by hso.StartDateTime desc;

If the query returns any rows, you have health services marked as unavailable in the data warehouse that are in fact available according to the live database.
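If you only want a quick list of the affected health services rather than every column, a narrower variant of the same query may be handier. This is a sketch that reuses the views and joins of the diagnostic query; the `DisplayName` column of `vManagedEntity`, the `OpenHours` alias, and the assumption that `StartDateTime` is stored in UTC are mine, not part of the original script.

```sql
USE OperationsManagerDW;

-- Sketch: same selection criteria as the diagnostic query, returning a short summary
-- per affected health service (assumes vManagedEntity exposes DisplayName)
select me.DisplayName,
       hso.StartDateTime,
       -- hypothetical alias: hour boundaries crossed since the outage started,
       -- assuming StartDateTime is in UTC (approximate open duration)
       datediff(hour, hso.StartDateTime, getutcdate()) as OpenHours
from vHealthServiceOutage hso
inner join dbo.vManagedEntity me on me.ManagedEntityRowId = hso.ManagedEntityRowId
inner join dbo.vManagedEntityManagementGroup memg on memg.ManagedEntityRowId = me.ManagedEntityRowId and memg.ToDateTime is null
inner join OperationsManager.dbo.Availability Live on Live.BaseManagedEntityId = me.ManagedEntityGuid and Live.IsAvailable = 1
where hso.EndDateTime is null
  and hso.ReasonCode in (select ReasonCode from StateHealthServiceOutage)
order by hso.StartDateTime desc;
```

A large `OpenHours` value for a health service that is clearly reporting gives a rough idea of how much reporting history has been turned gray.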

The second query tries to fix the issue by resetting the situation in the Data Warehouse. I have never run this query in production; I only applied it in my lab. Do not use it in production: I don’t know whether this direct, unsupported modification of the data warehouse tables has any implications or side effects.

USE OperationsManagerDW;

-- Collect the row IDs of the open outages found by the diagnostic query
declare @ids table
(
    HealthServiceOutageRowId int
);

insert into @ids
select hso.HealthServiceOutageRowId
from vHealthServiceOutage hso
inner join dbo.vManagedEntity me on me.ManagedEntityRowId = hso.ManagedEntityRowId
inner join dbo.vManagedEntityManagementGroup memg on memg.ManagedEntityRowId = me.ManagedEntityRowId and memg.ToDateTime is null
inner join OperationsManager.dbo.Availability Live on Live.BaseManagedEntityId = me.ManagedEntityGuid and Live.IsAvailable = 1
where hso.EndDateTime is null
  and hso.ReasonCode in (select ReasonCode from StateHealthServiceOutage);

-- Close those outages as of the current UTC time
update dbo.HealthServiceOutage
set EndDateTime = getutcdate()
where HealthServiceOutageRowId in (select HealthServiceOutageRowId from @ids);
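Given the warning about production use, one more cautious way to try the update, even in a lab, is to run it inside a transaction, inspect what it touched, and roll back unless the result looks right. This is a minimal sketch built from the same objects as the fix script; the transaction wrapper and the review queries are my additions, not a tested or supported procedure, and committing it carries the same risks as the original update.

```sql
USE OperationsManagerDW;

declare @ids table (HealthServiceOutageRowId int);

-- Same selection as the fix script: open DW outages whose health service is available in the live DB
insert into @ids
select hso.HealthServiceOutageRowId
from vHealthServiceOutage hso
inner join dbo.vManagedEntity me on me.ManagedEntityRowId = hso.ManagedEntityRowId
inner join dbo.vManagedEntityManagementGroup memg on memg.ManagedEntityRowId = me.ManagedEntityRowId and memg.ToDateTime is null
inner join OperationsManager.dbo.Availability Live on Live.BaseManagedEntityId = me.ManagedEntityGuid and Live.IsAvailable = 1
where hso.EndDateTime is null
  and hso.ReasonCode in (select ReasonCode from StateHealthServiceOutage);

begin transaction;

update dbo.HealthServiceOutage
set EndDateTime = getutcdate()
where HealthServiceOutageRowId in (select HealthServiceOutageRowId from @ids);

-- Rows touched by the update; expected to match the number of rows in @ids
select @@ROWCOUNT as UpdatedRows;

-- Review the modified rows while the transaction is still open
select *
from dbo.HealthServiceOutage
where HealthServiceOutageRowId in (select HealthServiceOutageRowId from @ids);

-- Dry run: undo everything. Replace with COMMIT TRANSACTION only after reviewing the output.
rollback transaction;
```

Keeping the whole script in one batch keeps the transaction, and therefore any locks on the outage table, short.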

 

– Daniele

This posting is provided “AS IS” with no warranties, and confers no rights.

  1. #1 by Daniele Grandini on August 24, 2015 - 11:10 am

    Hi Peter,
    Honestly, I don’t know whether it has been solved, but if you run the query you can check immediately.

  2. #2 by Peter on August 19, 2015 - 1:36 pm

    Hi
    Did UR3, 4, 5 or later ever solve this, or is it still necessary to apply this fix?

  3. #3 by Philippe Augras on July 5, 2013 - 3:35 pm

    AWESOME! I’ve seen this a couple of times in the last few months. From what I had read in the forums, it was a DW bug, but I never found a clear explanation like yours. Thanks a lot, Daniele, you rock :).

    • #4 by Daniele Grandini on July 19, 2013 - 7:31 am

      Hi Philippe, I just got a private fix from CSS that addresses this issue; it should be included in UR3 later this month.

  4. #5 by Curtiss on July 5, 2013 - 2:46 pm

    I ran into the same issue. This was supposed to have been fixed in UR2, but I’m still seeing it. The basic Microsoft process was:
    - identify the entities with a null end date time
    - give them an end date time
    - change the “DirtyInd” value to mark those entities as eligible for aggregation
    - disable the standard data set aggregation rule
    - manually aggregate the state data
    - re-enable the standard data set rule.

    • #6 by Daniele Grandini on July 5, 2013 - 2:55 pm

      Hi Curtiss,
      yes, you can rebuild the aggregations this way as long as the state changes are still in the raw data tables… :-)
