Troubleshooting StateChangeEvent growth


In the past few weeks we had a major issue with our OpsMgr infrastructure. The net effect was the OpsMgr database filling up in a very short period of time (less than 24 hours). Using SQL reporting to check for the guilty table pointed us to the StateChangeEvent table: 6 million rows and 22 GB of used space. This post explains how to troubleshoot such a situation, for both the live database and the data warehouse.

A call to the product team: please, as you did for alert storms, do something for state change storms.

For those interested, you can find all the queries used in this post here.

As described in the preamble, the first thing to do is to find which table is filling up the database; the easiest way is to run the "Disk Usage by Top Tables" standard SQL Server report.


In our case we found the StateChangeEvent to be the bad guy in town.
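
If you prefer a query over the report, something along these lines gives a similar picture. This is a minimal sketch based on sys.dm_db_partition_stats, not taken from the original post:

-- Rough equivalent of "Disk Usage by Top Tables": rows and reserved space per table
SELECT TOP 10
    s.name + '.' + t.name AS TableName,
    SUM(CASE WHEN p.index_id IN (0,1) THEN p.row_count ELSE 0 END) AS [RowCount],
    SUM(p.reserved_page_count) * 8 / 1024 AS ReservedMB
FROM sys.dm_db_partition_stats p
JOIN sys.tables t ON t.object_id = p.object_id
JOIN sys.schemas s ON s.schema_id = t.schema_id
GROUP BY s.name, t.name
ORDER BY ReservedMB DESC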

The next step is to collect some data about which monitors are causing the flood and when the flooding started. Two queries here: the first identifies the guilty monitors, the second gives an insight into the trend over the last 90 days (probably too much, but I will come back to this later).

-- most contributing state change entities in the last 24 hours
select TOP 10 BME.BaseManagedEntityId, BME.FullName, BME.Path
    ,Count(*) 'StateChange'
from StateChangeEvent SCE WITH (NOLOCK)
inner join dbo.State S WITH (NOLOCK) on S.StateId=SCE.StateId
inner join dbo.BaseManagedEntity BME WITH (NOLOCK) on BME.BaseManagedEntityId=S.BaseManagedEntityId
where SCE.TimeAdded > DateAdd(hh,-24,getutcdate())
group by BME.BaseManagedEntityId, BME.FullName, BME.Path
order by StateChange DESC

-- state change events per day, last 90 days

select
datepart(year, timegenerated) As 'year', datepart(month, timegenerated) 'month', datepart(day, timegenerated) 'day',
count(*) 'changes'
from statechangeevent with(nolock)
inner join state with(nolock)
on statechangeevent.stateid = state.stateid
inner join basemanagedentity with(nolock)
on state.basemanagedentityid = basemanagedentity.basemanagedentityid
where timegenerated > dateadd(d, -90, getdate())
group by datepart(year, timegenerated), datepart(month, timegenerated), datepart(day, timegenerated)
order by datepart(year, timegenerated), datepart(month, timegenerated), datepart(day, timegenerated)

Now it should be clear which monitors are misbehaving and when the flooding started. The first thing to do is to disable the involved agent(s), so that your infrastructure gets some relief. Then you have several options to resolve the issue. Our approach was simple enough: this is bad data we don't want in our database, so we just deleted it, keeping the last hour of changes (just in case).

-- Selective deletion keeping last hour of changes

DECLARE @me as uniqueidentifier
DECLARE @lastChange as datetime
DECLARE @RowCount int

-- you can get this ID from the first query "most contributing state change entities in the last 24 hours"
SET @me = 'AA3B485E-3C21-8344-11CB-4C93EE28C317'

SELECT @lastChange = MAX(TimeGenerated)
FROM dbo.StateChangeEvent SCE
JOIN dbo.State S WITH(NOLOCK)
    ON SCE.[StateId] = S.[StateId]
WHERE S.BaseManagedEntityId = @me

SET @lastChange = DATEADD(hh, -1, @lastChange)
-- convert to UTC
EXEC dbo.p_ConvertLocalTimeToUTC @lastChange, @lastChange OUT

-- delete the dependent MonitoringJobStatus rows first
DELETE MJS
FROM dbo.MonitoringJobStatus MJS
JOIN dbo.StateChangeEvent SCE
    ON SCE.StateChangeEventId = MJS.StateChangeEventId
JOIN dbo.State S WITH(NOLOCK)
    ON SCE.[StateId] = S.[StateId]
WHERE S.BaseManagedEntityId = @me AND SCE.TimeGenerated < @lastChange

SET @RowCount = 1

-- delete the state change events in batches to keep the transaction log in check
WHILE (@RowCount > 0)
BEGIN

    DELETE TOP (100000) SCE
    FROM dbo.StateChangeEvent SCE
    JOIN dbo.State S WITH(NOLOCK)
        ON SCE.[StateId] = S.[StateId]
    WHERE S.BaseManagedEntityId = @me AND SCE.TimeGenerated < @lastChange

    SET @RowCount = @@ROWCOUNT

END
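
After a bulk delete of this size it is also worth rebuilding the indexes on the table, just as we do after the scheduled grooming shown later in this post:

ALTER INDEX ALL ON dbo.StateChangeEvent REBUILD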

Unsupported but working for us.

The state change event grooming story

We found another interesting story. Our issue was caused by a bad monitor related to AdtServer monitoring that filled the database with 2 million rows worth 20 GB of data, but even after cleaning that up we still had too many rows in StateChangeEvent (4 million, actually). Since we are supposed to keep just 7 days of state change data, we clearly had too much data here. It turns out this behavior is documented in a post from Kevin Holman (http://blogs.technet.com/b/kevinholman/archive/2009/12/21/tuning-tip-do-you-have-monitors-constantly-flip-flopping.aspx): only state change events related to healthy monitors get groomed. We don't like this behavior. We will probably keep state change events for unhealthy monitors a little longer than for healthy ones, but not forever, and we don't want state change data for unknown monitors (unmonitored/white ones) cluttering our database. So we took the standard grooming stored procedure and created a new one that differentiates the grooming between healthy/unmonitored monitors and unhealthy monitors. Today we schedule the batch on a daily basis so that we no longer accumulate so much data in the StateChangeEvent table.

Our scheduling keeps healthy and unmonitored State Change Events for 7 days and unhealthy state change events for 30 days. After the deletion we always force an index rebuild. As usual: TOTALLY UNSUPPORTED.

exec [p_QND_StateChangeEventGroomingHard] 7, 0, 100000

exec [p_QND_StateChangeEventGroomingHard] 30, 1, 100000

ALTER INDEX ALL ON StateChangeEvent REBUILD
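
We run the three statements above as a daily SQL Server Agent job. Something along these lines creates it; this is a minimal sketch, not taken from the original post, and the job name, schedule time, database name and server name are assumptions:

USE msdb
GO
-- Hypothetical daily job running the hard grooming followed by an index rebuild
EXEC dbo.sp_add_job @job_name = N'QND - StateChangeEvent hard grooming'
EXEC dbo.sp_add_jobstep
    @job_name = N'QND - StateChangeEvent hard grooming',
    @step_name = N'Groom and rebuild',
    @subsystem = N'TSQL',
    @database_name = N'OperationsManager',   -- default operational db name, adjust if needed
    @command = N'exec [p_QND_StateChangeEventGroomingHard] 7, 0, 100000
exec [p_QND_StateChangeEventGroomingHard] 30, 1, 100000
ALTER INDEX ALL ON StateChangeEvent REBUILD'
EXEC dbo.sp_add_schedule
    @schedule_name = N'QND daily grooming',
    @freq_type = 4,             -- daily
    @freq_interval = 1,
    @active_start_time = 030000 -- 03:00
EXEC dbo.sp_attach_schedule
    @job_name = N'QND - StateChangeEvent hard grooming',
    @schedule_name = N'QND daily grooming'
EXEC dbo.sp_add_jobserver
    @job_name = N'QND - StateChangeEvent hard grooming',
    @server_name = N'(local)'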

 

-- p_QND_StateChangeEventGroomingHard

-- taken from StoredProcedure [dbo].[p_StateChangeEventGrooming]

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[p_QND_StateChangeEventGroomingHard]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[p_QND_StateChangeEventGroomingHard]
GO
/****** Object:  StoredProcedure [dbo].[p_StateChangeEventGrooming]    Script Date: 06/11/2010 09:54:43 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[p_QND_StateChangeEventGroomingHard]
    @Days as int,
    @All as bit,
    @batch as int
AS
BEGIN

    SET NOCOUNT ON

    DECLARE @Err int
    DECLARE @Ret int
    DECLARE @DaysToKeep tinyint
    DECLARE @GroomingThresholdLocal datetime
    DECLARE @GroomingThresholdUTC datetime
    DECLARE @TimeGroomingRan datetime
    DECLARE @MaxTimeGroomed datetime
    DECLARE @RowCount int
    SET @TimeGroomingRan = getutcdate()
-- mod
    SET @GroomingThresholdLocal = dbo.fn_GroomingThreshold(@Days, getdate())

    EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
    SET @Err = @@ERROR

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    SET @RowCount = 1  

    -- This is to update the settings table
    -- with the max groomed data
    SELECT @MaxTimeGroomed = MAX(TimeGenerated)
    FROM dbo.StateChangeEvent
    WHERE TimeGenerated < @GroomingThresholdUTC

    IF @MaxTimeGroomed IS NULL
        GOTO Success_Exit

    -- Instead of the FK DELETE CASCADE handling the deletion of the rows from
    -- the MJS table, do it explicitly. Performance is much better this way.

    IF @All=1
    BEGIN   
        DELETE MJS
        FROM dbo.MonitoringJobStatus MJS
        JOIN dbo.StateChangeEvent SCE
            ON SCE.StateChangeEventId = MJS.StateChangeEventId
        JOIN dbo.State S WITH(NOLOCK)
            ON SCE.[StateId] = S.[StateId]
        WHERE SCE.TimeGenerated < @GroomingThresholdUTC
    END
    ELSE
    BEGIN
        DELETE MJS
        FROM dbo.MonitoringJobStatus MJS
        JOIN dbo.StateChangeEvent SCE
            ON SCE.StateChangeEventId = MJS.StateChangeEventId
        JOIN dbo.State S WITH(NOLOCK)
            ON SCE.[StateId] = S.[StateId]
        WHERE SCE.TimeGenerated < @GroomingThresholdUTC
        AND S.[HealthState] in (0,1)
    END

    SELECT @Err = @@ERROR
    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    WHILE (@RowCount > 0)
    BEGIN
        -- Delete StateChangeEvents that are older than @GroomingThresholdUTC
        -- We are doing this in chunks in separate transactions on
        -- purpose: to avoid the transaction log growing too large.
        IF @All=1
        BEGIN
            DELETE TOP (@batch) SCE
            FROM dbo.StateChangeEvent SCE
            JOIN dbo.State S WITH(NOLOCK)
                ON SCE.[StateId] = S.[StateId]
            WHERE TimeGenerated < @GroomingThresholdUTC
        END
        ELSE
        BEGIN
            DELETE TOP (@Batch) SCE
            FROM dbo.StateChangeEvent SCE
            JOIN dbo.State S WITH(NOLOCK)
                ON SCE.[StateId] = S.[StateId]
            WHERE TimeGenerated < @GroomingThresholdUTC
            AND S.[HealthState] in (0,1)
        END

        SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

        IF (@Err <> 0)
        BEGIN
            GOTO Error_Exit
        END
    END   

    /*
    UPDATE dbo.PartitionAndGroomingSettings
    SET GroomingRunTime = @TimeGroomingRan,
        DataGroomedMaxTime = @MaxTimeGroomed
    WHERE ObjectName = 'StateChangeEvent'
    */
    SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END 

Success_Exit:
    PRINT 'Success'
    RETURN 0

Error_Exit:   
    PRINT 'Error'
    RETURN 1
END

 

The data warehouse story

Sadly this was not all: we had the same issue on the data warehouse side. There were too many state change events in the raw tables, so many that the standard aggregation would never complete, and tons of Health Service Modules / data warehouse errors in the event log with ids 31552, 31553 and so on.


What follows is pretty generic and can be applied to any standard dataset maintenance error. Even though we were confident this issue was related to the state change storm we had, we wanted to be sure, so we approached the whole thing from scratch.

First, check which dataset is involved with this simple query (replace the guid with the one reported in the health context, in my case '{821BEE3F-BE84-51F1-49C2-C87D93494D42}'):

select DS.SchemaName, DS.LastOptimizationActionDateTime, DS.LastOptimizationActionSuccessfulCompletionDateTime,
    SDAS.BaseTableName, SDAS.AggregationTypeId, SDA.LastGroomingDateTime,
    SDA.BuildAggregationStoredProcedureName, SDA.GroomStoredProcedureName
from dbo.StandardDataset DS
inner join dbo.ManagedEntity ME on ME.Name = DS.DatasetId
inner join dbo.StandardDatasetAggregation SDA on SDA.DatasetId=DS.DatasetId
inner join dbo.StandardDatasetAggregationStorage SDAS on SDAS.DatasetId=DS.DatasetId
where ME.ManagedEntityGuid='{821BEE3F-BE84-51F1-49C2-C87D93494D42}'

Once you know for sure which dataset it is, you can identify the bad data (in our case it was a state change issue, so we used the vStateRaw view to check for guilty monitors):

select top 10 ManagedEntityMonitorRowId, count(*) from State.vStateRaw
group by ManagedEntityMonitorRowId
order by COUNT(*) desc

You can then ask for more info on the monitors (you could as well run a single query with the appropriate joins). You should also try to get more information about the type of failure: the easiest way is to set the DebugLevel field (5 seems to log enough) in the StandardDataset table for the dataset you're interested in, and then poll the DebugMessage table to see what's going on; alternatively you can simply run the dataset maintenance manually and check the results (a sketch of this round trip follows after the monitor queries below). Remember to reset the DebugLevel field once you have collected enough data.

select * from dbo.vManagedEntityMonitor MEM
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=MEM.ManagedEntityRowId
where MEM.ManagedEntityMonitorRowId=511195

or, in just one step:

select top 10 S.ManagedEntityMonitorRowId, M.MonitorDefaultName, Count(*) from State.vStateRaw S
inner join dbo.vManagedEntityMonitor MEM on MEM.ManagedEntityMonitorRowId=S.ManagedEntityMonitorRowId
inner join dbo.vMonitor M on M.MonitorRowId = MEM.MonitorRowId
group by S.ManagedEntityMonitorRowId, M.MonitorDefaultName
order by COUNT(*) desc
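
As mentioned above, to see why the maintenance is failing you can raise the DebugLevel on the dataset, run the maintenance by hand and read the DebugMessage table. A rough sketch of that round trip follows; it assumes the DebugMessage table can simply be read with a SELECT and that the original DebugLevel was 0 (note the value before changing it). The dataset id is the one used throughout this post:

-- note the current value, then turn up logging for the dataset (5 seems to log enough)
select DatasetId, DebugLevel from dbo.StandardDataset
where DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'

update dbo.StandardDataset
set DebugLevel = 5
where DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'

-- run the maintenance manually and see how far it gets
exec dbo.StandardDatasetMaintenance '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'

-- read the debug output (column names vary; order by the timestamp/row id column in your version)
select top 50 * from dbo.DebugMessage with (nolock)

-- once done, put the debug level back to the value noted above
update dbo.StandardDataset
set DebugLevel = 0
where DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'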

Now that you have the troublesome ManagedEntityMonitorRowId(s), you can, if you want, start deleting the state change events from the raw tables, for example:

delete State.StateRaw_5E20669A64C84E06964070CA3670A1EA where ManagedEntityMonitorRowId=511195
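
The data for a dataset is spread over several StateRaw_* partition tables, so the same delete has to be run against each of them; a cursor with dynamic SQL can do that for you. This is a sketch only, not from the original post, assuming the partition tables all live in the State schema and follow the StateRaw_<guid> naming seen above; the ManagedEntityMonitorRowId is the example one from this post:

-- run the same selective delete against every StateRaw_* table in the State schema
DECLARE @table sysname
DECLARE @sql nvarchar(max)

DECLARE raw_tables CURSOR FOR
    SELECT t.name
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id
    WHERE s.name = 'State' AND t.name LIKE 'StateRaw[_]%'

OPEN raw_tables
FETCH NEXT FROM raw_tables INTO @table
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'DELETE State.' + QUOTENAME(@table)
             + N' WHERE ManagedEntityMonitorRowId = 511195'
    EXEC sp_executesql @sql
    FETCH NEXT FROM raw_tables INTO @table
END
CLOSE raw_tables
DEALLOCATE raw_tables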

You should repeat the process for every raw table involved, or you can use a cursor (such as the sketch above), but since there were only a few tables we did it manually. This deletes all the state change events for the monitor(s), which can be a little too much. Our approach was to identify what was going wrong and when, and then selectively delete only the bad data. Since our error was a timeout during aggregation (we got this info from the DebugMessage table), we could have taken a couple of different approaches:

  1. stop the RMS and manually execute dbo.StandardDatasetMaintenance (with the proper dataset id, in our case '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9') until all the aggregations are completed successfully (see below). This is the lossless approach, but it forces a stop of your OpsMgr infrastructure.
  2. once the problematic time window has been identified, you can force the aggregation to be skipped by setting DirtyInd = 0 and AggregationCount = 9 in the StandardDatasetAggregationHistory table for the dataset and time window you want to skip (see the sketch right after this list). This way you skip the aggregation of the troublesome time window for *all* the monitors (probably not the way you want to go).

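For completeness, a rough sketch of option 2 (not something we actually ran): you need the AggregationDateTime and AggregationTypeId of the window to skip, taken from the StandardDatasetAggregationHistory queries below; the AggregationTypeId value for hourly aggregations is an assumption here.

-- mark the troublesome window as already processed so the aggregation skips it
update dbo.StandardDatasetAggregationHistory
set DirtyInd = 0,
    AggregationCount = 9
where DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'
  and AggregationTypeId = 20   -- hourly aggregation (assumption)
  and AggregationDateTime = '2010-06-09 20:00:00.000'
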
All in all, we decided to selectively delete the bad data, and this is the approach I recommend (remember, not supported). So, first check for the problematic time windows (or the oldest never-aggregated window, where the entire process is stuck):

-- oldest never aggregated (our dataset id was '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9')
    SELECT TOP 1
       StandardDatasetAggregationHistoryRowId
      ,ah.AggregationTypeId
      ,ah.AggregationDateTime
      ,DATEADD(minute, a.AggregationIntervalDurationMinutes, ah.AggregationDateTime)
    FROM StandardDatasetAggregationHistory ah
        JOIN StandardDatasetAggregation a ON ((a.DatasetId = ah.DatasetId) AND (a.AggregationTypeId = ah.AggregationTypeId))
    WHERE (ah.DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9')
      AND (DATEADD(minute, a.AggregationIntervalDurationMinutes + a.AggregationStartDelayMinutes, ah.AggregationDateTime) < GETUTCDATE()) -- aggregation interval + slight delay is in the past
      AND (DirtyInd = 1)
      AND (ah.AggregationCount = 0) -- interval was never aggregated
    ORDER BY ah.AggregationTypeId ASC, ah.AggregationDateTime ASC

Then check for problematic monitors in that time window (in the example below, 20:00 to 23:00):

select top 10 S.ManagedEntityMonitorRowId, M.MonitorDefaultName, Count(*) from State.vStateRaw S
inner join dbo.vManagedEntityMonitor MEM on MEM.ManagedEntityMonitorRowId=S.ManagedEntityMonitorRowId
inner join dbo.vMonitor M on M.MonitorRowId = MEM.MonitorRowId
where DateTime between '2010-06-09 20:00:00.000' AND '2010-06-09 23:00:00.000'
group by S.ManagedEntityMonitorRowId, M.MonitorDefaultName
order by COUNT(*) desc

And finally, if appropriate, delete the rows for the guilty monitors in that timeframe:

delete State.StateRaw_5E20669A64C84E06964070CA3670A1EA
where ManagedEntityMonitorRowId in (511194,511192,511191)
and DateTime between '2010-06-09 20:00:00.000' AND '2010-06-09 21:00:00.000'

Another query you may find interesting is the following one, which identifies the oldest dirty aggregation; a dirty aggregation is simply one that still needs some processing:

-- oldest dirty
      SELECT TOP 1
        StandardDatasetAggregationHistoryRowId
        ,ah.AggregationTypeId
        ,ah.AggregationDateTime
        ,DATEADD(minute, a.AggregationIntervalDurationMinutes, ah.AggregationDateTime)
      FROM StandardDatasetAggregationHistory ah
          JOIN StandardDatasetAggregation a ON ((a.DatasetId = ah.DatasetId) AND (a.AggregationTypeId = ah.AggregationTypeId))
      WHERE (ah.DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9')
        AND (DATEADD(minute, a.AggregationIntervalDurationMinutes + a.AggregationStartDelayMinutes, ah.AggregationDateTime) < GETUTCDATE()) -- aggregation interval + slight delay is in the past
        AND (DATEADD(minute, a.AggregationStartDelayMinutes, ISNULL(ah.DataLastReceivedDateTime, DATEADD(minute, a.AggregationIntervalDurationMinutes, ah.AggregationDateTime))) < GETUTCDATE()) -- data last received time + slight delay is in the past
        AND (DirtyInd = 1)
        AND (NOT EXISTS (SELECT *
                         FROM StandardDatasetAggregationHistory
                         WHERE (DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9')
                           AND (AggregationTypeId = ah.AggregationTypeId)
                           AND (DirtyInd = 1)
                           AND (AggregationDateTime < ah.AggregationDateTime)
                        )
            ) -- this is the oldest aggregation which is dirty
      ORDER BY ah.AggregationTypeId ASC, ah.AggregationDateTime ASC

Conclusions

That's it; it took us half a day to get rid of this nasty issue. The same troubleshooting steps can be applied to the alert, event and performance datasets as well. I hope you never hit such a problem, but if you do, these steps should help you recover your OpsMgr infrastructure.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.
