In the past few weeks we had a major issue with our OpsMgr infrastructure. The net effect was the OpsMgr db filling up in a very short period of time (less than 24 hours). Using SQL reporting to check for the guilty table pointed us to the StateChangeEvent table: six million rows and 22 GB of used space. The following post explains how to troubleshoot such a situation for both the live database and the data warehouse.
A call for the product team: please, as you did for alert storms, do something for state change storms.
For those interested you can find all the queries used in this post here.
As described in the preamble, the first thing to do is to find which dataset is filling up the database. The easiest way is the SQL reporting "Disk Usage by Top Tables" report.
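If you prefer a query over the report, a rough equivalent is the classic catalog-view pattern below, run against the OperationsManager database. This is a sketch: LOB space accounting is approximate with this join, but it is more than enough to spot a 22 GB table.

```sql
-- rough equivalent of the "Disk Usage by Top Tables" report:
-- top 10 tables by reserved space in the current database
SELECT TOP 10
    t.name AS TableName,
    SUM(CASE WHEN i.index_id IN (0, 1) THEN p.rows ELSE 0 END) AS [RowCount],
    SUM(a.total_pages) * 8 / 1024 AS ReservedMB
FROM sys.tables t
INNER JOIN sys.indexes i ON i.object_id = t.object_id
INNER JOIN sys.partitions p ON p.object_id = i.object_id AND p.index_id = i.index_id
INNER JOIN sys.allocation_units a ON a.container_id = p.partition_id
GROUP BY t.name
ORDER BY ReservedMB DESC
```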
In our case we found the StateChangeEvent to be the bad guy in town.
The next step is to collect some data about which monitors are causing the flood and when the flooding started. Two queries here: the first identifies the guilty monitors, the second gives an insight into the trend over the last 90 days (probably too much, but I will return to this later).
-- most contributing state change entities in the last 24 hours
select TOP 10 BME.BaseManagedEntityId, BME.FullName, BME.Path
,Count(*) 'StateChange'
from StateChangeEvent SCE WITH (NOLOCK)
inner join dbo.State S WITH (NOLOCK) on S.StateId=SCE.StateId
inner join dbo.BaseManagedEntity BME WITH (NOLOCK) on BME.BaseManagedEntityId=S.BaseManagedEntityId
where SCE.TimeAdded > DateAdd(hh,-24,getutcdate())
group by BME.BaseManagedEntityId, BME.FullName, BME.Path
order by StateChange DESC
-- state change events per day, last 90 days
select
datepart(year, timegenerated) As 'year', datepart(month, timegenerated) 'month', datepart(day, timegenerated) 'day',
count(*) 'changes'
from statechangeevent with(nolock)
inner join state with(nolock)
on statechangeevent.stateid = state.stateid
inner join basemanagedentity with(nolock)
on state.basemanagedentityid = basemanagedentity.basemanagedentityid
where timegenerated > dateadd(d, -90, getdate())
group by datepart(year, timegenerated), datepart(month, timegenerated), datepart(day, timegenerated)
order by datepart(year, timegenerated), datepart(month, timegenerated), datepart(day, timegenerated)
Now it should be clearer which monitors are misbehaving and when the flooding started. The first thing to do is to disable the involved agent(s), so that your infrastructure gets some relief. Then you have several options to resolve the issue. Our approach was simple enough: this is really bad data we don't want in our db, so we just deleted it, keeping the last hour of changes (just in case).
-- selective deletion keeping the last hour of changes
declare @me as uniqueidentifier
declare @lastChange as datetime
DECLARE @RowCount int
Set @me='AA3B485E-3C21-8344-11CB-4C93EE28C317' -- you can get this ID from the first query ("most contributing state change entities in the last 24 hours")
select @lastChange=MAX(TimeGenerated)
FROM dbo.StateChangeEvent SCE
JOIN dbo.State S WITH(NOLOCK)
ON SCE.[StateId] = S.[StateId]
WHERE S.BaseManagedEntityId=@me
Set @lastChange = DATEADD(hh,-1,@lastChange)
-- convert to UTC
EXEC dbo.p_ConvertLocalTimeToUTC @lastChange, @lastChange OUT
DELETE MJS
FROM dbo.MonitoringJobStatus MJS
JOIN dbo.StateChangeEvent SCE
ON SCE.StateChangeEventId = MJS.StateChangeEventId
JOIN dbo.State S WITH(NOLOCK)
ON SCE.[StateId] = S.[StateId]
WHERE S.BaseManagedEntityId=@me AND SCE.TimeGenerated < @lastChange
Set @RowCount=1
WHILE (@RowCount > 0)
BEGIN
DELETE TOP (100000) SCE
FROM dbo.StateChangeEvent SCE
JOIN dbo.State S WITH(NOLOCK)
ON SCE.[StateId] = S.[StateId]
WHERE S.BaseManagedEntityId=@me AND SCE.TimeGenerated < @lastChange
Set @RowCount = @@ROWCOUNT
END
Unsupported but working for us.
The state change event grooming story
We then found another interesting story. Our issue was caused by a bad monitor related to AdtServer monitoring that filled the db with two million rows worth 20 GB of data, but even after the cleanup we still had too many rows in StateChangeEvent (four million, actually). Since we are supposed to keep just 7 days of state change data, we clearly had too much here. It turns out this behavior is documented in a post from Kevin Holman (http://blogs.technet.com/b/kevinholman/archive/2009/12/21/tuning-tip-do-you-have-monitors-constantly-flip-flopping.aspx): only state change events related to healthy monitors get groomed. We don't like this behavior. We will probably keep state change events for unhealthy monitors a little longer than for healthy ones, but not forever, and we don't want state change data for unknown monitors (unmonitored/white ones) cluttering our db. So we took the standard grooming stored procedure and created a new one that grooms healthy/unmonitored monitors and unhealthy monitors differently. Today we schedule the batch on a daily basis, so we no longer accumulate so much data in the StateChangeEvent table.
Our scheduling keeps healthy and unmonitored State Change Events for 7 days and unhealthy state change events for 30 days. After the deletion we always force an index rebuild. As usual: TOTALLY UNSUPPORTED.
exec [p_QND_StateChangeEventGroomingHard] 7, 0, 100000
exec [p_QND_StateChangeEventGroomingHard] 30, 1, 100000
ALTER INDEX ALL ON StateChangeEvent REBUILD
-- p_QND_StateChangeEventGroomingHard
-- taken from StoredProcedure [dbo].[p_StateChangeEventGrooming]
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[p_QND_StateChangeEventGroomingHard]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[p_QND_StateChangeEventGroomingHard]
GO
/****** Object: StoredProcedure [dbo].[p_StateChangeEventGrooming] Script Date: 06/11/2010 09:54:43 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[p_QND_StateChangeEventGroomingHard]
@Days as int,
@All as bit,
@batch as int
AS
BEGIN
SET NOCOUNT ON
DECLARE @Err int
DECLARE @Ret int
DECLARE @DaysToKeep tinyint
DECLARE @GroomingThresholdLocal datetime
DECLARE @GroomingThresholdUTC datetime
DECLARE @TimeGroomingRan datetime
DECLARE @MaxTimeGroomed datetime
DECLARE @RowCount int
SET @TimeGroomingRan = getutcdate()
-- mod
SET @GroomingThresholdLocal = dbo.fn_GroomingThreshold(@Days, getdate())
EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
SET @Err = @@ERROR
IF (@Err <> 0)
BEGIN
GOTO Error_Exit
END
SET @RowCount = 1
-- This is to update the settings table
-- with the max groomed data
SELECT @MaxTimeGroomed = MAX(TimeGenerated)
FROM dbo.StateChangeEvent
WHERE TimeGenerated < @GroomingThresholdUTC
IF @MaxTimeGroomed IS NULL
GOTO Success_Exit
-- Instead of the FK DELETE CASCADE handling the deletion of the rows from
-- the MJS table, do it explicitly. Performance is much better this way.
IF @All=1
BEGIN
DELETE MJS
FROM dbo.MonitoringJobStatus MJS
JOIN dbo.StateChangeEvent SCE
ON SCE.StateChangeEventId = MJS.StateChangeEventId
JOIN dbo.State S WITH(NOLOCK)
ON SCE.[StateId] = S.[StateId]
WHERE SCE.TimeGenerated < @GroomingThresholdUTC
END
ELSE
BEGIN
DELETE MJS
FROM dbo.MonitoringJobStatus MJS
JOIN dbo.StateChangeEvent SCE
ON SCE.StateChangeEventId = MJS.StateChangeEventId
JOIN dbo.State S WITH(NOLOCK)
ON SCE.[StateId] = S.[StateId]
WHERE SCE.TimeGenerated < @GroomingThresholdUTC
AND S.[HealthState] in (0,1)
END
SELECT @Err = @@ERROR
IF (@Err <> 0)
BEGIN
GOTO Error_Exit
END
WHILE (@RowCount > 0)
BEGIN
-- Delete StateChangeEvents that are older than @GroomingThresholdUTC.
-- We do this in chunks in separate transactions on purpose,
-- to keep the transaction log from growing too large.
IF @All=1
BEGIN
DELETE TOP (@batch) SCE
FROM dbo.StateChangeEvent SCE
JOIN dbo.State S WITH(NOLOCK)
ON SCE.[StateId] = S.[StateId]
WHERE TimeGenerated < @GroomingThresholdUTC
END
ELSE
BEGIN
DELETE TOP (@Batch) SCE
FROM dbo.StateChangeEvent SCE
JOIN dbo.State S WITH(NOLOCK)
ON SCE.[StateId] = S.[StateId]
WHERE TimeGenerated < @GroomingThresholdUTC
AND S.[HealthState] in (0,1)
END
SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT
IF (@Err <> 0)
BEGIN
GOTO Error_Exit
END
END
/*
UPDATE dbo.PartitionAndGroomingSettings
SET GroomingRunTime = @TimeGroomingRan,
DataGroomedMaxTime = @MaxTimeGroomed
WHERE ObjectName = 'StateChangeEvent'
*/
SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT
IF (@Err <> 0)
BEGIN
GOTO Error_Exit
END
Success_Exit:
PRINT 'Success'
RETURN 0
Error_Exit:
PRINT 'Error'
RETURN 1
END
The data warehouse story
Sadly, this was not all: we had the same issue on the data warehouse side. Too many state change events in the raw table, so many that the standard aggregation would never complete, and tons of HealthServiceModules/data warehouse errors in the event log with event IDs 31552, 31553 and so on.
What follows is fairly generic and can be applied to any standard dataset maintenance error. We wanted to be sure, so we approached the whole thing from scratch, even though we were confident the issue was related to the state change storm we had.
First, check exactly which dataset is involved with this simple query (replace the GUID with the one reported in the health context; in our case '{821BEE3F-BE84-51F1-49C2-C87D93494D42}'):
select DS.SchemaName, DS.LastOptimizationActionDateTime, DS.LastOptimizationActionSuccessfulCompletionDateTime, SDAS.BaseTableName, SDAS.AggregationTypeId, SDA.LastGroomingDateTime, SDA.BuildAggregationStoredProcedureName, SDA.GroomStoredProcedureName
from dbo.StandardDataset DS
inner join dbo.ManagedEntity ME on ME.Name = DS.DatasetId
inner join dbo.StandardDatasetAggregation SDA on SDA.DatasetId=DS.DatasetId
inner join dbo.StandardDatasetAggregationStorage SDAS on SDAS.DatasetId=DS.DatasetId
where ManagedEntityGuid='{821BEE3F-BE84-51F1-49C2-C87D93494D42}'
Once you know for sure the dataset you can identify the bad data (in our case it was a state change issue, so we used the vStateRaw view to check for guilty monitors):
select top 10 ManagedEntityMonitorRowId, count(*) from State.vStateRaw
group by ManagedEntityMonitorRowId
order by COUNT(*) desc
You can then ask for more info on the monitors (you could as well run a single query with the appropriate joins). You should also try to get more information about the type of failure. The easiest way is to set the DebugLevel field (5 seems to log enough) in the StandardDataset table for the dataset you're interested in, then poll the DebugMessage table to see what's going on; or you can simply run the dataset maintenance manually and check the results. Remember to reset the DebugLevel field once you have collected enough data.
select * from dbo.vManagedEntityMonitor MEM
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=MEM.ManagedEntityRowId
where MEM.ManagedEntityMonitorRowId=511195
Or, in just one step:
select top 10 S.ManagedEntityMonitorRowId, M.MonitorDefaultName, Count(*) from State.vStateRaw S
inner join dbo.vManagedEntityMonitor MEM on MEM.ManagedEntityMonitorRowId=S.ManagedEntityMonitorRowId
inner join dbo.vMonitor M on M.MonitorRowId = MEM.MonitorRowId
group by S.ManagedEntityMonitorRowId, M.MonitorDefaultName
order by COUNT(*) desc
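For reference, bumping the debug level described above is just an update to the dataset's row in StandardDataset. The sketch below uses our dataset id as an example; as with everything else here, this is unsupported.

```sql
-- enable verbose dataset maintenance logging (remember to reset it)
UPDATE dbo.StandardDataset
SET DebugLevel = 5
WHERE DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'

-- then poll the debug messages while maintenance runs
SELECT * FROM dbo.DebugMessage

-- and reset once you have collected enough data
UPDATE dbo.StandardDataset
SET DebugLevel = 0
WHERE DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'
```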
Now that you have the troublesome ManagedEntityMonitorRowId(s) you can start, if you want, to delete the state change events from the raw tables, for example:
delete State.StateRaw_5E20669A64C84E06964070CA3670A1EA where ManagedEntityMonitorRowId=511195
You should repeat the process for every raw table involved (you could use a cursor, but since there were only a few we went manually). This deletes all the state change events for those monitors, which can be a little too much. Our approach was to identify what was going wrong, and when, and then selectively delete the bad data. Since our error was a timeout during aggregation (we got this info from the DebugMessage table), we could have taken a couple of different approaches:
- stop the RMS and manually execute dbo.StandardDatasetMaintenance (with the proper dataset id, in our case '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9') until all the aggregations complete successfully (see below). This is the lossless approach, but it forces a stop of your OpsMgr infrastructure.
- once the problematic time window has been identified, force the aggregation to be skipped by setting DirtyInd = 0 and AggregationCount = 9 in the StandardDatasetAggregationHistory table for the dataset and time window you want to skip. This skips the aggregation of the troublesome time window for *all* the monitors (probably not the way you want to go).
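In SQL terms the two options above look roughly like this (using our dataset id and time window as examples; an unsupported sketch, adapt the values to your case):

```sql
-- option 1: run the dataset maintenance manually (with the RMS stopped)
EXEC dbo.StandardDatasetMaintenance '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'

-- option 2: mark a time window as done, skipping its aggregation for ALL monitors
UPDATE dbo.StandardDatasetAggregationHistory
SET DirtyInd = 0, AggregationCount = 9
WHERE DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9'
AND AggregationDateTime = '2010-06-09 20:00:00.000' -- the window to skip
```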
All in all, we decided to selectively delete the bad data, and this is the approach I recommend (remember: not supported). First, check for the problematic time windows (or the oldest never-aggregated window, where the entire process is stuck):
-- oldest never aggregated (our dataset id was '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9')
SELECT TOP 1
StandardDatasetAggregationHistoryRowId
,ah.AggregationTypeId
,ah.AggregationDateTime
,DATEADD(minute, a.AggregationIntervalDurationMinutes, ah.AggregationDateTime)
FROM StandardDatasetAggregationHistory ah
JOIN StandardDatasetAggregation a ON ((a.DatasetId = ah.DatasetId) AND (a.AggregationTypeId = ah.AggregationTypeId))
WHERE (ah.DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9')
AND (DATEADD(minute, a.AggregationIntervalDurationMinutes + a.AggregationStartDelayMinutes, ah.AggregationDateTime) < GETUTCDATE()) -- aggregation interval + slight delay is in the past
AND (DirtyInd = 1)
AND (ah.AggregationCount = 0) -- interval was never aggregated
ORDER BY ah.AggregationTypeId ASC, ah.AggregationDateTime ASC
Then check for problematic monitors in that time window (in our example, 20:00 to 23:00):
select top 10 S.ManagedEntityMonitorRowId, M.MonitorDefaultName, Count(*) from State.vStateRaw S
inner join dbo.vManagedEntityMonitor MEM on MEM.ManagedEntityMonitorRowId=S.ManagedEntityMonitorRowId
inner join dbo.vMonitor M on M.MonitorRowId = MEM.MonitorRowId
where DateTime between '2010-06-09 20:00:00.000' AND '2010-06-09 23:00:00.000'
group by S.ManagedEntityMonitorRowId, M.MonitorDefaultName
order by COUNT(*) desc
And finally, if appropriate, delete the data for the guilty monitors in that timeframe:
delete State.StateRaw_5E20669A64C84E06964070CA3670A1EA
where ManagedEntityMonitorRowId in (511194,511192,511191)
and DateTime between '2010-06-09 20:00:00.000' AND '2010-06-09 21:00:00.000'
Another query you may find interesting is the following, which identifies the oldest dirty aggregation (a dirty aggregation is, roughly, an aggregation that still needs processing):
-- oldest dirty
SELECT TOP 1
StandardDatasetAggregationHistoryRowId
,ah.AggregationTypeId
,ah.AggregationDateTime
,DATEADD(minute, a.AggregationIntervalDurationMinutes, ah.AggregationDateTime)
FROM StandardDatasetAggregationHistory ah
JOIN StandardDatasetAggregation a ON ((a.DatasetId = ah.DatasetId) AND (a.AggregationTypeId = ah.AggregationTypeId))
WHERE (ah.DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9')
AND (DATEADD(minute, a.AggregationIntervalDurationMinutes + a.AggregationStartDelayMinutes, ah.AggregationDateTime) < GETUTCDATE()) -- aggregation interval + slight delay is in the past
AND (DATEADD(minute, a.AggregationStartDelayMinutes, ISNULL(ah.DataLastReceivedDateTime, DATEADD(minute, a.AggregationIntervalDurationMinutes, ah.AggregationDateTime))) < GETUTCDATE()) -- data last received time + slight delay is in the past
AND (DirtyInd = 1)
AND (NOT EXISTS (SELECT *
FROM StandardDatasetAggregationHistory
WHERE (DatasetId = '1971E025-D6A7-4D1A-8E1A-E1F294CA44F9')
AND (AggregationTypeId = ah.AggregationTypeId)
AND (DirtyInd = 1)
AND (AggregationDateTime < ah.AggregationDateTime)
)
) -- this is the oldest aggregation which is dirty
ORDER BY ah.AggregationTypeId ASC, ah.AggregationDateTime ASC
Conclusions
That's it. All in all it took half a day to get rid of this nasty issue. The troubleshooting steps can be applied to the alert, event and performance datasets as well. I hope you never have such a problem, but if you do, these steps should help you recover your OpsMgr infrastructure.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
#1 by claudiog64 on December 13, 2012 - 10:20 pm
Daniele, as always .. great stuff !
:-)
#2 by HC on December 14, 2010 - 9:18 am
We have had a big problem running “StandardDatasetMaintenance” manually, and have created a Microsoft case – but it ended up that this fantastic blog solved the problem for us.
Thank you very much!!!!!
I have explained our problem and how we solved it on this blog http://blogs.technet.com/b/kevinholman/archive/2010/08/30/the-31552-event-or-why-is-my-data-warehouse-server-consuming-so-much-cpu.aspx
Please see my comment (HC 14 Dec 2010 12:14 AM) in the bottom.
Again thank you a lot for the help!!
#3 by Jonathan Almquist on October 1, 2010 - 4:41 pm
Very good, Daniele. Not sure how I managed to miss this post back in June. I happened to follow the bread crumbs from your latest EX2010 post and stumbled across this one. This goes beyond the standard state change event mitigation steps that I follow (haven’t groomed the DW previously), but I can see why we would want to do this, especially in the case of frequent state change storms.