A tale of Identity and bad agent behavior


A couple of weeks ago my operations folks notified me of a strange new alert they needed to deal with. When I took a look at the RMS I understood a new debugging session was about to start:

Alert: Data Warehouse object health state data writer process failed to perform maintenance operation

Source:

Path:

Last modified by: System

Last modified time: 5/19/2011 8:22:37 AM

Alert description: Data Warehouse object health state data writer process failed to perform maintenance operation. Data was written to the Data Warehouse staging area but processing failed on one of the subsequent operations.

Exception ‘SqlException’: Sql execution failed. Error 777971002, Level 16, State 1, Procedure StandardDatasetAggregate, Line 424, Message: Sql execution failed. Error 777971002, Level 16, State 1, Procedure StateAggregate, Line 341, Message: Sql execution failed. Error 8115, Level 16, State 1, Procedure -, Line 1, Message: Arithmetic overflow error converting IDENTITY to data type int.

This alert was hiding at least a couple of issues:

  1. one of the state tables was doomed: the identity field had been declared as an int, but it has now grown to the point where that precision is no longer enough
  2. something must have gone wrong with state messages for us to get to that point

It was clear that item 2 had to be solved before tackling item 1, since it is the root cause of the issue. Because this can become a common problem, I'm sharing the troubleshooting steps needed to recover the data warehouse dataset aggregation processing.

Getting to the right table

The first step is getting to the right table, starting from the info reported in the alert. I'll show you the complete path, even if, with some approximation, you can get to the right table simply by using the second query.

select * from dbo.StandardDatasetAggregation SDA
inner join StandardDataSet SDS on SDS.DataSetId=SDA.DataSetId
where SDA.BuildAggregationStoredprocedureName = 'StateAggregate'

What’s really interesting in the results returned by the above query is the DatasetId: with this information you can query the StandardDatasetAggregationHistory table to get the first failing aggregation for that specific dataset. Note – it is not guaranteed that the first aggregation is the one experiencing the identity issue, but it’s highly probable. If it is not, you must repeat the process for every aggregation/partition until you find the troublesome one.

select * from dbo.StandardDatasetAggregationHistory SDA
inner join dbo.StandardDataset SD on SD.DatasetId=SDA.DatasetId
where DirtyInd=1 and SDA.DataSetId='ID returned by previous query'
order by AggregationDateTime ASC

 

You must take note of the first row returned and check its aggregation type and aggregation datetime.

select * from dbo.StandardDatasetTableMap
where DatasetId='ID returned by previous query'
and AggregationTypeId=[Type returned] --typically 20
and 'AggregationDateTime returned' between StartDateTime and EndDateTime

If no row is returned then the troublesome table is the current one and you should query for

Select * from dbo.StandardDatasetTableMap
where DatasetId='ID returned by previous query'
and AggregationTypeId=[Typereturned] --typically 20
and InsertInd=1
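
If you prefer to let SQL assemble the partition name for you, the following sketch joins the table map to the dataset and aggregation storage metadata. It assumes the StandardDatasetAggregationStorage.BaseTableName and StandardDatasetTableMap.TableNameSuffix columns, so double check the column names against your data warehouse version before relying on it.

select SD.SchemaName + '.' + SDAS.BaseTableName + '_' + STM.TableNameSuffix as PartitionTable
from dbo.StandardDatasetTableMap STM
inner join dbo.StandardDataset SD on SD.DatasetId=STM.DatasetId
inner join dbo.StandardDatasetAggregationStorage SDAS on SDAS.DatasetId=STM.DatasetId and SDAS.AggregationTypeId=STM.AggregationTypeId
where STM.DatasetId='ID returned by previous query'
and STM.AggregationTypeId=[Type returned] --typically 20
and STM.InsertInd=1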

 

Now you have the dataset and the table suffix, so you can build the actual partition name and then check whether it really is the bad one. In my case I had an issue with the state dataset on the hourly aggregation (20), so my check was:

select MAX(StateHourlyRowId) from [State].[StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD]

If it returns something near 2^31-1 (2,147,483,647), then this is the table (hopefully the only one).
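
To make sure no other partition is in the same situation, you can sweep all the int identity columns in the data warehouse and see how close each one is to the limit. This is a minimal sketch based on the sys.identity_columns catalog view (available on SQL Server 2005 and later), not something specific to OpsMgr:

-- list int identity columns sorted by how little headroom they have left
select OBJECT_SCHEMA_NAME(ic.object_id) + '.' + OBJECT_NAME(ic.object_id) as TableName,
    ic.name as ColumnName,
    cast(ic.last_value as bigint) as LastValue,
    2147483647 - cast(ic.last_value as bigint) as RemainingValues
from sys.identity_columns ic
inner join sys.types t on t.user_type_id = ic.user_type_id
where t.name = 'int' and ic.last_value is not null
order by RemainingValues asc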

Checking what was going wrong

Now that we have the dataset and the table, we must first understand what’s going wrong. Has an agent submitted too much data in too short a time, so that the standard storage management didn’t have time to allocate a new table fast enough? Is there some kind of data inconsistency causing this behavior? Something else?

In my case there was an agent continuously submitting out-of-date state data. This caused the aggregation to run repeatedly until the identity column reached the maximum value for an int. You can check for this scenario using the query on the StandardDatasetAggregationHistory table: if the DataLastReceivedDateTime for the first aggregation is now (or very recent) while the aggregation date time is a few days in the past, then that’s it.
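
For convenience, here is a variation of the earlier history query that puts the two timestamps side by side; it only uses columns already mentioned above, but as always verify it against your own schema:

select SDA.AggregationTypeId, SDA.AggregationDateTime, SDA.DataLastReceivedDateTime, SDA.DirtyInd
from dbo.StandardDatasetAggregationHistory SDA
where SDA.DatasetId='ID returned by previous query' and SDA.DirtyInd=1
order by SDA.AggregationDateTime
-- a dirty aggregation whose AggregationDateTime is days old but whose
-- DataLastReceivedDateTime keeps moving forward is the telltale sign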

The next step is to find the bad agent. Unfortunately the raw state table in the data warehouse doesn’t report the submission time, so we must switch to the live (operational) database.

select * from dbo.StateChangeEvent SCE
inner join State S on S.StateId=SCE.StateId
inner join BaseManagedEntity BME on BME.BaseManagedEntityId=S.BaseManagedEntityId
where TimeGenerated < DATEADD(d,-1, getutcdate()) and TimeAdded > DATEADD(hh,-1, getutcdate())
order by TimeAdded DESC

 

The query checks for state changes submitted in the last hour that refer to events more than one day old. This should not happen, and in the result list you can find the managed entity responsible for the data.

In my case I solved the issue by resetting (deleting) the local agent cache. Obviously better data checks in the core product would have avoided this issue and should have raised an alert about the agent submitting outdated data.

Resuming dataset aggregation

Once you have removed the root cause, you must recover the overflowed table so that the aggregation process can resume and complete successfully. This is a normal T-SQL reseed operation you can find many references on; anyway, I’ll post the step-by-step statements. The overall process is:

  1. rename the bad data table; from this point on the aggregations will fail with a table not found error
  2. create a temporary table with the same structure (use a select into statement to be sure); it must have a different name from the original one so that the aggregation workflow cannot step in until we’re done
  3. truncate the original table and reseed it
  4. insert the saved data back into the original table
  5. drop the temporary table
  6. rename the table back to its original name

Remember – the following statements will modify the data in your OpsMgr DBs, so you’re using them at your own risk. They’re not supported unless you’re instructed to use them by MS PSS/CSS.

I will use my table name (State.StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD) in the following statements; you must replace it with your own table name.

exec sp_rename 'State.StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD', 'StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD_'
select * into State.StateHourlyTemp from State.StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD_
truncate table State.StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD_
dbcc checkident ('[State].[StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD_]', RESEED, 1)

insert into State.StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD_
    (DateTime, ManagedEntityMonitorRowId, InRedStateMilliseconds, InYellowStateMilliseconds,
    InDisabledStateMilliseconds, InPlannedMaintenanceMilliseconds, InUnplannedMaintenanceMilliseconds,
    HealthServiceUnavailableMilliseconds, IntervalEndHealthState)
select DateTime, ManagedEntityMonitorRowId, InRedStateMilliseconds, InYellowStateMilliseconds,
    InDisabledStateMilliseconds, InPlannedMaintenanceMilliseconds, InUnplannedMaintenanceMilliseconds,
    HealthServiceUnavailableMilliseconds, IntervalEndHealthState
from State.StateHourlyTemp

/*
-- If the table is huge and you want to stage the insertions you can use the following
-- mechanism instead of the single INSERT above. Both statements are ordered by
-- StateHourlyRowId so that each batch inserts and then deletes exactly the same rows.
WHILE (SELECT COUNT(*) FROM State.StateHourlyTemp) > 0
BEGIN
INSERT INTO State.StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD_
    (DateTime, ManagedEntityMonitorRowId, InRedStateMilliseconds, InYellowStateMilliseconds,
    InDisabledStateMilliseconds, InPlannedMaintenanceMilliseconds, InUnplannedMaintenanceMilliseconds,
    HealthServiceUnavailableMilliseconds, IntervalEndHealthState)
SELECT TOP (10000)
    DateTime, ManagedEntityMonitorRowId, InRedStateMilliseconds, InYellowStateMilliseconds,
    InDisabledStateMilliseconds, InPlannedMaintenanceMilliseconds, InUnplannedMaintenanceMilliseconds,
    HealthServiceUnavailableMilliseconds, IntervalEndHealthState
FROM State.StateHourlyTemp
ORDER BY StateHourlyRowId
DELETE FROM State.StateHourlyTemp
WHERE StateHourlyRowId IN
    (SELECT TOP (10000) StateHourlyRowId FROM State.StateHourlyTemp ORDER BY StateHourlyRowId)
END */
drop table State.StateHourlyTemp
exec sp_rename 'State.StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD_', 'StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD'
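
Before letting the aggregation run again it’s worth a quick sanity check that the reseed worked; a minimal verification, again using my table name, could be:

-- report (without changing) the current identity value; it should be small again
dbcc checkident ('[State].[StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD]', NORESEED)
-- the highest row id should now be roughly equal to the number of rows in the table
select MAX(StateHourlyRowId) as MaxRowId, COUNT(*) as RowCnt
from [State].[StateHourly_E0C7AD12B1034074ABE7DF97B31D4AFD]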

 

Once you’ve finished reseeding the table, the aggregation process will resume and slowly catch up. If you want to give it a boost you can use the following loop; once again it is not supported and it can have a performance impact on your infrastructure:

while (select COUNT(*) from dbo.StandardDatasetAggregationHistory SDA
inner join dbo.StandardDataset SD on SD.DatasetId=SDA.DatasetId
where DirtyInd=1 and SDA.DataSetId='dataset GUID' and SDA.AggregationDateTime <= DateAdd(d,-1,getutcdate()) )>0
begin
    exec StandardDatasetAggregate 'dataset GUID'
    waitfor delay '00:01'
end
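
While it catches up you can keep an eye on the backlog with a simple count of the remaining dirty aggregations per aggregation type; this only uses the columns already queried above:

select SDA.AggregationTypeId, COUNT(*) as OutstandingAggregations,
    MIN(SDA.AggregationDateTime) as OldestDirtyAggregation
from dbo.StandardDatasetAggregationHistory SDA
where SDA.DirtyInd=1 and SDA.DatasetId='dataset GUID'
group by SDA.AggregationTypeId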

 

References

SCOM Data Warehouse grooming

Troubleshooting StateChangeEvent growth

Troubleshooting Exchange 2010 MP aggregations

– Daniele

This posting is provided “AS IS” with no warranties, and confers no rights.
