Troubleshooting Exchange 2010 MP aggregations


Here we are once again, many of the errors we found in MPs are related to poor "real world" testing. I guess the tester has an isolated environment where she/he tests only the specific MP and not the interactions with other MPs. The Exchange 2010 MP follows this rule.

Since I installed the MP I started to have Error 31552 in the OpsMgr Event log on my RMS

Event Type:    Error
Event Source:    Health Service Modules
Event Category:    Data Warehouse
Event ID:    31552
Date:        9/30/2010
Time:        2:20:18 PM
User:        N/A
Computer:    XXXX
Description:
Failed to store data in the Data Warehouse.
Exception ‘SqlException’: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.

One or more workflows were affected by this. 

Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: Microsoft.Exchange.2010.Reports.Dataset.Availability
Instance ID: {93DE3BA6-9596-13CD-0B5F-4B86F1E52F50}
Management group: XXXX

Today I eventually find the time to give it a check, this is what I found.

I started my investigation following an old post from QND: Troubleshooting StateChangeEvent grow.

The case was very similar and after a few SQL queries it turned out aggregation for Exchange 2010 data was failing. As a first step I tried to manually run the aggregation stored procedure: Exchange2010.AvailabilityAggregate. It run for almost 2 hours to complete a one day aggregation, it is obvious it will timeout once executed inside the standard dataset maintenance workflow.

Since I just have one Exchange 2010 server in production something must clearly wrong, I started to tackle the stored procedure code to check where was the culprit. I save you all the stepping required to identify the issue. The stored procedure inserts all the state change events in a 24 hours time frame in a temporary table (the only aggregation available for Exchange is Daily) and then computes a complex query where the table is cross referenced 3 times. When I say all the state change events I mean exactly that, not just the Exchange related state changes, but all the changes in your Management Group. It happens than mine has on average between 200K to 2M state changes per day, of these just 100 to 200 are Exchange related (too much for a single server I know but there’s a nasty bug on CAS monitoring that flips flops, but this is another matter). So if you take 1M rows and then join them with another 1M rows and then filter on a query on 1M rows, I’m sure you can guess you could have some performance issues, don’t you?

image

This is where the SP selects all the State Change Events, as you can see the @ManagedEntityMonitorRowId parameter is null and then follows the big query with the multiple references to the #StateEvent table

image

The SP author here has been too confident in the SQL query optimizer, the human brain is still better. With a simple mod I’ve been able to reduce the execution time from almost 2 hours (this means it never completed inside the standard dataset maintenance workflow) to 40". The idea is simple, just select the relevant state changes and discard all the others, reducing the table of 4 order of magnitudes (from millions to hundreds).

The mod is pretty simple I tried to touch as few lines of code as possible, I just created an intermediate table with the relevant events using the same filtering of the query above.

image
I then substituted the new #StateEventF table to the aggregation query:

image

Obviously this is *totally unsupported by Microsoft* and you must do it at your own risk. I strongly recommend to rename the original stored procedure before trying to modify it.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  1. #1 by naveed on June 17, 2012 - 8:23 am

    Dear Dan,
    i am getting event id 2115 for exchange Workflow Id : Microsoft.Exchange.2010.Reports.Transport.Hub.ServerMailflowStatisticsNew.DataCollection
    Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
    Workflow Id : Microsoft.SystemCenter.CollectPublishedEntityState
    Workflow Id : Microsoft.SystemCenter.CollectPerformanceData
    Workflow Id : Microsoft.SystemCenter.CollectPublishedEntityState
    Workflow Id : Microsoft.SystemCenter.CollectDiscoveryData

    this event are are in night during scom maintanance

    Regards
    Naveed

  2. #2 by Blake Mengotto on October 1, 2010 - 3:35 pm

    Great catch Dan! There are other issues with this mp that you will and or have already probably discovered. I believe the are releasing an update soon with the TAP ended and SP1 releasing/released.

  1. Troubleshooting Exchange 2010 MP aggregations in Operations Manager 2012 « Quae Nocent Docent
  2. A tale of Identity and bad agent behavior « Quae Nocent Docent

Leave a reply to naveed Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.