How to boost the performance widget performance (#sysctr #scom)


** 25th July 2013 UPDATE ** the issue has been fixed with Update Rollup 3, so happy dashboarding **

Dashboards are a new UI addition introduced with System Center 2012 Operations Manager (OpsMgr), the goal was to build a complete new UI with same look, feel and behavior on every console (fat, web and sharepoint). The goal was right it’s just the implementation that falls short.

Anyone who tried the dashboards outside a lab or poc environment experimented with the sluggishness of the performance related views (widget). While attending the MVP Summit (virtually) I took a couple of hours off to drill through the issue. My aim was to give a task based dashboard with the basic performance indicators:

clip_image002

The dashboard included:

          Cpu usage

          Memory usage

          The response time for all disks

          The bandwidth usage for all NICs

Alas, the dashboard was so slow to be unusable and sometimes some counters weren’t displayed at all.

To begin my troubleshooting effort I started from the MP definition (Microsoft.SystemCenter.Visualization.Library) to rapidly get to the SQL Server side: in fact all the data interaction is performed with the Data warehouse using newly defined stored procedures with the SDK schema/prefix. Time for a good SQL Profiler session, I had all the information I needed to setup a proper trace. What I found has been astonishing, tens even hundreds of calls like the following for every dashboard refresh:

exec SDK.Microsoft_SystemCenter_Visualization_Library_SinglePerformanceDataSeriesGet@ManagementGroup=‘6934C4FC-5C84-2C84-C5A0-88726754720D’,@StartTime=‘2013-02-17 14:57:26.730’,@EndTime=‘2013-02-20 14:57:26.730’,@ManagedEntityGuid=‘CD4CBEB2-85A0-0F82-DFC3-12186EDB7F5A’,@PerfRuleInstanceRowId=132633,@NumberOfDataPoints=100,@RequestedDataPointType=4

It was clear something was wrong (badly wrong), not only the perf widget is slow it taxes the SQL engine as well. At the beginning of the trace another stored procedure was called, this time 4 times with different parameters: clearly once for every widget in the dashboard.

exec SDK.Microsoft_SystemCenter_Visualization_Library_PerformanceCounterListByManagedEntityUsingContainerME@ManagementGroup=‘6934C4FC-5C84-2C84-C5A0-88726754720D’,@ContainerManagedEntityXml=N'<ManagedEntityIds>

  <ManagedEntityId>cd4cbeb2-85a0-0f82-dfc3-12186edb7f5a</ManagedEntityId>

</ManagedEntityIds>’,@ObjectNamePattern=N’Network Adapter’,@CounterNamePattern=N’PercentBandwidthUsedTotal’,@InstanceNamePattern=N’%’

Trying to execute each one of these showed the culprit:

          The memory one returned just one row in no time

          The CPU two rows

          The network about 200 rows in 30”

          The disc about 5100 rows in more than 3’. The dashboard are supposed to refresh every 60”, ahem.

What was happening is that the stored procedure returned one row for every instance (and rule, more on this later) of the requested performance counter. The dashboard logic in turn, for every single row, executed the SinglePerformanceDataSeriesGet, which retuned data only for the instances effectively present on the computer targeted by the dashboard (to be precise present on the targeted managed entity). This is what I call a bad (very bad) design.

Time for some serious TSQL code review for SDK.Microsoft_SystemCenter_Visualization_Library_PerformanceCounterListByManagedEntityUsingContainerME.

The results of the review are embarrassing, in summary:

          Management pack versioning was not considered so multiple performance rule instances are returned for the same counter. This happens when the MP defining the rule has been upgraded one or more time, quite a common scenario in a production environment.

          Managed entity life time was not considered, so dismissed managed entity were taken into account, contributing to execution time.

          All the performance rule instances were retuned even if such instance doesn’t exist on the target managed entity

          Lastly the logic never consider the chance to have multiple collecting rules for the same performance counter

Not bad for a single stored procedure.

Before going further, I want to make clear the scope of post: I just modified the stored procedure needed for my specific goal, there are other similar stored procedures that, I guess, suffer of the same issues. I’m going to explain how I modified the stored procedure, if you want to go this way you just remember it is at your own risk.

Lastly I won’t post the complete TSQL, you must go through the process of modifying the stored procedure by yourself. Remember any update to the MP will overwrite the stored procedure.

Step 1. Create a script for ALTER of the stored procedure

clip_image004

Step 2. In SQL management studio rename the original stored procedure so that you can restore it if needed.

clip_image006

Step 3. Ready to modify the script the following are the required mod, they are all tagged with [QND] with a short description.

The first correction is to the statement that populates all the possible targets for the given performance counter, I added the check on validity and removed the explosion for all the instances

TSQL

— Populate this table with the target types of the matching rules.

        — [QND] it’s too expesnive to include all the selected instances here so let’s skip it

        — first get the MG Row Id

        Declare @MGRowId int

        Select @MGRowId = MG.ManagementGroupRowId from dbo.ManagementGroup MG where MG.ManagementGroupGuid=@ManagementGroup

       

        INSERT INTO #RuleTargetTypesAndTheirBaseTypes

        SELECT RMV.TargetManagedEntityTypeRowId, PR.ObjectName, PR.CounterName, — [QND] PRI.InstanceName

                    , PR.RuleRowId — [QND] this will be used later to select proper instances PRI.PerformanceRuleInstanceRowId

                    , METMPV.AbstractInd

        FROM vPerformanceRule PR

        — [QND] JOIN vPerformanceRuleInstance PRI ON PR.RuleRowId = PRI.RuleRowId

        JOIN RuleManagementPackVersion RMV ON RMV.RuleRowId = PR.RuleRowId

            — [QND] added to take into account only valid rules i.e. not deleted / upgraded

                    AND EXISTS(select ManagementPackVersionRowId from dbo.ManagementGroupManagementPackVersion M2 where M2.ManagementPackVersionRowId=RMV.ManagementPackVersionRowId AND M2.LatestVersionInd=1)

 

        JOIN ManagedEntityTypeManagementPackVersion METMPV ON METMPV.ManagedEntityTypeRowId = RMV.TargetManagedEntityTypeRowId

        — [QND] added chck on MP validity for TARGET

        JOIN dbo.ManagementGroupManagementPackVersion MGMPV ON MGMPV.ManagementPackVersionRowId=METMPV.ManagementPackVersionRowId AND LatestVersionInd=1 AND MGMPV.ManagementGroupRowId=@MGRowId

       

        WHERE ((PR.ObjectName LIKE @ObjectNamePattern) AND

                        (PR.CounterName LIKE @CounterNamePattern) 

                        — [QND] AND (PRI.InstanceName LIKE @InstanceNamePattern)

                        )

The second attack point is at the final select where we need to perform several task:

          Filter only valid instances for the targets

          Remove duplicated collection rules

          Consider only valid objects just in case something has skipped from previous queries

TSQL

        — Select the appropriate columns from the below tables.

        — [QND] here things get tricky becasue there’s no metadata to join the performance rule instance to the managed entity the only place we have that data is in the performance data itself

        — so as a first measure we need to look for a match in one of the performance views looking back some x hours, Perf.PerfRaw seems a logical choice but it has a performance hit on the query

        — I choose to use the PerfHourly and lookup back up to 4 hours, if I have no data in the lst four haours I must assume the performance rule instance is not valid for the given ME

        — Second issue is we can have more than one rule for the same performance counter, so we must pick just one of them or we end with multiple data for the same instance in our dashbaord

        — the best choice would be to look for how many data points we have for each rule and chose the one with the max number of data points: more data point more precision.

        — This is far too expensive in performance hit, so I just chose to take the latest added rule using the MAX function

        — Third we want just valid objects so I added the TME.ToDateTime Is NULL clause

        SELECT DISTINCT #RTTT.ObjectName, #RTTT.CounterName

                                                      , PRI.InstanceName — [QND] #RTTT.InstanceName

                                                      , MAX(PRI.PerformanceRuleInstanceRowId)‘PerformanceRuleInstanceRowId’ — [QND] #RTTT.PerformanceRuleInstanceRowId 

                                ,ME.ManagedEntityGuid, ME.Path, ME.DisplayName

        FROM vManagedEntity ME

        JOIN #ContainedEntitiesTable #CET ON #CET.ContainedEntityRowId = ME.ManagedEntityRowId

        JOIN TypedManagedEntity TME ON ME.ManagedEntityRowId = TME.ManagedEntityRowId AND TME.ToDateTime Is NULL — [QND] mod we just want valid entities

        JOIN vManagementGroup AS VMG ON ME.ManagementGroupRowId = VMG.ManagementGroupRowId

        JOIN #RuleTargetTypesAndTheirBaseTypes #RTTT ON #RTTT.TypeRowId = TME.ManagedEntityTypeRowId

         — [QND] modded to get the correct Instance

         JOIN vPerformanceRuleInstance PRI ON #RTTT.PerformanceRuleInstanceRowId = PRI.RuleRowId AND(PRI.InstanceName LIKE @InstanceNamePattern)

 

        WHERE VMG.ManagementGroupGuid = @ManagementGroup

        — [QND] now filter out all non relevant instances using the POerfHourly table lookinup the latest 4 hours

                    AND EXISTS

             (Select Per.PerformanceRuleInstanceRowId FROM Perf.vPerfHourly PeR

                    WHERE PeR.PerformanceRuleInstanceRowId=PRI.PerformanceRuleInstanceRowId AND PeR.ManagedEntityRowId=#CET.ContainedEntityRowId AND Per.DateTime > DATEADD(HOUR,-4,GETUTCDATE()))

             group by  #RTTT.ObjectName, #RTTT.CounterName, PRI.InstanceName,ME.ManagedEntityGuid, ME.Path, ME.DisplayName

The net result is perf widgets dashboard have acceptable performances, now. For example the selection of disk instances who returned more than 5K rows in more than 3’ now for a specific server return 4 rows (one for each disk) in 5”, in turn this generates 4 stored procedure calls instead of the 5K of the previous implementation.

With such a design is no mystery (now) what performance dashboards are so slow.

– Daniele

This posting is provided “AS IS” with no warranties, and confers no rights.

Advertisements
  1. #1 by Satya on July 24, 2013 - 11:46 pm

    Daniele Grandini thank you for identifying the root cause of the performance widget performance issue. We took your feedback and suggestions and optimized the performance widget query. The fix is included in System Center 2012 SP1 UR3 which can be downloaded here http://support.microsoft.com/kb/2836751

    I hope your blog is now updated to say the product team incorporated your feedback in quick time :)

  2. #2 by Stephen Hull on May 14, 2013 - 7:14 pm

    Fantastic! What a great post! I am going to dig deeper into this as well.

  3. #3 by Danny McDermottDanny McDermott on March 8, 2013 - 7:28 am

    Great analysis!
    @Jose, I had a PSS call open with Microsoft regarding not returning the path in the Top N widget. They have said it is fixed in SP1 UR2. Hopefully it makes it though testing.

  4. #4 by Jon Sykes on March 7, 2013 - 3:18 pm

    Great article, have you submitted this back to the development team or through connect.microsoft.com? Microsoft dev. teams need to understand their shortcomings, just like any other developer. I recommend forwarding this information if you haven’t already. End user performance has a direct impact on the usability and confidence in a product.

    • #5 by Daniele Grandini on March 7, 2013 - 6:32 pm

      Well said Jon, and yes the team is aware of the performance issues with dashboards… hopefully they’ll find some resources to throw into it. If you have any spare CSS ticket it would be of great help to open a CSS incident, if they have many the dashboard priority is likely boosted.

      Thanks for reading my blog and commenting.

      Daniele

  5. #6 by Jose Fehse on March 4, 2013 - 4:19 pm

    Hi Daniele, thank you for that. I’ll sure give it a try, since I haven’t had anything but frustration with the new widgets. Some are still presenting issues, like the Objects by performance, which will show weird data or not show anything at all, like the path column, for instance.

    Thank you,

    Jose Fehse

    • #7 by Danny McDermottDanny McDermott on March 8, 2013 - 7:33 am

      Hi Jose, I had a case open with Microsoft PSS with regards to the path not being presented in the Top N perf widget. I have been told it will be fixed in SP1 UR2, pending successful testing, due for release in April. It was supposed to have been implemented in SP1, but got pulled for some reason.

  6. #8 by Jonathan Almquist on February 21, 2013 - 3:47 pm

    Thanks for posting, Daniele! I’ve seen these types of problems happening under the hood with other database operations, too. I’d have to dig up my notes, but helped a customer trace common views in the past and found the default view definitions pull way more data than is necessary to draw the view. Now, I always manually remove elements in view XML before releasing to production.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: