The importance of trends in performance monitoring


In my experience, performance monitoring is always tricky. Unless you reach certain knee points (CPU usage over 80% or disk utilization over 70%, to mention a few), performance is subjective: a 5-second response time that is acceptable to one user can be unacceptable to another.

Nevertheless, performance monitoring can be the key to catching potential issues before they impact the business. We developed trend reports based on the tons of performance data OpsMgr collects, and we use them as pre-alert notifications or, if you like, real proactive monitoring. There’s gold in the OpsMgr data warehouse. I want to share the latest example of how trend monitoring can help correct issues before they impact your business.

One of the key indicators we track is, obviously, CPU usage. We check for significant differences between the last week’s average and the averages for the last 2 weeks and the last month. We also check for "abnormal" CPU usage, which we define as something that is not yet an alert (for us, CPU over 80% for a sustained period) but is not common either (our reference today is over 50% usage as a daily average). Consider a runaway process with a single thread on a 4-core system: it will consume 25% of CPU time, so you won’t be alerted, but you still have an issue.
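To make the checks above concrete, here is a minimal sketch in Python. It is not our actual report, which runs against the OpsMgr data warehouse. The function names, the 1.5x trend ratio, and the choice of windows that all end on the most recent day are assumptions for illustration; only the 80% and 50% thresholds come from the text.

```python
from statistics import mean

ALERT_THRESHOLD = 80.0     # sustained CPU % that raises an alert (from the post)
ABNORMAL_THRESHOLD = 50.0  # daily-average CPU % we consider uncommon (from the post)
TREND_RATIO = 1.5          # hypothetical: how much higher than a baseline counts as "significant"


def classify_daily_average(cpu_pct):
    """Label a daily CPU average as 'alert', 'abnormal' or 'normal'."""
    if cpu_pct > ALERT_THRESHOLD:
        return "alert"
    if cpu_pct > ABNORMAL_THRESHOLD:
        return "abnormal"
    return "normal"


def single_thread_ceiling(cores):
    """Share of total CPU (in %) that one fully busy thread can consume."""
    return 100.0 / cores


def trend_flags(daily_averages, ratio=TREND_RATIO):
    """Compare the last week's average with the 2-week and 4-week averages.

    daily_averages holds one CPU % value per day, oldest first.
    Returns a dict telling, per baseline, whether last week is at least
    `ratio` times that baseline.
    """
    if len(daily_averages) < 28:
        raise ValueError("need at least 28 daily averages")
    last_week = mean(daily_averages[-7:])
    baselines = {
        "last_2_weeks": mean(daily_averages[-14:]),
        "last_month": mean(daily_averages[-28:]),
    }
    return {
        name: base > 0 and last_week >= ratio * base
        for name, base in baselines.items()
    }


if __name__ == "__main__":
    # Three quiet weeks at 7% followed by one week at 40% (made-up sample data).
    history = [7.0] * 21 + [40.0] * 7
    print(classify_daily_average(history[-1]))  # below both thresholds
    print(trend_flags(history))                 # trend check against both baselines
    print(single_thread_ceiling(4))             # the runaway-thread case
```

The point of the sketch is that a 40% daily average stays below both thresholds, so only the comparison against the longer baselines brings it to your attention.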

In last week’s report I got a bad CPU figure from a recently installed file system: we saw a huge increase in CPU usage compared to the previous week and month (from 7% to 40% on average). We didn’t receive any complaints from the customer for the previous week, so he’s not even aware of this behavior. Using the OpsMgr console task that lists the top CPU processes, we found that the DFSr service was using 30% to 40% of CPU.

Time for some more advanced troubleshooting with a quick procexp drill-through:

[Image: procexp screenshot]

Inside the DFSr service, the tracing threads were the top consumers; clearly this is not what I consider normal behavior. After a few KB queries we found a recommended fix for DFSr (http://support.microsoft.com/kb/979524/en-us). None of the symptoms applied to our situation, but the DFSr service had certainly been under stress from massive file changes, so we tried the fix and it worked.

This is just one example of how historical performance data collected in the OpsMgr data warehouse can be of tremendous help in fixing issues before they hit the users. I encourage everyone to review, on a fixed schedule, the key performance indicators for your apps and systems. You’ll find gold in your data warehouse, and you’ll be able to justify the cost of keeping it running.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.
