Quae Nocent Docent

What hurts, teaches – Ordinary tales from management trenches

Disk performance reporting

leave a comment »

Disk performance reporting and trending is probably the most difficult part of performance troubleshooting and capacity planning and trending. During the years I used several counters just to understand none of them can give a synthetic and accurate answer on disk performance. Ever tried to user % idle time or % disk time? Or Avg queue length if that matters. In the last few years I standardized on Avg Disk Sec / read / Write / Transfer as a single indicator of disk responsiveness. Generally I take 20ms to 30ms as a threshold should not  be exceeded on average.

However, even these counters are prone to errors (I don’t know when the OS team will change the perf counters architecture, but it will never be too early). First of all you must be aware of the following issues that arise with virtualized Windows 2003 servers with more than one core:

But issues are there to hit on Windows 2008 as well. Look at the following table that reports Avg Disk sec / Transfer on a Windows 2008 server:

image

As you can see, among “normal” values (0,xxx , is the decimal separator in Italy), we have clearly bad ones. It is obvious the server is not taking 93” or 3139” seconds (little less than 1 hour) to execute an I/O on average. The presence of these bad values can wrack havoc your reporting experience, the average of the above values is 285”, now it is clear this cannot be the case.

I didn’t find a root cause for this behavior. I can observe it on several Windows 2008 servers with a predominance of hyper-v hosts, it can be hardware related or just a bug in the perf counter (immo both), in any case your reports are doomed.

The only thing I can advice on is to change your SQL query to filter out obviously bad values. For example filtering out response time above 2 seconds, changes my average on the period to 0.029 seconds or 29 msec that denotes a fairy busy storage subsystem.

If I manage to find more info on this issue I’ll keep you posted, in the meantime take your reports on disk response time with a grain of salt.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

October 31, 2009 at 12:11 pm

Posted in Bug, Reporting, SCOM

Leave a Reply