As you may recall, I blogged about an issue with disk performance counters on Windows Server 2008 and Windows Server 2008 R2. Basically, the LogicalDisk Avg. Disk sec/Transfer counter was pretty much unusable on those systems; see Disk performance reporting.
Obviously this matters only if you use that counter as a KPI for your disks. We do, because we think it is the first-stop counter: we used it with success on Windows 2003 and we were expecting to be able to use it on Windows 2008.
If you want to know if you’re affected by the same issue (please let me know if this is the case) you can run the following query against your Data Warehouse:
select DATEPART(dd, PH.DateTime) 'Day', COUNT(*) 'Bad hourly avgs'
from Perf.vPerfHourly PH
inner join dbo.vPerformanceRuleInstance PRI on PH.PerformanceRuleInstanceRowId=PRI.PerformanceRuleInstanceRowId
inner join dbo.vPerformanceRule PR on PRI.RuleRowId=PR.RuleRowId
inner join dbo.vRule R on PRI.RuleRowId=R.RuleRowId
inner join dbo.ManagedEntity ME on PH.ManagedEntityRowId=ME.ManagedEntityRowId
-- assumption: restrict the count to the counter discussed in this post
where PR.ObjectName = 'LogicalDisk' and PR.CounterName = 'Avg. Disk sec/Transfer'
-- last 7 days, hourly averages above 1 second
AND PH.DateTime > DATEADD(dd,-7,getutcdate())
AND PH.AverageValue > 1
group by DATEPART(dd, PH.DateTime)
The query returns, for each of the last 7 days, the number of hourly averages above 1 second of response time. Such an average is definitely improbable, since disks are supposed to respond in milliseconds, not seconds.
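For readers who prefer to see the counting logic outside SQL, here is a minimal Python sketch of what the query computes. The sample rows and the `bad_hourly_avgs_per_day` helper are hypothetical, invented for illustration; they are not taken from a real Data Warehouse.

```python
from collections import Counter
from datetime import datetime, timedelta

THRESHOLD_SECONDS = 1.0  # same cutoff as "AverageValue > 1" in the query


def bad_hourly_avgs_per_day(rows, now, days=7):
    """Count hourly averages above the threshold, grouped by day of month.

    rows: iterable of (utc_datetime, average_value_in_seconds) tuples.
    """
    cutoff = now - timedelta(days=days)
    counts = Counter()
    for when, average in rows:
        if when > cutoff and average > THRESHOLD_SECONDS:
            counts[when.day] += 1  # DATEPART(dd, ...) is the day of month
    return dict(counts)


# Hypothetical sample data, invented for this example
now = datetime(2010, 3, 20, 12, 0)
rows = [
    (datetime(2010, 3, 18, 9, 0), 0.012),   # healthy: 12 ms
    (datetime(2010, 3, 18, 10, 0), 35.4),   # bad: 35 seconds
    (datetime(2010, 3, 19, 2, 0), 4.2),     # bad: 4.2 seconds
    (datetime(2010, 3, 1, 2, 0), 9.9),      # bad, but older than 7 days
]
print(bad_hourly_avgs_per_day(rows, now))
```

Only the two recent samples above one second are counted; the healthy sample and the one outside the 7-day window are ignored.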
Last week I eventually found the time to give this issue a closer look. What I found is not a definitive solution, but I got close to something usable, even if not in the way I was expecting. My first check was to discriminate between a performance counter issue and an OpsMgr issue. I started collecting performance counters with perfmon every 30 seconds on the most problematic systems while watching what was being collected by OpsMgr. To my surprise I wasn’t able to reproduce the issue, even though those systems were pretty faulty in terms of data collected. After a few days of trial and error on different systems I started to suspect the issue could be frequency related: the OpsMgr rule polls the counter every 5 minutes, while I had scheduled perfmon to collect data every 30 seconds.
First I tried to simulate the behavior with perfmon by setting the collection interval to 5 minutes, but I didn’t get any evidence: no issues were detected on either side (perfmon or OpsMgr). Then I deployed a dumb rule that simply polls the perf counter every 30 seconds and then does nothing (just to keep the perf counter alive). After a few days I ran the previous query (the rule was deployed in the evening of day 17). As you can see, this simple polling reduces the number of bad averages by a factor of 100. The remaining ones are concentrated on a few DPM servers and on a handful of virtual machines. I can only suspect the spikes are related to VSS operations and host-based VM backups (but I don’t have the time to investigate further right now).
|Day||Bad hourly avgs|
In conclusion, implementing the poller rule makes the collected data much more reliable. If I wanted to get rid of the bad data completely, I could implement an intelligent datasource that strips it away, and indeed I implemented one with success in my lab. In my production environment, though, I prefer to use the standard rules and just add a tiny, simple poller.
If you have the same issue and want to give it a try, just create an empty management pack and define a custom WriteActionModule. It mimics the standard pass-through probe, so in effect it is not a write action at all since it doesn’t change anything, but rules need write actions, so give ’em a write action:
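The idea of stripping away bad data can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual lab datasource: the `MAX_PLAUSIBLE_SECONDS` cutoff and the `strip_bad_samples` and `hourly_average` names are hypothetical.

```python
MAX_PLAUSIBLE_SECONDS = 1.0  # assumed cutoff, mirroring the query's threshold


def strip_bad_samples(samples, limit=MAX_PLAUSIBLE_SECONDS):
    """Return only the samples between 0 and the limit (values in seconds)."""
    return [value for value in samples if 0 <= value <= limit]


def hourly_average(samples):
    """Average the samples, or return None when nothing usable is left."""
    return sum(samples) / len(samples) if samples else None


raw = [0.008, 0.011, 42.0, 0.009]  # hypothetical samples; 42.0 is a spike
clean = strip_bad_samples(raw)
print(hourly_average(raw))    # dominated by the single spike
print(hourly_average(clean))  # back in the millisecond range
```

A single implausible sample is enough to push the hourly average above one second, which is why dropping it before aggregation gives a usable value again.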
<WriteActionModuleType ID="QND.Library.PassThrough.WA" Accessibility="Public" Batching="false">
In the same MP you can then define the rule:
<Rule ID="QND.Windows.Server.2008.LogicalDisk.AvgDiskSecPerTransfer.Poller" Enabled="onEssentialMonitoring"
<DataSource ID="PerformanceDS" TypeID="SystemPerf!System.Performance.DataProvider">
<CounterName>Avg. Disk sec/Transfer</CounterName>
<WriteAction ID="Null" TypeID="QND.Library.PassThrough.WA" />
And that's all folks, now my reports are returning meaningful data once again.
This posting is provided "AS IS" with no warranties, and confers no rights.