Still fighting against bad performance data… but getting some results


As you may recall, I blogged about an issue with disk performance counters on Windows Server 2008 and Windows Server 2008 R2. Basically, the LogicalDisk Avg. Disk sec/Transfer counter was pretty much unusable on those systems; see Disk performance reporting.

Obviously this matters only if you use this counter as a KPI for your disks. We use it because we consider it the first counter to check: it served us well on Windows Server 2003, and we expected to be able to use it on Windows Server 2008 as well.

If you want to know if you’re affected by the same issue (please let me know if this is the case) you can run the following query against your Data Warehouse:

select DATEPART(dd, PH.DateTime) 'Day', COUNT(*) 'Bad hourly avgs'
from Perf.vPerfHourly PH
inner join dbo.vPerformanceRuleInstance PRI on PH.PerformanceRuleInstanceRowId = PRI.PerformanceRuleInstanceRowId
inner join dbo.vPerformanceRule PR on PRI.RuleRowId = PR.RuleRowId
inner join dbo.vRule R on PRI.RuleRowId = R.RuleRowId
inner join dbo.ManagedEntity ME on PH.ManagedEntityRowId = ME.ManagedEntityRowId
where R.RuleSystemName = 'Microsoft.Windows.Server.2008.LogicalDisk.AvgDiskSecPerTransfer.Collection'
and PH.DateTime > DATEADD(dd, -7, getutcdate())
and AverageValue > 1
group by DATEPART(dd, PH.DateTime)

The query returns the number of hourly averages above 1 second of response time per day over the last 7 days: a definitely improbable average, since disks are supposed to respond in milliseconds, not seconds.

Last week I eventually found the time to give this issue a closer look. What I found is not a definitive solution, but I got close to something usable, even if not in the way I was expecting. My first check was to discriminate between a performance counter issue and an OpsMgr issue. I started collecting performance counters with perfmon every 30 seconds on the most problematic systems, meanwhile watching what was being collected by OpsMgr. To my surprise, I wasn't able to reproduce the issue, even though those systems were pretty faulty in terms of the data collected. After a few days of trial and error on different systems, I started to suspect the issue could be frequency related: the OpsMgr rule polls the counter every 5 minutes, while I had scheduled perfmon to collect data every 30 seconds.
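
If you want to reproduce this kind of side-by-side collection, logman can create the perfmon collector from the command line; the collector name and output path below are just examples:

    rem Create a collector sampling Avg. Disk sec/Transfer on all logical disks every 30 seconds
    logman create counter DiskPerf30s -c "\LogicalDisk(*)\Avg. Disk sec/Transfer" -si 30 -f csv -o C:\PerfLogs\DiskPerf30s
    rem Start the collection
    logman start DiskPerf30s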

First I tried to simulate the behavior with perfmon, setting the collection interval to 5 minutes, but I didn't get any evidence: no issues detected on either side (perfmon or OpsMgr). Then I tried to deploy a dumb rule that simply polls the perf counter every 30 seconds and then does nothing, just to keep the perf counter alive. After a few days I ran the previous query (the rule was deployed in the evening of day 17); as you can see, this simple polling reduces the number of bad averages by a factor of 100. The remaining ones are concentrated on a few DPM servers and on a handful of virtual machines; I can only suspect those spikes are related to VSS operations and host-based VM backups (but I don't have the time to investigate further right now).

Day   Bad hourly avgs
10    1368
11    1347
12    1230
13    1291
14    1385
15    1318
16    1399
17    1144
18     104
19      60
20      90
21      76

In conclusion, implementing the poller rule makes the collected data much more reliable. If I wanted to completely get rid of the bad data, I could implement an intelligent data source that strips away the bad samples, and indeed I implemented one with success in my lab; but in my production environment I prefer to use the standard rules and just add a tiny, simple poller.
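
For the record, the filtering approach boils down to inserting a condition detection between the standard performance data source and the write actions. Here is a minimal sketch of such a filter, assuming a System.ExpressionFilter that drops any sample above 1 second; the module ID and the threshold are illustrative, not the exact modules I used:

      <ConditionDetection ID="QND.DropBadSamples.CD" TypeID="System!System.ExpressionFilter">
        <Expression>
          <SimpleExpression>
            <ValueExpression>
              <!-- the sampled value carried by System.Performance.Data items -->
              <XPathQuery Type="Double">Value</XPathQuery>
            </ValueExpression>
            <Operator>Less</Operator>
            <ValueExpression>
              <Value Type="Double">1</Value>
            </ValueExpression>
          </SimpleExpression>
        </Expression>
      </ConditionDetection>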

If you have the same issue and want to give it a try, just create an empty management pack and define a custom WriteActionModuleType. It mimics the standard pass-through probe and in effect is not a write action at all, since it doesn't change anything; but rules need write actions, so give 'em the write action:

      <WriteActionModuleType ID="QND.Library.PassThrough.WA" Accessibility="Public" Batching="false">
        <Configuration />
        <ModuleImplementation Isolation="Any">
          <Native>
            <ClassID>C6410789-C1BB-4AF1-B818-D01A5367781D</ClassID>
          </Native>
        </ModuleImplementation>
        <OutputType>System!System.BaseData</OutputType>
        <InputType>System!System.BaseData</InputType>
      </WriteActionModuleType>

 In the same MP you can then define the rule:

       <Rule ID="QND.Windows.Server.2008.LogicalDisk.AvgDiskSecPerTransfer.Poller" Enabled="onEssentialMonitoring"

            Target="Windows2008!Microsoft.Windows.Server.2008.OperatingSystem">

        <Category>PerformanceCollection</Category>

        <DataSources>

          <DataSource ID="PerformanceDS" TypeID="SystemPerf!System.Performance.DataProvider">

            <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>

            <CounterName>Avg. Disk sec/Transfer</CounterName>

            <ObjectName>LogicalDisk</ObjectName>

            <InstanceName></InstanceName>

            <AllInstances>true</AllInstances>

            <Frequency>30</Frequency>

          </DataSource>

        </DataSources>

        <WriteActions>

          <WriteAction ID="Null" TypeID="QND.Library.PassThrough.WA" />

        </WriteActions>

      </Rule>
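
Keep in mind the module and the rule above rely on the usual reference aliases (System, SystemPerf, Windows, Windows2008). Assuming those aliases, the manifest of your management pack needs references along these lines; the version numbers are illustrative, use the ones matching your environment:

      <References>
        <Reference Alias="System">
          <ID>System.Library</ID>
          <Version>6.1.7221.0</Version>
          <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
        </Reference>
        <Reference Alias="SystemPerf">
          <ID>System.Performance.Library</ID>
          <Version>6.1.7221.0</Version>
          <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
        </Reference>
        <Reference Alias="Windows">
          <ID>Microsoft.Windows.Library</ID>
          <Version>6.1.7221.0</Version>
          <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
        </Reference>
        <Reference Alias="Windows2008">
          <ID>Microsoft.Windows.Server.2008.Discovery</ID>
          <Version>6.0.6989.0</Version>
          <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
        </Reference>
      </References>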

And that's all, folks: now my reports are returning meaningful data once again.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  1. #1 by Ben on February 21, 2011 - 9:24 pm

    • #2 by Daniele Grandini on February 22, 2011 - 9:56 am

      Hi Ben,
yes, this seems to be the case; the fix was released on Feb 10th and I completely missed it. From the description we can infer that many counters collected by OpsMgr are affected by this issue. Right now the x64 version is still not downloadable from support.microsoft.com. I will definitely add the fix to the required ones.

      Thanks
      Daniele
