The quest for agent health – agent cpu usage unreliability


As you know, one of my primary focuses is OpsMgr agent health. We all know MP quality could be better and that there’s no limit to the smart monitoring rules we can write, but if the infrastructure is having issues all your monitoring efforts are useless. My two goals for agent health are:

  • be sure agents are posting data
  • be sure agents are not taxing the monitored system too much

    Before the release of the Operations Manager management pack version 6.1.7672.0 (if I remember right) there was no useful control on agent CPU usage, so we developed our own monitors for healthservice and monitoringhost CPU utilization. Version 6.1.7672.0 introduced a comprehensive monitor on CPU usage that sums the impact of healthservice, monitoringhost and the spawned processes (i.e. cscript.exe, cmd.exe, powershell, …). About a month ago we decided to switch from our monitors to the Microsoft ones. This made a lot of sense: no more custom monitors to take care of and a more comprehensive check that includes spawned processes.

    Immediately we started to get several alerts about agents using too much CPU. Most of them were virtual machines, and the reported CPU utilization was alarming (well above 20% on average). After some analysis using Performance Monitor (which I still consider super partes and the reference source of information for performance counters) we concluded these were just false positives.

    It was time to check the code used to get the CPU utilization. There I found a couple of major errors and a nasty side effect caused by WMI:

    • the script doesn’t exclude itself from the computation; this can lead to overestimated CPU utilization, since the script is clearly using CPU cycles right when it is reading the counters
    • the script excludes 0-valued iterations from the average, which once again pumps up the average
    • the Win32_PerfFormattedData_PerfProc_Process class results, in my observations, disagree with the Perfmon ones

    It doesn’t exclude itself:

    Do While childFound = True
        childFound = False
        For Each oProcess in oProcessList
            If IsObjectUnallocated(oProcess) = False Then
                If (InStr(agentProcIDs, "|" & oProcess.ParentProcessId & "|") > 0) AND (InStr(agentProcIDs, "|" & oProcess.ProcessId & "|") = 0) Then
                    WScript.Echo "Adding child process: " & oProcess.Name & " - " & oProcess.ProcessId
                    agentProcIDs = agentProcIDs & oProcess.ProcessId & "|"
                    childFound = True
                End If
            End If
        Next
    Loop

    It averages only data points greater than 0:

    ' Add the total percentage time to the final percentage time for averaging in the end
    If totalPercentProcessorTime > 0 Then
        finalPercentProcessorTime = finalPercentProcessorTime + totalPercentProcessorTime
        dataCount = dataCount + 1
    End If
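
    To make the effect of that check concrete, here is a minimal sketch in Python (not the probe's VBScript). The sample readings are made up for illustration, and the two function names are my own, not part of the probe:

```python
def average_skipping_zeros(samples):
    """Mimics the probe's check: only samples greater than 0 enter the average."""
    counted = [s for s in samples if s > 0]
    return sum(counted) / len(counted) if counted else 0.0


def average_all(samples):
    """Counts every sample, idle intervals included."""
    return sum(samples) / len(samples) if samples else 0.0


# Hypothetical CPU % readings: the agent is idle in three of five samples.
samples = [0.0, 10.0, 0.0, 0.0, 20.0]

skewed = average_skipping_zeros(samples)
true_avg = average_all(samples)
print(skewed, true_avg)
```

    With these made-up readings, dropping the idle samples more than doubles the reported average, and the distortion grows the more idle the agent actually is.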

    It uses the Win32_PerfFormattedData_PerfProc_Process WMI class:

    Set objRefreshableItem = objRefresher.AddEnum(oWMIService , "Win32_PerfFormattedData_PerfProc_Process")

    I generally treat PerfMon as a reliable source for performance data, so I compared the data collected on two busy agents over a 24-hour period:

  • The standard SCOMpercentageCPUTimeScriptProbe returned an average of 11.17% CPU usage with 5-minute polling
  • Perfmon (including healthservice + monitoringhost instances + cscript instances) returned 2.73% CPU usage with 2-minute polling
  • My own monitoring probe, which excludes itself from the computation, returned 1.78% CPU usage with 5-minute polling

    To make it clearer take a look at the following chart:

  • the blue line is the standard/Microsoft agent cpu usage
  • the red line is my own revised rule

    [image: chart of the standard/Microsoft agent CPU usage (blue) versus the revised rule (red)]

    The following chart adds the healthservice (green) and monitoringhost (orange) processes' CPU usage to the picture. As you can see they’re negligible, so most of the agent CPU usage on this system is related to spawned processes (in this specific case cscript.exe instances related to the SQL MP).

    [image: the same chart with healthservice (green) and monitoringhost (orange) CPU usage added]

    The conclusion is that this probe is simply unreliable and prone to false positives. So instead of turning off our own monitors, we integrated the idea of summing up child processes into them. It was simple, since we already used similar logic to get CPU usage for the healthservice. If you’re interested, these are the mods you need to apply to the Microsoft standard probe (SCOMpercentageCPUTimeScriptProbe).

    First add a way to find the process ID of the current script (alas, cscript doesn’t have a simple way to do that):

    Set processes = oWMIService.execQuery("select ProcessID, Caption, CommandLine from win32_process")
    For Each process In processes
        With process
            ' The running script is the cscript.exe instance whose command line names this script
            If .caption="cscript.exe" and InStr(.CommandLine, WScript.ScriptName) > 0 Then
                wscript.echo .ProcessID, .Caption, .CommandLine
                myProcessId = .ProcessID
            End If
        End With
    Next

    Then add the logic for excluding that process from the process list:

    Do While childFound = True
        childFound = False
        For Each oProcess in oProcessList
            If IsObjectUnallocated(oProcess) = False Then
                ' If parent process is in the agentProcIDs list but the process itself is not, it's a new child
                ' Filter out myself
                If oProcess.ProcessId <> myProcessId Then
                    If (InStr(agentProcIDs, "|" & oProcess.ParentProcessId & "|") > 0) AND (InStr(agentProcIDs, "|" & oProcess.ProcessId & "|") = 0) Then
                        WScript.Echo "Adding child process: " & oProcess.Name & " - " & oProcess.ProcessId
                        agentProcIDs = agentProcIDs & oProcess.ProcessId & "|"
                        childFound = True
                    End If
                End If
            End If
        Next
    Loop
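
    The loop above keeps rescanning the process list until a pass finds no new children, skipping the probe's own process. The same idea can be sketched in Python over a plain list of (pid, parent pid) pairs. The function name, the pairs and the PIDs below are hypothetical, chosen only to illustrate the logic:

```python
def collect_agent_pids(processes, root_pids, exclude_pid=None):
    """Return the root PIDs plus all their descendants, skipping exclude_pid.

    processes: iterable of (pid, parent_pid) pairs.
    """
    agent_pids = set(root_pids)
    child_found = True
    # Rescan until a full pass adds nothing new, like the Do While loop.
    while child_found:
        child_found = False
        for pid, parent_pid in processes:
            if pid == exclude_pid:
                continue  # filter out the measuring script itself
            if parent_pid in agent_pids and pid not in agent_pids:
                agent_pids.add(pid)
                child_found = True
    return agent_pids


# Hypothetical process table: 100 is the agent, 400 is the probe script.
procs = [(100, 4), (200, 100), (300, 200), (400, 200), (500, 4)]
print(sorted(collect_agent_pids(procs, [100], exclude_pid=400)))
```

    Note that in this sketch anything spawned by the excluded process is left out as well, because its parent never enters the set.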

    Remove the check on 0-valued data points:

    ' Add the total percentage time to the final percentage time for averaging in the end
    ' there's no reason 0 values must be excluded from the average
    'If totalPercentProcessorTime > 0 Then
    finalPercentProcessorTime = finalPercentProcessorTime + totalPercentProcessorTime
    dataCount = dataCount + 1
    'End If

    Finally, if you want to, change the WMI performance source from "Win32_PerfFormattedData_PerfProc_Process" to the following function:

    Function GetProcessorTime( ProcID, objService )
        Dim N1, D1, N2, D2, Nd, Dd, PercentProcessorTime
        Dim objInstance1, objInstance2
        On Error Resume Next

        ' First sample of the raw counter and its timestamp
        For Each objInstance1 in objService.ExecQuery("Select * from Win32_PerfRawData_PerfProc_Process where IDProcess = '" & ProcID & "'")
            N1 = objInstance1.PercentProcessorTime
            D1 = objInstance1.TimeStamp_Sys100NS
            Exit For
        Next

        WScript.Sleep(1000)

        ' Second sample, one second later
        For Each objInstance2 in objService.ExecQuery("Select * from Win32_PerfRawData_PerfProc_Process where IDProcess = '" & ProcID & "'")
            N2 = objInstance2.PercentProcessorTime
            D2 = objInstance2.TimeStamp_Sys100NS
            Exit For
        Next

        ' CounterType - PERF_100NSEC_TIMER
        ' Formula - ((N2 - N1) / (D2 - D1)) x 100
        Nd = (N2 - N1)
        Dd = (D2 - D1)
        PercentProcessorTime = (Nd / Dd) * 100

        ' Any error (process gone, division by zero, ...) is reported as 0
        If Err.Number <> 0 Then
            GetProcessorTime = 0
        Else
            GetProcessorTime = Round(PercentProcessorTime, 3)
        End If
    End Function
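
    The arithmetic behind the function is just the ratio between two deltas: CPU time consumed by the process and elapsed time, both in 100 ns units. A minimal Python sketch of that calculation follows; the function name and the counter values are invented for illustration and do not come from a real system:

```python
def percent_processor_time(n1, d1, n2, d2):
    """(N2 - N1) / (D2 - D1) * 100, rounded to 3 decimals.

    n1, n2: raw process CPU time at the two samples (100 ns units).
    d1, d2: timestamps of the two samples (100 ns units).
    Returns 0.0 when no time has elapsed, like the error branch above.
    """
    elapsed = d2 - d1
    if elapsed <= 0:
        return 0.0
    return round((n2 - n1) / elapsed * 100, 3)


# Hypothetical samples taken one second apart (10,000,000 ticks of 100 ns),
# during which the process consumed 250,000 ticks of CPU time.
usage = percent_processor_time(1_000_000, 50_000_000, 1_250_000, 60_000_000)
print(usage)
```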

    – Daniele

    This posting is provided "AS IS" with no warranties, and confers no rights.

    1. #1 by Jean on January 6, 2016 - 2:26 pm

      Hi Daniele, Do you still have this MP? Would it be possible to share once more?
      Many Thanks,
      Jean

    2. #4 by KeithK on May 4, 2012 - 4:29 pm

      Daniele, we are having issues with the “agent processor utilization” monitor and many false positives. The calculation for the “Collect agent processor utilization” rule is overstated. I was tempted to just override (targeting all agent managed computers) this monitor with the value of 75% just to provide some level of protection, until I came across your article about creating your own custom “agent processor utilization” collection rule while disabling the native one. I tried to access your mp on your skydrive, but could not access. Do you know if I can still access somewhere? Many thanks -Keith

      • #5 by Daniele Grandini on May 4, 2012 - 5:31 pm

        Hi Keith,
        you can find the MP here [https://www.sugarsync.com/pf/D6284134_0813286_45926] (I moved the repository off skydrive and I left back many broken links, sorry)

        Thanks for reading the blog.

        Daniele

    3. #6 by Chris Morgan on February 7, 2012 - 5:19 pm

      Daniele,

      Having the same issue . The link does not work … can you send working link or is this no longer available ?

    4. #8 by Jonathan on August 12, 2011 - 5:30 am

      Hi Daniele. In your testing, does including a sync time at 12:00 preclude any collections that would occur between HS start time and sync time?

      • #9 by Daniele Grandini on August 17, 2011 - 4:35 pm

        Hi Jonathan,
        it depends on the frequency: if I sync at 12:00 and have a frequency of 24 hours (86400 secs) then yes, the script will wait until 12:00, but if the frequency is, say, 1 hour it will run at every hour span. This doesn’t take into account any on-demand action that might have been defined.
        – Daniele

    5. #10 by Daniele Grandini on April 29, 2011 - 6:19 pm

      If you want to give a try to this solution I just posted a sample MP (http://cid-558ec647eef17f8d.office.live.com/self.aspx/.Public/Sample%20MPs/QND.HSCPU.xml) as usual no warranties and use it at your own risk.
      – Daniele
