DNS MP – guilty once again? | Quae Nocent Docent

DNS MP – guilty once again?

When a DNS zone is Active Directory integrated and you have several DCs, the discovery for the DNS Domain class generates useless traffic and high CPU usage. One of the discovered properties is the “Primary Server”. Alas, every DC hosting an AD-integrated zone is a primary server for that zone, and since the DNS Domain is discovered on every DC, we have a race condition in which each DC may overwrite the previous discovery. This in turn triggers, as detailed by Fabrizio in a previous post, a configuration reload on every DC. As things stand, a configuration reload is a CPU-intensive operation…

This is the definition of the DNS Domain class. As you can see, PrimaryServer is not a key property, so every instance is identified by the DomainName property alone.

By default the discovery runs every 6 hours with no sync time. If, for example, the zone is distributed across 6 DCs, we can have one configuration reload per DC per hour; more DCs mean more configuration reloads. This can quickly become an issue. Once again, a configuration reload is a CPU-intensive operation for the Health Service process.
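The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the worst case, in which every unsynchronized discovery overwrites PrimaryServer and each overwrite triggers a configuration reload on every DC hosting the zone. The function names are illustrative and not part of SCOM.

```python
def reloads_per_dc_per_hour(dc_count: int, interval_hours: float) -> float:
    """Worst-case configuration reloads seen by each DC per hour.

    Assumption: each of the dc_count DCs submits one discovery per interval,
    the submissions are spread out (no sync time), and every submission
    changes PrimaryServer, forcing a reload on all DCs.
    """
    if dc_count < 1 or interval_hours <= 0:
        raise ValueError("dc_count must be >= 1 and interval_hours > 0")
    return dc_count / interval_hours


def total_reloads_per_day(dc_count: int, interval_hours: float) -> float:
    """Worst-case reloads per day summed over all DCs."""
    return reloads_per_dc_per_hour(dc_count, interval_hours) * 24 * dc_count


if __name__ == "__main__":
    for dcs in (2, 6, 12):
        per_hour = reloads_per_dc_per_hour(dcs, 6)
        per_day = total_reloads_per_day(dcs, 6)
        print(f"{dcs} DCs: {per_hour:.2f} reloads/DC/hour, {per_day:.0f} reloads/day overall")
```

Note that the total grows with the square of the number of DCs, which is why the problem gets worse quickly in larger domains.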

The DNS MP potentially has another issue: for each DNS zone it discovers the zone SerialNumber. The DNS server increments the SerialNumber on every change to the zone, so we can assume it will change on every discovery cycle. From my initial research, this property is never used for monitoring purposes. Since the discovery runs every 6 hours, it means a reload every 6 hours. This is not dramatic, but when bad MP coding habits add up, they increase the total CPU usage of the Health Service and bring the net impact of monitoring to an unacceptable level. We have DCs and ISA servers well above a 5% average CPU utilization for the HealthService process. With the previous DNS MP we had HS CPU usage over 10% on average.

My modest advice:

closely monitor instance space changes every time you import a new MP (it can be done with a SQL query). Frequent changes in a static environment mean bad discovery processes.
define a custom rule to collect HS CPU usage and report on agents whose average usage is above 5%
override bad discovery scripts with more polite ones (as we did for ISA and DNS)
raise your voice and ask Microsoft for more testing effort in MPs, patching and so on. We should concentrate on bringing value by building monitoring blocks and reports for LOB apps, not on fixing bugs in Microsoft code, shouldn’t we?
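To illustrate the first piece of advice, here is a minimal sketch of how property churn could be spotted once discovery data has been exported. It does not query the SCOM databases: the snapshot structure, the instance keys and the sample values are all hypothetical, chosen only to mirror the PrimaryServer and SerialNumber cases described above.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical export format: instance key -> {property name: value}
Snapshot = Dict[str, Dict[str, str]]


def churning_properties(snapshots: Iterable[Snapshot]) -> Counter:
    """Count, per property name, how many times its value changed
    between consecutive snapshots of the same instance."""
    changes: Counter = Counter()
    previous = None
    for current in snapshots:
        if previous is not None:
            for key, props in current.items():
                old = previous.get(key, {})
                for name, value in props.items():
                    # Only count a change if the property existed before
                    if name in old and old[name] != value:
                        changes[name] += 1
        previous = current
    return changes


# Made-up data: three discovery cycles in an otherwise static environment
snapshots = [
    {"domain:example.test": {"DomainName": "example.test", "PrimaryServer": "DC1"},
     "zone:example.test": {"SerialNumber": "100"}},
    {"domain:example.test": {"DomainName": "example.test", "PrimaryServer": "DC2"},
     "zone:example.test": {"SerialNumber": "101"}},
    {"domain:example.test": {"DomainName": "example.test", "PrimaryServer": "DC3"},
     "zone:example.test": {"SerialNumber": "102"}},
]

changes = churning_properties(snapshots)
for name, count in changes.most_common():
    print(f"{name}: changed {count} times")
```

Properties that show up in this list while nothing changed in the environment are the candidates to drop from a discovery.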

This entry was posted on January 19, 2009, 7:06 pm and is filed under Uncategorized.

#1 by Robin Drake on February 14, 2009 - 10:45 pm

I’m unclear about the role of the DNS discovery in the average 5% CPU usage if it only fires once every 12 hours?

- #2 by Daniele Grandini on February 16, 2009 - 9:26 am
  
  Take a look at this post: https://nocentdocent.wordpress.com/2009/01/18/class-properties-that-get-updated-frequently-is-a-worst-practice-not-only-for-rms/
  
#3 by Daniele Grandini on February 14, 2009 - 10:13 pm

Hi Robin,
I fear we’re not speaking the same language. I’m talking about HealthService CPU usage on monitored servers, not on the MS/RMS. If the HS uses more than 5% CPU time on *average*, we have an issue: CPUs are there to serve business processes, not the monitoring stuff.
You’re right about the DW issue, which was fixed with a post-SP1 fix, but there’s an HS issue as well, as documented by Fabrizio in his post.
I can’t remember having posted anything on points 1) and 2).

#4 by Robin Drake on February 14, 2009 - 9:51 pm

With reference to the comment about conditional forwarders:
1.) The conditional forwarder script does an nslookup -querytype=a for the zone name on the targeted server. This of course doesn’t exist (that would be a stub zone) and is intended to fail so that the query is forwarded to the nominated name server. This is therefore a true test of forwarder availability.
2.) It is correct that the host name override is not used on a conditional forwarder; it always tests the name of the zone that is targeted. This override is only used with unconditional forwarders. All of this is by design, as it allows a single script to be used for all the forwarder, response time, zone availability and server responsiveness monitors.

As regards the DNS discovery issue, I believe this is scheduled by default for 1/12 hrs and I’m having trouble seeing why a 5% CPU increase is an issue. Or did you mean on the RMS? Even then, that’s what CPUs are for. We have several MPs where we deliberately update the discovery information on every run, and so long as they aren’t too frequent there seems to be no problem. In dealing with overactive discovery data during testing and development, we found the most serious issue was a jammed OpsMgrDW, not CPU. However, by balancing out the collection frequency we have a stable environment, and I believe the DW issue was fixed in SP1.

#5 by Daniele Grandini on February 14, 2009 - 11:03 am

Hi Dan,
your question deserves a somewhat articulated answer. Let’s start with the simple part: to make the DNS MP quieter we simply rewrote the discoveries and disabled the old ones. I have no problem sharing this simple MP with you. (http://cid-558ec647eef17f8d.skydrive.live.com/self.aspx/.Public/Sample%20MPs/Progel.DNS.Overrides.xml) The net effect is a huge drop in CPU usage for HealthService. Compared with the old DNS MP, the new MP + our overrides brought the average CPU usage from 10-15% down to 3-4%. The rewrite simply no longer discovers the properties with a high change rate. I want to add that it’s useful to discover properties just for documentation and change tracking, but the current implementation of the HealthService discourages it.
The point around agent CPU usage is thorny. IMHO the net effect of monitoring on agents varies from barely tolerable to intolerable in terms of CPU usage. On every monitored machine you must add up HealthService.exe + MonitoringHost.exe (in some cases several of them) + spawned processes + wmiprvse. HealthService is too sensitive to MP design: I can tax a monitored machine with my MP, but I shouldn’t be able to compromise HealthService functionality. I guess that with a little optimization it wouldn’t be necessary to reload the entire agent configuration when a single property of a single class instance changes its value. I understand the SCOM team has been in bug jail for so long, but now it’s time to make things smoother and add more quality checks, at least for Microsoft MPs (see my next post on the KMS MP bug).
So we’re still fighting to find a good health check for the deployed agents. Our short list right now is:

Average CPU usage under 5% on a daily timeframe

Alert on CPU Usage over 60% for 15 minutes

Patching level check not just on KB numbers but discovering key DLLs as well (you should know the mess around patching we had)

Check for a minimum data flow, we need to be sure events and performance points are flowing in. We had agents that were heartbeating but doing nothing else.
We’re still debating internally if we can live with a daily report or if we must develop some monitoring criteria as well.
Ciao
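The two CPU criteria from the short list in comment #5 can be sketched in Python as follows. This is only an illustration, not an actual SCOM rule or monitor: the sampling interval, the function names and the assumption that each sample is the average HealthService CPU percentage over its interval are all hypothetical.

```python
import math
from typing import Sequence

AVG_LIMIT = 5.0      # daily average threshold (percent), from the list
SPIKE_LIMIT = 60.0   # sustained usage threshold (percent), from the list
SPIKE_MINUTES = 15   # how long usage must stay above SPIKE_LIMIT


def daily_average_ok(samples: Sequence[float]) -> bool:
    """True if the average of one day of samples is under AVG_LIMIT."""
    if not samples:
        raise ValueError("no samples collected")
    return sum(samples) / len(samples) < AVG_LIMIT


def sustained_spike(samples: Sequence[float], sample_interval_minutes: int = 5) -> bool:
    """True if usage stayed above SPIKE_LIMIT for at least SPIKE_MINUTES.

    Assumes each sample is the average over sample_interval_minutes, so
    consecutive samples above the limit cover a contiguous time span.
    """
    needed = math.ceil(SPIKE_MINUTES / sample_interval_minutes)
    run = 0
    for value in samples:
        run = run + 1 if value > SPIKE_LIMIT else 0
        if run >= needed:
            return True
    return False
```

With 5-minute samples, three consecutive readings above 60% would raise the alert, while a single busy interval would not.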

#6 by Dan Rogers on February 13, 2009 - 12:04 am

This is an interesting post. Daniele, what do you propose to change the behaviors to be more suitable? One approach might be to remove the properties that can change at the discovery interval (serial number, primary server name, etc). Would this make the grade?

You mentioned that you override these discoveries and add in a more modest one. Can you post that here so that others can see it? Also, for the change, what improves? CPU? IO?

As a more general question, you mention that 5% is an upper CPU threshold. Do others reading here buy into that threshold? The reason I ask is that if we were to add a monitor on agent CPU%, it might be interesting to look for a reasonable limit.

#7 by Ian on January 21, 2009 - 6:21 pm

How about forwarded zones? For a conditionally forwarded zone, SCOM attempts to resolve an A record for the zone label, which, depending on the configuration of the zone, may not exist. Even worse, while there is an overridable option for the record to look up, it isn’t used for a conditional zone. Not nice.

Also, a test is carried out for each DC X each forwarded zone X each name server specified for that zone… not nice x2.

#8 by Robin Drake on January 21, 2009 - 2:14 pm

I think each discovered zone has a unique identity which includes the name of the server on which the discovery was performed. Therefore, although the zone ‘myZone.com’ is recorded many times (once for each DC), in reality each one is different, having the full name ‘MyZone.com(DC1)’, ‘MyZone.com(DC2)’, etc. Each entry only ever has one primary server: itself. Under these circumstances there should be no race. To see this, create a view of DNS Zone state and add the appropriate columns (DNS MP v.6.0.6480.0 at 2008-12-19).

- #9 by Daniele Grandini on January 21, 2009 - 4:38 pm
  
  Hi Robin, my fault: I talked about DNS Zones, but in fact I’m referring to DNS Domains. I will update the post to reflect the problem on DNS Domains and add some considerations on the SerialNumber discovery for each zone, which typically changes at every discovery.
  
  - #10 by Daniele Grandini on January 21, 2009 - 6:32 pm
    
    Post updated with more info