Troubleshooting 21025 events – Part 1 evidence


I’m still struggling with what I consider abnormal CPU usage for my agents and RMS. I found out I’m not alone (is-your-rms-updating-configuration-too-frequently.aspx), so I think it’s time to share what I discovered from my investigations in OpsMgr 2007 R2 RTM code. This first post focuses on 21025 events. We talked about them in the past (Class properties that get updated frequently is a WORST PRACTICE not only for RMS) and recently (How to get noisy discovery rules). A word of caution: the following are my observations. I don’t have access to the source code nor to debugging symbols, so even though I double-checked every sentence I can be wrong, and I surely cannot cover every possible scenario. So what are OpsMgr Connector 21025 events?


– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

21025 events are logged every time the HealthService needs to update its configuration. Possible causes:

  1. a new MP is imported and has workflows that target one of the instances discovered on the HealthService
  2. an override has been defined on a workflow that targets one of the *classes* discovered on the HealthService
  3. a new instance has been discovered or an already discovered property changes its value (it takes about 50 seconds after the config change to get a 21024 event, see below)
  4. other causes: I’m still investigating 21025 events on the RMS that seem unrelated to any of the causes above

By instance I mean a specific discovered object, say Logical Disk C:. By class I mean a generic object type, say Logical Disk. So from the above causes we can infer, among other things, that:

  1. If I import a new MP with discovery workflows targeted to Windows.Computer, all my agents will get it and fire a 21025 *even if* the discovery rules are disabled. We can assume every agent has a Windows.Computer instance in its config space.
  2. If I enable a discovery rule for just one of my Windows.Computer objects (a specific instance), all the Windows.Computer instances will nevertheless get the override and fire a 21025.

** room for optimizations Marius :-) **
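The two inferences above can be illustrated with a toy model. This is only a sketch of the behavior I observed, not OpsMgr code: the `Agent` type and the `agents_reloading_config` helper are hypothetical names invented for the illustration. The point it makes is that delivery is decided by class, not by instance.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for an agent and its config space."""
    name: str
    # Class names of the instances discovered on this agent's HealthService.
    discovered_classes: set = field(default_factory=set)


def agents_reloading_config(agents, changed_class):
    """Return the agents expected to log a 21025 after a change to a
    workflow or override that targets `changed_class`.

    Assumption (from the observations above): the match is made on the
    class, so the specific instance named in an override does not matter.
    """
    return [a.name for a in agents if changed_class in a.discovered_classes]


agents = [
    Agent("srv01", {"Windows.Computer", "Logical Disk"}),
    Agent("srv02", {"Windows.Computer"}),
    Agent("srv03", {"Windows.Computer"}),
]

# An override meant for srv01 only still reaches every agent that has a
# Windows.Computer instance in its config space.
print(agents_reloading_config(agents, "Windows.Computer"))
```

Under this model an override on Logical Disk would reach only srv01, while anything targeted at Windows.Computer reaches the whole agent population.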

Remember that the RMS “hosts” non-hosted classes and watchers, so any change to those will fire a 21025 on the RMS, but this is just a particular case of cause 3.

With R2 I found no evidence that a changed discovered property on a single agent, or a newly discovered object instance, would cause a 21025 on the RMS. I ran a complete test suite on the topic that led me to the above conclusions, and a changed property or a newly discovered *hosted* instance *never* caused a 21025 on the RMS. Once again, I don’t pretend to have covered every single scenario.

But why are 21025 events bad? As Steve explains in his blog post, 21025 events have the bad habit of taxing the CPU. From my observations, 21025 events on agents fall into different types in terms of resource utilization:

  1. 21025 caused by a new MP or override: these load just the affected MP (since they load only one or a few MPs they don’t tax the system, provided you do not import and override every minute)
  2. 21025 caused by a new instance or a property change: in R2 these reload the config without reparsing the MPs, they just update the OpsMgrConnector.Config file (once again very low impact, provided your discovery is not running every minute)
  3. 21025 caused by ??(@!@@!*!): these reparse a lot of MPs, causing prolonged CPU spikes; single-core or older machines are especially affected

By MP reparsing I mean the process of reparsing and reloading MPs from the local cache (%programfiles%\System Center Operations Manager 2007\Health Service State\Management Packs).
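Since the impact of the first two types depends on how often they happen, a quick way to judge an agent is to count its 21025 events per hour. The following is a minimal sketch that assumes the events have already been exported from the Operations Manager event log into a list of `(timestamp, event_id)` pairs; the function name and the input shape are my own choices for the example.

```python
from collections import Counter
from datetime import datetime


def count_21025_per_hour(events):
    """Count 21025 events per clock hour.

    `events` is an iterable of (datetime, event_id) pairs, assumed to be
    exported beforehand from the agent's Operations Manager event log.
    """
    counts = Counter()
    for timestamp, event_id in events:
        if event_id == 21025:
            # Truncate the timestamp to the start of its hour.
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            counts[hour] += 1
    return counts


sample = [
    (datetime(2009, 9, 1, 10, 5), 21024),
    (datetime(2009, 9, 1, 10, 6), 21025),
    (datetime(2009, 9, 1, 10, 40), 21025),
    (datetime(2009, 9, 1, 11, 15), 21025),
    (datetime(2009, 9, 1, 11, 16), 1210),
]

for hour, count in sorted(count_21025_per_hour(sample).items()):
    print(hour, count)
```

An agent showing several reloads per hour, every hour, is a good candidate for a closer look at its discovery rules.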

21025 events on the RMS are worse: in addition to the above, they fire an internal task towards all agents involved in distributed rollup monitors for which the RMS thinks it doesn’t have up-to-date state. This internal task does not log a 21025 event on the agents, but it causes a complete MP cache reload. This behavior was introduced by KB 958490 (Patching blues – QFE 958490 pumps up HealthService CPU Usage on agents) and is confirmed, with some notable fixes, in R2 (with R2 it doesn’t have the bad and widespread effect described in my post). So if, for any reason, you have agents that are not able to calculate a state involved in a distributed workflow, they are impacted at every 21025 on the RMS. I have 10 of these among about 300 deployed agents.

My investigation is still missing a key part: the extra causes (cause 4) for 21025 events on the RMS. I’m working on it with the invaluable help of the product team (that still keeps too many secrets :-)), and I’ll keep you up to date on any development, hopefully with a complete list of causes and fixes. If anyone reading has more info to add, I’m here to listen. :-)

Finally, I want to document the expected sequence of events that leads to a 21025 on the agent side:

  1. a 21024 event asking the MS for up-to-date config
  2. a 21025 event if the agent is not up to date, or a 21026 if it is up to date (and here we stop)
  3. a 7023, 7025, 7024, 7028 event sequence for every management group (typically just one, but possibly up to 4); these are related to Run As accounts
  4. if the config change involves new MPs, a 1200 followed by one 1201 for every MP downloaded
  5. finally, a 1210 stating that the new configuration has been loaded

Watch out for a 21025 not followed by a 1210: this can indicate that a corrupted configuration has been downloaded.
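The check for a 21025 without its closing 1210 can be automated. Here is a minimal sketch, assuming the events have been exported in chronological order as `(timestamp, event_id)` pairs. The function name is hypothetical, and treating the next 21024 or 21025 as the start of a new cycle is my own assumption based on the sequence above.

```python
CONFIG_REQUESTED = 21024  # agent asks the MS for up-to-date config
CONFIG_OUTDATED = 21025   # agent is not up to date, a reload starts
CONFIG_LOADED = 1210      # new configuration has been loaded


def find_unfinished_reloads(events):
    """Return the timestamps of 21025 events never followed by a 1210.

    `events` must be in chronological order. A reload is considered
    unfinished when a new 21024 or 21025 shows up before the 1210
    (assumption: that marks the start of the next cycle). A 21025 still
    pending at the end of the list is reported too, although it may
    simply be a reload in progress.
    """
    unfinished = []
    pending = None
    for timestamp, event_id in events:
        if event_id in (CONFIG_REQUESTED, CONFIG_OUTDATED) and pending is not None:
            unfinished.append(pending)
            pending = None
        if event_id == CONFIG_OUTDATED:
            pending = timestamp
        elif event_id == CONFIG_LOADED:
            pending = None
    if pending is not None:
        unfinished.append(pending)
    return unfinished


log = [
    ("10:00:00", 21024), ("10:00:01", 21025), ("10:00:02", 7023),
    ("10:00:05", 1210),
    ("11:00:00", 21024), ("11:00:01", 21025), ("11:00:02", 7023),
    ("12:00:00", 21024), ("12:00:01", 21026),
]

print(find_unfinished_reloads(log))
```

In the sample log the reload started at 11:00:01 never reaches a 1210 before the next 21024, so it is the one to investigate.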


That is all for this first post. Stay tuned for more info on the topic.

