Rollup monitors do not roll up


I’ve been affected by this issue (roll ups that do not roll up) since my very first installation of OpsMgr. With R2 CU1 I was expecting a definitive fix for it; alas, this is not the case.

What am I talking about? Take a look

image

This issue seems to be especially related to gateway-managed agents, and it got worse with CU1. In my experience the issue manifests itself when there’s a dependency rollup monitor in the health model for the entity. A typical case is the Health Service Watcher class, where the general health depends on the health of the Health Service class.

This nasty issue seems related to some race or timing condition that occurs when state change events reach the hosting health service and it needs to recalculate the dependency monitor. The timing condition is not always easy to reproduce: for a couple of days I was able to repro the issue in a very simple lab with just one agent, but then the same repro steps stopped producing it. In my production environment (400 agents), I always get the problem when a gateway stays down for a while.

If you want to give it a try, the repro steps are easy in a lab with just one agent managed by one gateway (using the adt forwarder service “Audit Collection Service” monitor):

1) Stop the adtagent service and wait until your health service watcher turns red in the console

2) Stop the gateway serving your agent and wait until the agent turns “gray” (too many failed heartbeats). This is a must: if the agent is not marked as unreachable, everything works fine

3) Start the adtagent service and wait, say, 60 seconds, just to be sure the local agent has had a chance to recalculate the monitor locally

4) Start the gateway and wait for the data to flow in. Et voilà: the agent health is green, but the health service watcher remains red, with the dependency rollup for agent availability red even though all the dependent monitors are green
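The inconsistency produced by the steps above can be stated as a simple check. What follows is only a minimal sketch, not OpsMgr code: it assumes a worst-of rollup policy, and the `Health` enum and both function names are hypothetical, introduced purely for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    # Hypothetical ordering: a higher value means a worse state.
    UNINITIALIZED = 0
    GREEN = 1
    YELLOW = 2
    RED = 3


def expected_rollup(children):
    """Assumed worst-of policy: the rollup takes the worst child state."""
    return max(children, default=Health.UNINITIALIZED)


def is_stale(rollup_state, children):
    """True when the stored rollup state disagrees with its contributors."""
    return rollup_state != expected_rollup(children)


# The situation after step 4: watcher rollup red, all contributors green.
watcher_rollup = Health.RED
agent_monitors = [Health.GREEN, Health.GREEN, Health.GREEN]
print(is_stale(watcher_rollup, agent_monitors))  # True
```

A rollup for which `is_stale` returns true is exactly the kind of entity that stays red in the console after the gateway comes back.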

 image

I tried to dig into this issue to find a workaround. I first started with PowerShell and the SDK, but since rollup monitors won’t respond to resets (at least in my environment) I turned to good old T-SQL and developed a recalculate-health procedure, only to find out that the culprit is not the database but the local agent cache (in the case of the Health Service Watcher, the RMS cache).

I observed that entity health state is persisted by the hosting Health Service (HS) in its local cache (table HEALTH-[MGGUID]). The health state persisted locally differs from the database picture (state change events missing?). Comparing Marius’ runtime health explorer (left) with the database view (right) makes this evident:

image

A dirty (and not definitive) solution is to delete the health service state directory on the RMS, so that all rollups hosted by the RMS are recalculated. I had to implement this horrible workaround as a scheduled task that runs once a day to keep our agent health view clean. Obviously this is fine for entities managed by the RMS, but if we have entities with dependency monitors managed by other HSs, I suspect we can hit the same issue and have to reset those caches as well. Hopefully the team will provide us with a definitive fix for this.
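For reference, a scheduled task along these lines could look like the sketch below. It is not the script I actually run, and it rests on assumptions: the Windows service name is taken to be `HealthService`, the state directory path is left to the caller because it depends on the installation, and the function name and `run` parameter are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def reset_health_service_cache(state_dir, service="HealthService",
                               run=subprocess.run):
    """Stop the service, delete its state directory, start it again.

    state_dir: path of the health service state directory (caller-supplied).
    service:   assumed Windows service name of the health service.
    run:       command runner, injectable so the logic can be dry-run.
    """
    state_dir = Path(state_dir)
    run(["net", "stop", service], check=True)
    try:
        # With the service stopped the cache files are no longer locked.
        if state_dir.exists():
            shutil.rmtree(state_dir)
    finally:
        # Restart even if the delete fails, so monitoring is not left down.
        run(["net", "start", service], check=True)
```

The service recreates its cache on startup, which is what forces the rollups to be recalculated. Expect the host to be busy for a while afterwards, so this is best scheduled outside peak hours.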

In the end, my hypothesis is that state change events are lost in certain cases (agent unreachable?) and/or that not all state change events reach the database. Take a look at the following screenshots:

clip_image002

The rollup changed to warning at 2.19 on the root monitor

clip_image004

The Availability rollup monitor turned green at the same time

clip_image006

Caused by the HealthService dependency rollup (the target is the watcher)

clip_image008

But there are no state change events for the contributing monitors (interesting, huh?)

clip_image010

The Performance rollup turned from yellow to green at the same time, 2.19 (here comes the yellow)

clip_image012

Due to Performance dependency rollup state change

clip_image014

With no state change events in contributing monitors

clip_image015
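The pattern in the screenshots, a rollup changing state with no matching change in its contributors, can be searched for mechanically once the state change events are exported. This is only a sketch under assumptions: the event records, the monitor names and the exact-timestamp matching are all hypothetical simplifications, not the OpsMgr schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    monitor: str    # monitor name (hypothetical labels below)
    time: str       # timestamp as recorded, e.g. "2.19"
    new_state: str


def unexplained_rollup_changes(events, contributors):
    """Return rollup state changes with no contributor change at that time.

    contributors maps a rollup monitor name to the names of the monitors
    it rolls up; monitors missing from the map are treated as leaves.
    """
    changed = {(e.monitor, e.time) for e in events}
    orphans = []
    for e in events:
        kids = contributors.get(e.monitor)
        if kids is None:
            continue  # leaf monitor, nothing to explain
        if not any((k, e.time) in changed for k in kids):
            orphans.append(e)
    return orphans


contributors = {
    "Entity Health": ["Availability", "Performance"],
    "Availability": ["HealthService dependency rollup"],
    "HealthService dependency rollup": ["Agent-side monitor"],
}
events = [
    StateChange("Entity Health", "2.19", "warning"),
    StateChange("Availability", "2.19", "green"),
    StateChange("HealthService dependency rollup", "2.19", "green"),
]
print([e.monitor for e in unexplained_rollup_changes(events, contributors)])
# ['HealthService dependency rollup']
```

Real events would need a time window rather than an exact timestamp match, since a contributor and its rollup rarely change in the same instant.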

Please let me know if you have the same issue or a better solution to keep your health state view consistent.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  1. #1 by Daniele Muscetta on February 13, 2010 - 12:56 pm

    JET performance on the RMS is critical… we mostly think of analyzing and sizing the SQL database appropriately… only to find out later that JET on the RMS is critical too, and it is much more of a bottleneck.
    The RMS can be made to scale by: increasing hardware (you are using 64bit, right?), improving the disk I/O on the health service and config service stores, and – ultimately – by reducing config churn and general tuning.
    Sure, there might be more bugs and improvements possible but, in general, the strategy I suggest is that of load reduction through fine tuning: the fewer “movements” the data has to make, the fewer writes (everywhere, over the whole chain from agent to db, be they state change events, discovery data or any other data point), the better.

    • #2 by Daniele Grandini on February 13, 2010 - 3:07 pm

      Daniele, I fear this is not a matter of overload: I was able to reproduce the behavior in a single-agent lab. By the way, there is no overload excuse for missing state change events unless you fill up the agent cache, and even in that case there should be a resync process to keep things up to date (and indeed there was). Thanks to very aggressive tuning, the edb on my RMS is 512 MB even though we’re monitoring 400+ agents. While it is correct that edb performance and sizing are important topics in RMS capacity planning, I think this is not the case here. What I think is that the cluster fix introduced in CU1, the one related to wrong routing of internal tasks for rollup monitors, has something to do with this. Maybe they just decided to send these resync tasks a little too rarely… and once again, no excuses for missing state change events.

