I’ve been affected by this issue (rollups that do not roll up) since my very first installation of OpsMgr. With R2 CU1 I was expecting a definitive fix for it; alas, this is not the case.
What am I talking about? Take a look
This issue seems to be especially related to gateway-managed agents, and it got worse with CU1. In my experience the issue manifests itself when there’s a dependency rollup monitor in the health model for the entity. A typical case is the Health Service Watcher class, whose overall health depends on the health of the Health Service class.
This nasty issue seems related to some race or timing condition that occurs when state change events reach the hosting health service and it needs to recalculate the dependency monitor. This timing condition is not always easy to reproduce: for a couple of days I was able to repro the issue in a very simple lab with just one agent, but then the same repro steps stopped producing it. In my production environment (400 agents), I always have the problem when a gateway stays down for a while.
If you want to give it a try, the repro steps are easy in a lab with just one agent managed by one gateway (using the adt forwarder service “Audit Collection Service” monitor):
1) Stop the adtagent service and wait until your health service watcher turns red in the console
2) Stop the gateway serving your agent and wait until the agent turns “gray” (too many failed heartbeats). This is a must: if the agent is not marked as unreachable, everything works fine
3) Start the adtagent service and wait, say, 60 seconds, just to be sure the local agent has had a chance to recalculate the monitor locally
4) Start the gateway and wait for the data to flow in. Et voilà: the agent health is green, but the health service watcher remains red, with the dependency rollup for agent availability red even though all the dependent monitors are green
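To make the symptom concrete, here is a minimal Python sketch of what a dependency rollup is supposed to do and how the end state of step 4 contradicts it. This is a toy model, not OpsMgr code: the function names are made up for illustration, and it assumes a worst-of rollup policy over three states (green, yellow, red), ignoring unmonitored or unavailable states.

```python
# Toy model of a dependency rollup; this is not OpsMgr code.
# Assumption: worst-of policy with severity order green < yellow < red.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst_of(states):
    """Return the most severe state; an empty list rolls up to green."""
    return max(states, key=SEVERITY.__getitem__, default="green")


def is_stale(rollup_state, contributing_states):
    """True when the stored rollup disagrees with a fresh recalculation."""
    return rollup_state != worst_of(contributing_states)


# The situation after step 4: every contributing monitor is green,
# yet the stored rollup on the watcher is still red.
print(is_stale("red", ["green", "green", "green"]))  # True

# A red rollup with a red contributor is consistent.
print(is_stale("red", ["green", "red"]))  # False
```

In a healthy system the stored rollup state would always equal the recalculated one; the bug described here is exactly the case where `is_stale` would return `True`.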
I tried to dig into this issue to find a workaround. I first started with PowerShell and the SDK, but since rollup monitors won’t respond to resets (at least in my environment) I turned to good old T-SQL and developed a procedure to recalculate health, just to find out that the culprit is not the database but the local agent cache (in the case of the Health Service Watcher, the RMS cache).
I observed that entity health state is persisted by the hosting Health Service (HS) in its local cache (table HEALTH-[MGGUID]). The health state persisted locally differs from the picture in the database (missing state change events?). Comparing Marius’ runtime health explorer (left) with the database view (right) makes this evident:
A dirty (and not definitive) solution is to delete the health service state directory on the RMS; this way all rollups hosted by the RMS are recalculated. I had to implement this horrible workaround as a scheduled task that runs once a day to keep our agent health view clean. Obviously this is fine for entities managed by the RMS, but if we have entities with dependency monitors managed by other HSs, I suspect we can hit the same issue and would have to reset those caches as well. Hopefully the team will provide us with a definitive fix for this.
In the end, my hypothesis is that state change events are lost in certain cases (agent unreachable?) and/or that not all state change events reach the database. Take a look at the following screenshots:
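For reference, the cache reset could be scripted roughly as follows. This is only a sketch of the workaround described above, under a few assumptions: the Windows service is registered as `HealthService`, the state directory path is passed in explicitly (it depends on where OpsMgr is installed, so check it on your own RMS), and `reset_health_service_cache` is a hypothetical helper name. The `run` parameter exists only so the command execution can be swapped out when trying the script somewhere harmless.

```python
import shutil
import subprocess
from pathlib import Path


def reset_health_service_cache(state_dir, run=subprocess.check_call,
                               service="HealthService"):
    """Stop the health service, delete its state directory, start it again.

    state_dir: path of the "Health Service State" directory (assumption:
               you have verified this path on your own installation).
    run:       callable that executes a command given as a list of strings.
    """
    state_dir = Path(state_dir)
    run(["net", "stop", service])
    try:
        # The service rebuilds its cache on startup, which forces the
        # hosted rollups to be recalculated.
        if state_dir.exists():
            shutil.rmtree(state_dir)
    finally:
        # Always try to bring the service back, even if the delete failed.
        run(["net", "start", service])


# Example call (not executed here; the path is an assumption, adjust it):
# reset_health_service_cache(
#     r"C:\Program Files\System Center Operations Manager 2007\Health Service State")
```

Run from an elevated prompt and scheduled once a day, this does the same thing as the manual procedure. It is just as blunt: all locally cached state on that health service is discarded, not only the stale rollups.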
Rollup changed to warning at 2.19 on the root monitor
Rollup turned green at the same time for the Availability rollup monitor
Caused by the HealthService dependency rollup (the target is the watcher)
But no state change events for the contributing monitors (interesting, huh?)
Performance rollup turned from yellow to green at the same time, 2.19 (here comes the yellow)
Due to a state change in the Performance dependency rollup
With no state change events in the contributing monitors
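The pattern in these screenshots (a rollup changes state while none of its contributing monitors do) can also be looked for programmatically once the state change events have been exported. The sketch below is a hypothetical helper, not an OpsMgr API: it assumes events are plain `(monitor_name, timestamp, new_state)` tuples, and the monitor names, timestamps and five-minute window in the example are made up for illustration.

```python
from datetime import datetime, timedelta


def unexplained_rollup_changes(events, rollup, contributors,
                               window=timedelta(minutes=5)):
    """Return rollup state changes with no contributor change shortly before.

    events:       iterable of (monitor_name, timestamp, new_state) tuples
    rollup:       name of the rollup monitor to inspect
    contributors: names of the monitors that feed the rollup
    window:       how far back a contributor change still counts as the cause
    """
    events = list(events)
    contributor_times = [t for (m, t, _) in events if m in contributors]
    orphans = []
    for monitor, when, state in events:
        if monitor != rollup:
            continue
        explained = any(timedelta(0) <= when - t <= window
                        for t in contributor_times)
        if not explained:
            orphans.append((when, state))
    return orphans


# Illustrative data only: the rollup changes, the contributors never do.
sample = [
    ("Availability rollup", datetime(2000, 1, 1, 2, 19), "green"),
    ("Heartbeat monitor", datetime(2000, 1, 1, 0, 5), "green"),
]
print(unexplained_rollup_changes(
    sample, "Availability rollup", {"Heartbeat monitor"}))
```

Any entry returned by this function is a rollup transition that the recorded events cannot account for, which is what the hypothesis of lost state change events would predict.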
Please let me know if you have the same issue or a better solution to keep your health state view consistent.
This posting is provided "AS IS" with no warranties, and confers no rights.