Archive for November 2009
The case of post dated monitoring data
A few weeks ago one of our customers was hit by a time issue: the internal time reference server jumped to anno Domini 2020. Since this happened during the weekend, it took a few hours to fix; in the meantime the OpsMgr agents did their job, posting data to the OpsMgr infrastructure. The net result was a bunch of unhealthy monitors with their last changed time set to 2020. Annoying, I thought: I need to reset all the broken monitors via PowerShell. Alas, this was not only annoying but blocking as well: the monitors would not reset, nor would their state change in any case.
First consideration: the OpsMgr data access layer should block any data insertion with too large a time skew from its own date-time reference (10 to 15 minutes should be the maximum threshold, IMHO).
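To make the idea concrete, here is a minimal sketch of such a guard, written in Python rather than inside the OpsMgr data access layer. The function name `is_acceptable` and the constant `MAX_FUTURE_SKEW` are hypothetical, and the choice to reject only timestamps that lie in the future (late data from the past is still accepted) is my assumption, not something OpsMgr does today.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: 15 minutes, the upper bound suggested above.
MAX_FUTURE_SKEW = timedelta(minutes=15)


def is_acceptable(time_generated, now=None, max_skew=MAX_FUTURE_SKEW):
    """Return True if an incoming timestamp is not too far in the future.

    Assumption: only future skew is rejected, since legitimately delayed
    data can arrive with an old timestamp.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return time_generated - now <= max_skew


# The server's own clock at the time the data arrives (illustrative value).
reference = datetime(2009, 11, 7, 12, 0, tzinfo=timezone.utc)

# Five minutes ahead of the server clock: within the threshold.
print(is_acceptable(reference + timedelta(minutes=5), now=reference))

# Stamped in 2020 by an agent that followed the broken time source.
print(is_acceptable(datetime(2020, 1, 1, tzinfo=timezone.utc), now=reference))
```

With a check like this in front of the insert, the post-dated state changes would have been dropped (or at least flagged) instead of being stored.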
Anyway, time for some reverse engineering once again.
First of all, a few queries to identify the bogus state data. Two tables are involved here: StateChangeEvent and State. The former collects all state change events, the ones you can check in Health Explorer; the latter reports the last known state for any given managed entity / monitor pair.
Easy enough, let’s check for all data updated after December 31st:
select * from dbo.StateChangeEvent
where TimeGenerated > '12-31-2009'
select ME.FullName, M.MonitorName, State.* from dbo.State with (nolock)
inner join dbo.BaseManagedEntity ME with (nolock) on ME.BaseManagedEntityId=State.BaseManagedEntityId
inner join dbo.Monitor M with (nolock) on M.MonitorId=State.MonitorId
where State.LastModified > '12-31-2009'
Obviously my first thought was: let’s modify the LastModified field. But here we’re in the unsupported realm, and before any modification a further analysis of the inner workings is needed. The core stored procedure for any state change turned out to be:
PROCEDURE [dbo].[p_StateChangeEventProcess]
(
@BaseManagedEntityId uniqueidentifier,
@EventOriginId uniqueidentifier,
@MonitorId uniqueidentifier,
@NewHealthState tinyint,
@OldHealthState tinyint,
@TimeGenerated datetime,
@Context nvarchar(max) = NULL
)
This one in turn calls:
PROCEDURE [dbo].[p_StateUpsert]
(
@BaseManagedEntityId uniqueidentifier,
@MonitorId uniqueidentifier,
@HealthState tinyint,
@LastModified datetime
)
and if p_StateUpsert returns with success it will insert a row in the StateChangeEvent table.
p_StateUpsert, among other checks, compares the date and time of the state update with the last time the monitor state was updated: if the update is earlier in the timeline, the state change is discarded. This makes sense, since state changes are not guaranteed to arrive in chronological order. At the same time, without a control on time skew we can have a denial of service here: once a state dated 2020 is stored, every later (correctly dated) state change looks older and is thrown away.
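The effect of that check is easier to see in a small simulation. This is only a sketch of the behaviour described above, not the body of p_StateUpsert: the dictionary stands in for the State table, the function name `state_upsert` is mine, the strict "older than" comparison is an assumption, and the entity/monitor names and health state numbers are purely illustrative.

```python
from datetime import datetime

# In-memory stand-in for the State table, keyed by (entity id, monitor id).
# Each value is a (health_state, last_modified) tuple.
state = {}


def state_upsert(entity_id, monitor_id, health_state, last_modified):
    """Store the new state unless it is older than the one already stored.

    Returns True when the change is accepted, False when it is discarded.
    """
    key = (entity_id, monitor_id)
    current = state.get(key)
    if current is not None and last_modified < current[1]:
        return False  # out-of-order (or apparently out-of-order) change
    state[key] = (health_state, last_modified)
    return True


# A normal state change, then a post-dated one caused by the bad time source.
state_upsert("srv01", "cpu", 1, datetime(2009, 11, 7, 10, 0))
state_upsert("srv01", "cpu", 3, datetime(2020, 1, 1, 0, 0))

# The clock is fixed and the monitor recovers, but the new change carries a
# 2009 timestamp, which is "earlier" than the stored 2020 one.
accepted = state_upsert("srv01", "cpu", 1, datetime(2009, 11, 9, 8, 0))
print(accepted, state[("srv01", "cpu")])
```

The last call is rejected and the stored row keeps its 2020 timestamp, which is exactly why the monitors could not be reset until LastModified was brought back to a sane value.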
Anyway from my analysis the LastModified field can be safely changed (still unsupported realm):
update dbo.StateChangeEvent set TimeGenerated=TimeAdded
where TimeGenerated > '12-31-2009'
update dbo.State set LastModified = GETUTCDATE()
where State.LastModified > '12-31-2009'
From this change on, state changes will start flowing in again.
Issues: the monitors need to be reset, or you must wait for the first state change for them to be updated. Alternatively, you could use Marius’ utility (Tool- OpsMgr 2007 – RuntimeHealthExplorer) or a PowerShell script to reset all the postdated monitors. The basic statements need to be:
$obj = Get-MonitoringObject -id:<<basemanagedentityid from previous queries>>
$obj.ResetMonitoringState([guid]'<<monitorid from previous queries>>')
Last warning: if you reset the monitor from the UI and from a Watcher view, the new health state won’t roll up (at least in my environment). For example, if you reset any unit monitor related to a HealthService starting from the Health Explorer for the related HealthServiceWatcher, the unit monitor will reset but the new status won’t roll up. If you do the same reset from the HealthService Health Explorer view, it will roll up.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
A rollup you don’t want to miss
If you’re still running OpsMgr 2007 SP1 you definitely need to apply the rollup package that was released yesterday (finally): Update Rollup for Operations Manager 2007 Service Pack 1 (KB971541).
It should put an end to the patching blues I have complained about so many times.
Hopefully a similar rollup will be delivered for R2 sooner rather than later, from what I know. Some issues fixed in SP1 are still present in R2.
In the end, my advice is that if you’re still on SP1 it is still a good idea to move to R2.
– Daniele
Failover cluster monitoring – quick insight
While I was trying to understand why my cluster nodes wouldn’t dismiss, I dug a little deeper into cluster monitoring with OpsMgr. As usual I have no access to the source code, so I may be wrong on some assumptions.
The core logic behind cluster discovery and management is coded (natively) in mommodules.dll.
The dll exports:
- ClusterGroupStateChange: the name gives us some clues
- ClusterDiscovery: this one is used by the discovery workflow
Basically, every cluster node discovers every resource group (Virtual Server, VS) in the cluster and establishes a relationship of type HealthServiceShouldManageEntity. This tells the OpsMgr infrastructure to route the workflows for any Virtual Server to every cluster node, so every node receives all the workflows, even for the VSs it does not own at the moment. Obviously we just want the owning node to monitor its VSs. Without some custom logic we would have an issue here; in fact the agent has built-in logic to understand which VSs it is supposed to manage (i.e. the ones it owns). On the passive nodes the workflows get unloaded.
The second issue to face is the management of VS failover (i.e. when a resource group changes its owning node). From what I understand, the agent uses ClusterGroupStateChange to detect when a VS changes ownership; I measured a maximum delay of 60 seconds from resource group failover to workflow reload on the proper node. So far so good: the agent (as we expect) is able to manage the VS wherever it is. I had a couple of cases where this was not working properly on SP1, and they were resolved by restarting the health service.
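As a rough mental model of the behaviour described above, the following Python sketch simulates two nodes that both receive the workflows for every VS but keep loaded only the ones they currently own. It is a toy model built on my assumptions, not the logic inside mommodules.dll: the class `ClusterNode`, the method `on_group_state_change` and the node and VS names are all hypothetical.

```python
class ClusterNode:
    """Toy model of the agent on one cluster node (not the real agent code)."""

    def __init__(self, name, virtual_servers):
        self.name = name
        # Workflows for every virtual server are routed to every node.
        self.routed = set(virtual_servers)
        # Only workflows for owned virtual servers stay loaded.
        self.loaded = set()

    def on_group_state_change(self, owners):
        """Reload workflows so that only the owned virtual servers are monitored."""
        self.loaded = {vs for vs in self.routed if owners.get(vs) == self.name}


# Current owner of each resource group (illustrative names).
owners = {"SQLVS": "NODE1", "FILEVS": "NODE2"}
nodes = {name: ClusterNode(name, owners) for name in ("NODE1", "NODE2")}


def notify_all():
    """Deliver the state change notification to every node."""
    for node in nodes.values():
        node.on_group_state_change(owners)


notify_all()
print({name: sorted(node.loaded) for name, node in nodes.items()})

# Failover: SQLVS moves to NODE2, so NODE1 unloads it and NODE2 loads it.
owners["SQLVS"] = "NODE2"
notify_all()
print({name: sorted(node.loaded) for name, node in nodes.items()})
```

In the real agent the reload is not instantaneous; as noted above, it took up to 60 seconds in my measurements.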
One more thing to add: the VSs are managed by the health service as proxied systems. This has an important implication if you’re an MP author: all the workflows you want to execute against a VS must be tagged as remotable="true".
If you miss this important requirement you’ll get event ID 1207 after the agent reloads the workflows:
Event Type: Warning
Event Source: HealthService
Event Category: Health Service
Event ID: 1207
Date: 10/17/2009
Time: 3:01:23 PM
User: N/A
Computer: ARES1
Description:
Rule/Monitor "QND.Test.Cluster.LogEvent" running for remote instance "XXXX.progel.org" with id:"{A16D2CDA-378D-E9AC-7913-404A9999BEEE}" will be disabled as it is not remotable. Management group "Progel Labs".
This is it, straightforward but useful if you need to debug issues on your clustered agents.
– Daniele