Quae Nocent Docent

What hurts, teaches – Ordinary tales from management trenches

Post dated data – reprise

I wrote about the effects of post-dated data flowing into the OpsMgr db in an earlier post, The case of post dated monitoring data. There I focused on health state issues caused by off-time data, but I hadn't realized until today that post-dated data affects discovery as well, in a more subtle way. In fact I didn't notice the stale discovery data until I updated to R2 CU1 and checked the agent update status. To my dismay some agents, even though correctly updated, were not reflecting their status in the console. After a quick troubleshooting session I found that the same mechanism used for health state data applies to discovery data. Precisely, every time new discovery data flows in, it gets validated by the dbo.p_DiscoverySourceUpsert stored procedure. The validation is based on TimeGeneratedOfLastSnapshot: if the incoming data is newer than the recorded snapshot the snapshot is updated, otherwise the data gets discarded. And this was the case here: during the w32time incident some data was discovered by the agents with a future timestamp, and from then on all newly discovered data has been discarded. Not easy to detect.
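To make the lock-out concrete, here is a minimal Python sketch of the rule as I understand it. It is a toy model, not the code of dbo.p_DiscoverySourceUpsert: the class `DiscoverySourceModel` and its `upsert` method are hypothetical names, and whether data with an equal timestamp is accepted is an assumption on my part.

```python
from datetime import datetime, timedelta


class DiscoverySourceModel:
    """Toy model of one DiscoverySource row (hypothetical, for illustration only)."""

    def __init__(self, last_snapshot: datetime):
        self.time_generated_of_last_snapshot = last_snapshot

    def upsert(self, time_generated: datetime) -> bool:
        """Accept incoming discovery data only if it is newer than the stored snapshot."""
        if time_generated > self.time_generated_of_last_snapshot:
            self.time_generated_of_last_snapshot = time_generated
            return True
        return False  # older data is silently discarded (equal is assumed discarded too)


now = datetime(2010, 2, 4, 12, 0)
# Snapshot written while the clock was wrong: roughly ten years in the future.
row = DiscoverySourceModel(last_snapshot=now + timedelta(days=3650))

accepted_before_fix = row.upsert(now)       # correctly timestamped discovery data
row.time_generated_of_last_snapshot = now   # the manual reset to "now"
accepted_after_fix = row.upsert(now + timedelta(minutes=5))

print(accepted_before_fix, accepted_after_fix)
```

Until the stored timestamp is brought back to the present, every correctly timestamped discovery loses the comparison, which is why nothing shows up as an error.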

As usual the solution has been an unsupported database mod. First, to see if you have any post-dated data, run the following query against the OpsMgr db:

select * from   [dbo].[DiscoverySource] d

inner join BaseManagedEntity B  on d.BoundManagedEntityId = B.BaseManagedEntityId

where d.IsDeleted = 0 and B.IsDeleted = 0 and TimeGeneratedOfLastSnapshot > GETUTCDATE()

Then you can get rid of the issue by updating the same table, resetting TimeGeneratedOfLastSnapshot to "now" so that all newly generated discovery data has a chance to be newer than the snapshot in the db and hence gets recorded:

update [dbo].[DiscoverySource]  set TimeGeneratedOfLastSnapshot = GETUTCDATE() from [dbo].[DiscoverySource] d inner join BaseManagedEntity B  on d.BoundManagedEntityId = B.BaseManagedEntityId

where d.IsDeleted = 0 and B.IsDeleted = 0 and TimeGeneratedOfLastSnapshot > GETUTCDATE()

 

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

February 4, 2010 at 7:31 pm

Posted in Bug, Debugging, SCOM

Rollup monitors do not roll up

I've been affected by this issue (rollups that do not roll up) since my very first installation of OpsMgr. With R2 CU1 I was expecting a definitive fix; alas, this is not the case.

What am I talking about? Take a look

clip_image015image

 

This issue seems to be especially related to gateway-managed agents and got worse with CU1. In my experience the issue manifests itself when there's a dependency rollup monitor in the health model for the entity. A typical case is the Health Service Watcher class, where the general health depends on the health of the Health Service class.

This nasty issue seems related to some race or timing condition that occurs when state change events reach the hosting health service and it needs to recalculate the dependency monitor. This timing condition is not always easy to reproduce: for a couple of days I was able to repro the issue in a very simple lab with just one agent, but then the same repro steps stopped producing it. In my production environment (400 agents), whenever a gateway stays down for a while I always have the problem.

If you want to give it a try, the repro steps are easy in a lab with just one agent managed by one gateway (using the monitor for the ADT forwarder service, "Audit Collection Service"):

1) Stop the adtagent service and wait until your health service watcher turns red in console

2) Stop the gateway serving your agent and wait until the agent turns "gray" (too many failed heartbeats). This is a must: if the agent is not marked as unreachable everything works fine

3) Start the adtagent service and wait, say, 60 seconds, just to be sure the local agent has had a chance to recalculate the monitor locally

4) Start the gateway and wait for the data to flow in. Et voilà: the agent health is green but the health service watcher remains red, with the dependency rollup for agent availability red even though all the dependent monitors are green

 image

I tried to dig into this issue to find a workaround. I first started with PowerShell / SDK, but since rollup monitors won't respond to resets (at least in my environment) I turned to good old T-SQL and developed a recalculate-health procedure, just to find out that the culprit is not the db but the local agent cache (in the case of the Health Service Watcher, the RMS cache).

I observed that entity health state is persisted by the hosting Health Service (HS) in its local cache (table HEALTH-[MGGUID]). The health state persisted locally differs from the database picture (state change events missing?). Comparing Marius' runtime health explorer (left) with the database view (right) makes this evident:

image

A dirty (and not definitive) solution is to delete the health service state directory on the RMS; in this way all rollups hosted by the RMS are recalculated. I had to implement this horrible workaround once a day with a scheduled task to keep our agent health view clean. Obviously this is fine for entities managed by the RMS, but if we have entities with dependency monitors managed by other HSs I suspect we can hit the same issue and have to reset those caches as well. Hopefully the team will provide us a definitive fix for this.

In the end my hypothesis is that state change events are lost in certain cases (agent unreachable?) and/or not all state change events reach the database. Take a look at the following screenshots:

clip_image002

Rollup changed to warning on 2.19 on root monitor

clip_image004

Roll up turned to green at same time for Availability rollup monitor

clip_image006

Caused by the HealthService dependency rollup (the target is the watcher)

clip_image008

But no state change events for contributing monitors (interesting uh?)

clip_image010

Performance rollup turned from yellow to green at same time  2.19 (here comes the yellow)

clip_image012

Due to Performance dependency rollup state change

clip_image014

With no state change events in contributing monitors

clip_image015

Please let me know if you have the same issue or a better solution to keep your health state view consistent.

– Daniele

Written by Daniele Grandini

February 4, 2010 at 6:03 pm

Posted in Bug, SCOM

Eventually we got cumulative 1 for R2

It will resolve the cluster issues I reported on and the annoying issue with health state mismatch, among others. Strongly recommended, after proper testing.

Cumulative Update 1 for System Center Operations Manager 2007 R2

Cumulative Update 1 contains a number of fixes for the Operations Manager 2007 R2 release. A number of fixes require manual steps to install. See Knowledgebase article 974144 for details of included fixes and installation steps.

CU1 KB article http://support.microsoft.com/kb/974144

For an Operations Console only installation (i.e. your operators' workstations) choose "Run Server Update"; this will update your console files.

image

– Daniele

Written by Daniele Grandini

January 16, 2010 at 12:34 pm

Posted in Failover cluster, KB, SCOM

Fighting against ACS queries

As many of you know, the Audit Collection Service (ACS) is an OpsMgr component aimed at collecting events from security event logs. Many design and capacity planning guides focus on event insertion rates, but this is just one side of the coin: once events are collected they are pretty useless if you're not able to query them.

This installment is about efficiently querying the ACS database and the related caveats. I will present several execution scenarios and a few optimizations that can be applied on the fly without changing the ACS database schema.

My test bed was a production environment with a 300 GB ACS database hosted on an 8-core, 8 GB SQL 2008 SP1 x64 box running Windows Server 2003 x64 with an 8-spindle RAID 1+0 array. The number of online partitions (days) was 61.

Every execution time is the average of 3 runs; the base selection time range was 24 hours with a potential target of 16 million rows.

ID | Rows returned | select * time | select count(*) time | View | Notes
1 | 2393K | 5'37" | 10" | dbo.dvAll5_GUID | Selecting on about 30 different event ids using an IN clause
2 | 2393K | 35'17" | 17" | AdtServer.dvAll5 |
3 | 600K | 1'05" | 1" | dbo.dvAll5_GUID | Selecting on a single event id
4 | 2393K | 3'26" | 12" | AdtServer.dvHeader | Selecting on about 30 different event ids using an IN clause
5 | 926 | 2'41" | n.a. | AdtServer.dvHeader | Same selection filtered by user name

 

From the above table we can infer a few conclusions:

  1. The time difference between select * and select count(*) is, for the most part, due to the fully normalized db schema. The more fields are returned, the more joins need to be resolved, so it is generally a bad idea to use select *; return just the fields you need. As a corollary, using just the Header views is better than the All5 or All views, but this is a well documented fact so I won't argue further.
  2. The huge time difference between targeting the single partition view (i.e. queries 1 and 3) and the generic full view (i.e. query 2) is due to the fact that SQL needs to check every single partition (61 in my case) for relevant data. This implies that your query execution time is influenced not only by the selectivity of your where clause, but by the number of online partitions as well, regardless of the time range selected. I noticed that execution time grows more than linearly (even if I cannot say exponentially) with the number of partitions. With a small addition to the ACS db schema I've been able to reduce the execution time of query 2 to 5'07", in line with query 1. Unfortunately I've been asked by the product team not to share this hack since they have to verify it won't have any side effects. To give proper credit, the idea for the mod arose after an insightful discussion with a colleague of mine, Alessandro Nostini, on database partitioning for a software project of ours.
  3. The execution times are meaningful only for pure SQL queries. If you use these queries in a SQL Server Reporting Services report you must take into account some more factors, not least the number of rows returned. If you're curious about SSRS optimizations and troubleshooting, I found the following posts really interesting:
    1. <http://blogs.msdn.com/robertbruckner/archive/2009/01/05/executionlog2-view.aspx>
    2. Pet Peeve: Slow Reports <http://blogs.msdn.com/deanka/archive/2009/01/13/pet-peeve-slow-reports.aspx>
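
To put the table in perspective, the ratios behind the first two conclusions can be worked out with a few lines of Python. The timings are copied from the table above; the helper and variable names are mine and purely illustrative.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds timing from the table into seconds."""
    return minutes * 60 + seconds


# select * timings by query ID (rows 1 to 4 of the table)
select_star = {
    1: to_seconds(5, 37),    # partition view, ~30 event ids
    2: to_seconds(35, 17),   # full view, ~30 event ids
    3: to_seconds(1, 5),     # partition view, single event id
    4: to_seconds(3, 26),    # header view, ~30 event ids
}
# select count(*) timings for the same queries, already in seconds
select_count = {1: 10, 2: 17, 3: 1, 4: 12}

# Cost of going through the full view instead of the single partition view
full_vs_partition = select_star[2] / select_star[1]

# Cost of returning every column instead of just counting rows
star_vs_count = {qid: select_star[qid] / select_count[qid] for qid in select_star}

print(f"full view vs partition view: {full_vs_partition:.1f}x slower")
for qid, ratio in sorted(star_vs_count.items()):
    print(f"query {qid}: select * is {ratio:.1f}x slower than count(*)")
```

On these numbers the full view costs roughly six times the partition view for the same rows, while the gap between select * and count(*) ranges from about 17x on the header view to about 125x on the full view.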

I include the queries I used for your reference. As you can see, we use a lookup database to avoid hard-coding event ids inside the SQL statements; the lookup just returns a list of event ids.

-- 1. 16M rows partition
select COUNT(*) from dbo.dvAll5_19d13be1_6975_4cad_8830_b4487a6c5427
where EventId in (select EventId from ProgelACSLookup.dbo.SecurityEventLookup where Category in ('Logon', 'Logoff') and Type in (8,16))
AND CollectionTime between '2009-11-29' and '2009-11-30'
-- 10"

select * from dbo.dvAll5_19d13be1_6975_4cad_8830_b4487a6c5427
where EventId in (select EventId from ProgelACSLookup.dbo.SecurityEventLookup where Category in ('Logon', 'Logoff') and Type in (8,16))
AND CollectionTime between '2009-11-29' and '2009-11-30'
-- 5'37"

-- 2. using the full view

select COUNT(*) from AdtServer.dvAll5
where EventId in (select EventId from ProgelACSLookup.dbo.SecurityEventLookup where Category in ('Logon', 'Logoff') and Type in (8,16))
AND CollectionTime between '2009-11-29' and '2009-11-30'
-- 17"

select * from AdtServer.dvAll5
where EventId in (select EventId from ProgelACSLookup.dbo.SecurityEventLookup where Category in ('Logon', 'Logoff') and Type in (8,16))
AND CollectionTime between '2009-11-29' and '2009-11-30'
-- 35'17"

-- 3. single eventid
select * from dbo.dvAll5_19d13be1_6975_4cad_8830_b4487a6c5427
where EventId in (4624)
AND CollectionTime between '2009-11-29' and '2009-11-30'
-- 1'

-- from header only
declare @user nvarchar(50)
Set @user='myadmin'
select * from AdtServer.dvHeader ACS
where EventId in (select EventId from ProgelACSLookup.dbo.SecurityEventLookup where Category in ('Logon', 'Logoff') and Type in (8,16))
AND CollectionTime between '2009-11-29' and '2009-11-30'
-- the three OR branches are wrapped in one pair of parentheses so that
-- the EventId and CollectionTime filters above apply to all of them
AND     ((ACS.EventID IN (528, 538, 540,529,531) AND ACS.HeaderUser Like @user AND ACS.HeaderUser not like '%$%')
        OR
        (ACS.EventID IN (4624,4634,4625) AND ACS.TargetUser Like @user AND ACS.TargetUser not like '%$%')
        OR
        (ACS.EventID >27000 AND ACS.PrimaryUser Like @user))
-- 2'55"

-- 4. simplified
-- @user, @startdate and @endDate must be declared and set before running this batch

select * from AdtServer.dvHeader ACS
where EventId in (select EventId from ProgelACSLookup.dbo.SecurityEventLookup where Category in ('Logon', 'Logoff') and Type in (8,16))
AND CollectionTime between @startdate and @endDate
AND     (ACS.HeaderUser = @user     OR
        ACS.TargetUser = @user
        OR
        ACS.PrimaryUser = @user)

-- 3'26"

declare @query nvarchar(4000)
Set @query=N'
select * from AdtServer.dvHeader ACS
where EventId in (select EventId from ProgelACSLookup.dbo.SecurityEventLookup where Category in (''Logon'', ''Logoff'') and Type in (8,16))
AND CollectionTime between ''2009-11-30'' and ''2009-12-02''
AND     (ACS.HeaderUser = N''myadmin''     OR
        ACS.TargetUser = N''myadmin''
        OR
        ACS.PrimaryUser = N''myadmin'')'

execute (@query)

-- 3'26"

– Daniele

Written by Daniele Grandini

January 14, 2010 at 8:30 pm

Posted in ACS, SCOM

Exchange 2010 monitoring with OpsMgr

Last week I attended an interesting session on Exchange 2010 monitoring and backup with System Center, hosted by the System Center Influencer guys (https://connect.microsoft.com/SystemCenterCommunity?wa=wsignin1.0).

Two main topics were covered:

  1. The new Exchange 2010 Management Pack for OpsMgr
  2. Using DPM to backup Exchange 2010

In this post I will focus on the first topic, but I want to add that you should take a serious look at DPM, and when you do, please try to take a fresh perspective. Do not try to map what you're currently doing with your backup and recovery solution onto DPM; rather, learn the DPM paradigm and look at your data assets from this new perspective. It is a new backup paradigm and you'll need to do things differently… the worst DPM implementations I've seen were the ones where DPM was used like "traditional" backup software. Not the way to go.

The only pitfall I must make you aware of is that you need to wait for the next version of DPM to protect Exchange 2010. This is a pity given the tons of functionality in DPM and the fact that current DPM users must delay Exchange 2010 adoption until that release.

Returning to OpsMgr, the MP was released soon after Exchange 2010 shipped, so in this case we have no roadblock: we can adopt Exchange 2010 and monitor it with our current OpsMgr infrastructure.

So what's new in this MP:

· Alert correlation

· Alert classification

· Mail flow statistics reporting

· Full protocol synthetic transaction coverage

· Service-oriented reporting

· Exchange aware availability modeling

image 

The most interesting stuff in there is the new correlation engine: a stand-alone service that helps reduce noise and is in charge of generating alerts in OpsMgr via the SDK. A brand new approach in MPs and something I will return to in future posts.

– Daniele

Written by Daniele Grandini

December 7, 2009 at 12:35 pm

Posted in Exchange 2010, MP, SCOM

Decommissioning a gateway server when used with Sites

Our outsourced service based on OpsMgr uses a gateway infrastructure to monitor customer sites. Every customer has an appropriate number of gateways reporting to our data center. We used the Site concept to map our customers, so every customer is a Site. When we approve a gateway we specify the Site name. From time to time gateways need to be replaced, moved or added.

Replacing a gateway with a new one is the main topic of this article, and no, it's neither easy nor straightforward.

Let’s recap the scenario:

  • gateway server needs to be replaced
  • agents are assigned to the gateway at install time, no Active Directory integration
  • gateway is associated to a Site at approval time

The common action plan is the following:

  1. add a new gateway and associate it to the same Site as the old one
  2. from OpsMgr console move the agents to the newly added gateway
  3. uninstall the old gateway
  4. remove the gateway approval with microsoft.enterprisemanagement.gatewayapprovaltool and action=delete

Adding a new gateway is a no brainer, the product documentation is clear enough so I won’t spend time on it. Moving the agent to a new MS is straightforward as well, and you can do it from the UI, right? Wrong.

Let's start with a quick background on how Management Servers (MS) and agents work together; remember that a gateway is a type of MS. When you change the MS for an agent, the old MS removes the agent from its managed agents while the new one adds the agent to its authorized list. If an agent is not in the authorized list, the MS won't respond to it and will log an event saying the agent is not authorized for this MS. Agents get their configuration from the assigned MS, and an MS switch is part of this configuration. Now the question is: how can an agent learn that it should refer to a new MS? In this situation it simply cannot. We have locked our agent out. The agent cannot receive any configuration from the old MS because it is no longer in its authorized list, so it never learns that it should get its configuration from the new MS. I saw a few agents trying to refer to the RMS in this situation, logging something like "failing over to secondary MS (RMS)". I don't know if this is the expected behavior to avoid this deadlock, but for sure it cannot work for gateways: gateways sit in untrusted domains and their managed agents are not able to reach the RMS.

If you are in this situation, the first thing to do is to regain control of your agents; to do so, reset your agents to the old gateway.

Now that we're in control once again, let's try a different action plan:

  1. add the new MS to the secondary MS of the agents
  2. Wait for the new configuration to propagate (21025 event)
  3. Switch the primary and secondary server so that the new MS will become the primary one.
  4. Remove the old MS from the agents

This can be done via Active Directory integration or via powershell. Since we cannot use AD, let’s share a simple powershell script:

# old MS (the gateway being replaced)
$msp = Get-ManagementServer | where {$_.Name -eq 'gw1.somedomain.it'}
# new MS (the replacement gateway)
$ms = Get-ManagementServer | where {$_.Name -eq 'gw2.somedomain.it'}
# generic List<ManagementServer> holding the failover servers
$failoverServers = New-Object 'System.Collections.Generic.List`1[[Microsoft.EnterpriseManagement.Administration.ManagementServer,Microsoft.EnterpriseManagement.OperationsManager,Version=6.0.4900.0,Culture=neutral,PublicKeyToken=31bf3856ad364e35]]'
$failoverServers.Add($ms)
$agents = $msp.GetAgentManagedComputers()
foreach ($a in $agents)
{
    # keep the old gateway as primary and add the new one as failover
    $a.SetManagementServers($msp, $failoverServers)
}

The script adds the new gateway ($ms) as a failover server to the agents managed by the old one ($msp). You can then test the move from the UI. Once you have moved a few agents and checked they're working as expected with the new gateway, you can move the other agents over. But first remember to give the agents enough time to get the new configuration (they must learn the new failover MS list); if not, you will lock the agents out once again.

# wait for 21025 on every MS and agent and reset the primary GW and the failover MS list
$agents = $msp.GetAgentManagedComputers()

foreach ($a in $agents)
{
    $a.SetManagementServers($ms, $null)
}

Steps 1 and 2 completed.
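
To see why the order of operations matters, the lock-out and the staged move can be modelled in a few lines of Python. This is a deliberately simplified sketch: `ManagementServer`, `Agent` and `set_management_servers` are hypothetical stand-ins, not SDK classes, and the wait for the configuration to propagate is collapsed into an immediate `sync` call.

```python
class ManagementServer:
    """Hypothetical stand-in for a management server or gateway."""

    def __init__(self, name):
        self.name = name
        self.authorized = set()  # names of the agents this MS will answer


class Agent:
    """Hypothetical stand-in for an agent and the configuration it holds locally."""

    def __init__(self, name, primary):
        self.name = name
        self.primary = primary
        self.failovers = []
        primary.authorized.add(name)

    def sync(self, desired_primary, desired_failovers):
        """Fetch the new configuration from any MS the agent already knows."""
        for ms in [self.primary] + self.failovers:
            if self.name in ms.authorized:
                self.primary = desired_primary
                self.failovers = list(desired_failovers)
                return True
        return False  # nobody the agent knows will talk to it: locked out


def set_management_servers(agent, primary, failovers, all_servers):
    """Server-side change: authorized lists are updated at once, then the agent syncs."""
    for ms in all_servers:
        ms.authorized.discard(agent.name)
    for ms in [primary] + list(failovers):
        ms.authorized.add(agent.name)
    return agent.sync(primary, failovers)


def direct_move():
    """Move the agent straight from gw1 to gw2, as the UI does."""
    gw1, gw2 = ManagementServer("gw1"), ManagementServer("gw2")
    agent = Agent("agent1", gw1)
    return set_management_servers(agent, gw2, [], [gw1, gw2])


def staged_move():
    """Add gw2 as failover first, then promote it to primary."""
    gw1, gw2 = ManagementServer("gw1"), ManagementServer("gw2")
    agent = Agent("agent1", gw1)
    learned = set_management_servers(agent, gw1, [gw2], [gw1, gw2])
    switched = set_management_servers(agent, gw2, [], [gw1, gw2])
    return learned and switched


print("direct move reaches the agent:", direct_move())
print("staged move reaches the agent:", staged_move())
```

In the direct move the only server the agent knows has already dropped it from the authorized list; in the staged move the agent learns about the new gateway while the old one is still willing to talk to it.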

Removing the gateway binaries from the old gateway is a matter of running the uninstall procedure, by the way you can simply turn off the gateway and you’ll get the same effect. In either case you must manually remove the gateway from the approved gateways list of your OpsMgr management group.

Let’s move on to step 4. Using the gateway approval tool with /Action=delete we are supposed to be able to remove the gateway:

image

Interesting, isn’t it? We just removed all the agents from the gateway management space, but still we got this error, and the gateway stays there.

Using a SQL trace it is simple to track down the steps performed by the tool:

  1. it looks for a Microsoft.SystemCenter.GatewayManagementServer with the given gateway name
  2. it looks for relationships of type Microsoft.SystemCenter.HealthServiceCommunication and Microsoft.SystemCenter.HealthServiceShouldManageEntity related to the gateway

Replaying the queries, it becomes evident that the gateway still has a relationship of type HealthServiceShouldManageEntity with the associated Site. To remove the gateway we must remove that relationship, but we have no supported way to do that, at least as far as I know. So this is my *unsupported* way to decommission a gateway when it has been associated with a Site.

First check whether we really are in the situation where the only relationship still in place is the one related to the Site (if not, we must check what's left). For this reason I will split the query in two: the first part returns a list of relationships and the second marks the relationship as deleted:

declare @nodeHS nvarchar(255)
Set @nodeHS=N'Microsoft.SystemCenter.HealthService:gateway.somedomain.it'

SELECT *
FROM dbo.RelationshipGenericView
WHERE RelationshipGenericView.[MonitoringRelationshipClassId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
AND RelationshipGenericView.[IsDeleted] = 0
AND RelationshipGenericView.[SourceMonitoringObjectId]
IN (select BaseManagedEntityId from BaseManagedEntity where FullName = @nodeHS)

If we have just one relationship left, the one with the Site, we can mark the relationship as deleted (unsupported, you know):

update Relationship set IsDeleted=1 where RelationshipId='Relationship GUID returned by the previous query'

In conclusion, decommissioning and replacing a gateway server has some caveats, if you add to the equation a site relationship then you need to perform the above hack.

Let me know if this works for you.

For future reference I list the queries performed by the gateway approval tool:

exec sp_executesql N'-- MTV_SelectProperty_c1721bcc-35f7-5a49-5d5f-6880687c3d48 <ManagedTypeId,PrincipalName0>

SELECT [MTV_HealthService].[BaseManagedEntityId], [MTV_HealthService].[DisplayName_55270A70_AC47_C853_C617_236B0CFF9B4C], [MTV_HealthService].[ActionAccountIdentity], [MTV_HealthService].[ActiveDirectoryManaged], [MTV_HealthService].[AuthenticationName], [MTV_HealthService].[CreateListener], [MTV_HealthService].[HeartbeatEnabled], [MTV_HealthService].[HeartbeatInterval], [MTV_HealthService].[InstalledBy], [MTV_HealthService].[InstallTime], [MTV_HealthService].[IsAgent], [MTV_HealthService].[IsGateway], [MTV_HealthService].[IsManagementServer], [MTV_HealthService].[IsManuallyInstalled], [MTV_HealthService].[IsRHS], [MTV_HealthService].[MaximumQueueSize], [MTV_HealthService].[MaximumSizeOfAllTransferredFiles], [MTV_HealthService].[PatchList], [MTV_HealthService].[Port], [MTV_HealthService].[ProxyingEnabled], [MTV_HealthService].[RequestCompression], [MTV_HealthService].[Version], [MTV_HealthService].[AutoApproveManuallyInstalledAgents_9189A49E_B2DE_CAB0_2E4F_4925B68E335D], [MTV_HealthService].[ManagementServerSCP_9189A49E_B2DE_CAB0_2E4F_4925B68E335D], [MTV_HealthService].[NumberOfMissingHeartBeatsToMarkMachineDown_9189A49E_B2DE_CAB0_2E4F_4925B68E335D], [MTV_HealthService].[ProxyAddress_9189A49E_B2DE_CAB0_2E4F_4925B68E335D], [MTV_HealthService].[ProxyPort_9189A49E_B2DE_CAB0_2E4F_4925B68E335D], [MTV_HealthService].[RejectManuallyInstalledAgents_9189A49E_B2DE_CAB0_2E4F_4925B68E335D], [MTV_HealthService].[UseProxyServer_9189A49E_B2DE_CAB0_2E4F_4925B68E335D], [MTV_HealthService].[WebConsoleUrl_F9069CA9_A790_E274_0C2C_DE210E57F67C], [MTV_HealthService].[SiteId_CECAAFDA_33B6_B628_0CDA_445E21B7291D], [MTV_HealthService].[SiteName_CECAAFDA_33B6_B628_0CDA_445E21B7291D], [MTV_HealthService].[PrincipalName] FROM dbo.[MTV_HealthService]

INNER JOIN dbo.[TypedManagedEntity] AS TME ON TME.[BaseManagedEntityId] = [MTV_HealthService].[BaseManagedEntityId]

WHERE (MTV_HealthService.[PrincipalName] LIKE @PrincipalName0) AND (((TME.[ManagedTypeId] = @ManagedTypeId)))',N'@ManagedTypeId uniqueidentifier,@PrincipalName0 ntext',@ManagedTypeId='C1721BCC-35F7-5A49-5D5F-6880687C3D48',@PrincipalName0=N'gateways.somedomain.it'

-- TypeID 'C1721BCC-35F7-5A49-5D5F-6880687C3D48' = Microsoft.SystemCenter.GatewayManagementServer

exec sp_executesql N'-- RelationshipWithCriteria <TargetEntityId0,IsDeleted0,RelationshipTypeId0>

SELECT [Relationship].[RelationshipId], [Relationship].[TargetEntityId], [Relationship].[IsDeleted] FROM dbo.Relationship

WHERE Relationship.[TargetEntityId] = @TargetEntityId0 AND Relationship.[IsDeleted] = @IsDeleted0 AND Relationship.[RelationshipTypeId] = @RelationshipTypeId0',N'@TargetEntityId0 uniqueidentifier,@IsDeleted0 bit,@RelationshipTypeId0 uniqueidentifier',@TargetEntityId0='FC75E426-26C5-B237-FE9F-F14F540CCB0E',@IsDeleted0=0,@RelationshipTypeId0='37848E16-37A2-B81B-DAAF-60A5A626BE93'

-- Relationship Microsoft.SystemCenter.HealthServiceCommunication

exec sp_executesql N'-- RelationshipWithCriteria <SourceEntityId0,IsDeleted0,RelationshipTypeId0>

SELECT [Relationship].[RelationshipId], [Relationship].[SourceEntityId], [Relationship].[IsDeleted] FROM dbo.Relationship

WHERE Relationship.[SourceEntityId] = @SourceEntityId0 AND Relationship.[IsDeleted] = @IsDeleted0 AND Relationship.[RelationshipTypeId] = @RelationshipTypeId0',N'@SourceEntityId0 uniqueidentifier,@IsDeleted0 bit,@RelationshipTypeId0 uniqueidentifier',@SourceEntityId0='FC75E426-26C5-B237-FE9F-F14F540CCB0E',@IsDeleted0=0,@RelationshipTypeId0='2F71C644-E092-B80A-040B-5C81BA1EC353'

-- Relationship Microsoft.SystemCenter.HealthServiceShouldManageEntity

– Daniele

Written by Daniele Grandini

December 7, 2009 at 12:04 pm

Posted in Bug, Gateway, SCOM

The case of post dated monitoring data

A few weeks ago a customer of ours was hit by a time issue: the internal time reference server jumped to anno Domini 2020. Since this happened during the weekend it took a few hours to be fixed; in the meantime the OpsMgr agents did their job posting data to the OpsMgr infrastructure. The net result was a bunch of unhealthy monitors with last changed time set to 2020. Annoying, I thought: I need to reset all the broken monitors via PowerShell. Alas, this was not only annoying but blocking as well: the monitors would neither reset nor change state in any case.

First consideration: the OpsMgr data access layer should block any data insertion with too large a time skew from its own date/time reference (10 to 15 minutes should be the maximum threshold, IMHO).
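
The kind of guard I have in mind could look like the following Python sketch. Nothing like this exists in OpsMgr as far as I know: the `is_acceptable` function and the `MAX_SKEW` constant are hypothetical, and rejecting only future skew (late data is normal when an agent or gateway has been offline) is my own design choice for the example.

```python
from datetime import datetime, timedelta

# Threshold suggested in the text; an assumption, not an OpsMgr setting.
MAX_SKEW = timedelta(minutes=15)


def is_acceptable(time_generated: datetime, server_now: datetime,
                  max_skew: timedelta = MAX_SKEW) -> bool:
    """Reject data stamped further in the future than max_skew relative to the server clock."""
    return time_generated - server_now <= max_skew


server_now = datetime(2009, 11, 1, 12, 0)

slightly_ahead = is_acceptable(server_now + timedelta(minutes=10), server_now)
from_2020 = is_acceptable(datetime(2020, 1, 1), server_now)
delayed_two_days = is_acceptable(server_now - timedelta(days=2), server_now)

print(slightly_ahead, from_2020, delayed_two_days)
```

With a check like this at insertion time, the data stamped 2020 would have been refused at the door instead of poisoning the stored state.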

Anyway, time for some reverse engineering once again.

First of all, a few queries to identify the bogus state data. We have two tables involved here, StateChangeEvent and State. The former collects all state change events, the ones you can check in health explorer; the latter reports the last known state for any given managed entity / monitor pair.

Easy enough, let's check for all data updated after December 31st:

select * from dbo.StateChangeEvent
where TimeGenerated > '12-31-2009'

select ME.FullName, M.MonitorName, State.* from dbo.State with (nolock)
inner join dbo.BaseManagedEntity ME with (nolock) on ME.BaseManagedEntityId=State.BaseManagedEntityId
inner join dbo.Monitor M with (nolock) on M.MonitorId=State.MonitorId
where State.LastModified > '12-31-2009'

Obviously my first thought was to modify the LastModified field, but here we're in the unsupported realm, and before any mod a further analysis of the inner workings is needed. The core stored procedure for any state change turned out to be

PROCEDURE [dbo].[p_StateChangeEventProcess]
(      
    @BaseManagedEntityId uniqueidentifier,
    @EventOriginId uniqueidentifier,
    @MonitorId uniqueidentifier,
    @NewHealthState tinyint,
    @OldHealthState tinyint,    
    @TimeGenerated datetime,
    @Context nvarchar(max) = NULL
)

this one in turn calls

PROCEDURE [dbo].[p_StateUpsert]
(
    @BaseManagedEntityId uniqueidentifier,
    @MonitorId uniqueidentifier,
    @HealthState tinyint,
    @LastModified datetime
)

and if p_StateUpsert returns with success it will insert a row in the StateChangeEvent table.

p_StateUpsert, among other checks, performs a check on the state update date/time: if it is earlier in the timeline than the last time the monitor state was updated, the state change is discarded. This makes sense since state changes are not guaranteed to arrive in chronological order. At the same time, without a check on time skew we can have a DoS here.

Anyway from my analysis the LastModified field can be safely changed (still unsupported realm):

update dbo.StateChangeEvent set TimeGenerated=TimeAdded
where TimeGenerated > '12-31-2009'

update dbo.State set LastModified = GETUTCDATE()
where State.LastModified > '12-31-2009'

From this change on, state changes will start to flow in again.

Issues: monitors need to be reset, or you must wait for the first state change for them to be updated, or you could use Marius' utility (Tool - OpsMgr 2007 - RuntimeHealthExplorer), or you could use a PowerShell script to reset all the post-dated monitors. The basic statements are:

$obj = Get-MonitoringObject -id:<<basemanagedentityid from previous queries>>

$obj.ResetMonitoringState([guid]'<<monitorid from previous queries>>')

 

Last warning: if you reset the monitor from the UI and from a Watcher view, the new health state won't roll up (at least in my env). For example, if you reset any unit monitor related to a HealthService starting from the Health Explorer for the related HealthServiceWatcher, the unit monitor will reset but the new status won't roll up. If you do the same reset from the HealthService Health Explorer view, it will roll up.

– Daniele

Written by Daniele Grandini

November 16, 2009 at 8:51 am

Posted in Bug, Debugging, SCOM

A rollup you don’t want to miss

If you're still running OpsMgr 2007 SP1 you definitely need to apply the rollup package that was released yesterday (finally): Update Rollup for Operations Manager 2007 Service Pack 1 (KB971541).

It should put an end to the patching blues I complained about so many times.

Hopefully a similar rollup will be delivered for R2 sooner rather than later, from what I know. Some issues fixed in SP1 are still present in R2.

In the end my advice is: if you're still on SP1, it is still a good idea to move to R2.

– Daniele


Written by Daniele Grandini

November 7, 2009 at 12:02 pm

Posted in KB, SCOM

Failover cluster monitoring – quick insight

While I was trying to understand why my cluster nodes wouldn't dismiss, I dug a little deeper into cluster monitoring with OpsMgr. As usual I have no access to the source code, so some of my assumptions may be wrong.

The core logic behind cluster discovery and management is coded (natively) in mommodules.dll.

image

The dll exports:

  • ClusterGroupStateChange the name gives us some clues
  • ClusterDiscovery this one is used by the discovery workflow

Basically every cluster node discovers every resource group (Virtual Server) in the cluster and establishes a relationship of type HealthServiceShouldManageEntity. This tells the OpsMgr infrastructure to route the workflows for any Virtual Server (VS) to every cluster node. In this scenario every cluster node receives all the workflows, even for the VSs it does not own at the moment. Obviously we just want the owning node to monitor its VS. Without some custom logic here we would have an issue; in fact the agent has built-in logic to understand which VSs it is supposed to manage (i.e. the ones it owns). On the passive nodes the workflows get unloaded. The second issue to face is the management of VS failover (i.e. when a resource group changes owning node). From what I understand the agent uses ClusterGroupStateChange to detect when a VS changes ownership; I measured a 60-second maximum delay from resource group failover to workflow reload on the proper node. So far so good: the agent (as we expect) is able to manage the VS wherever it is. I had a couple of cases where this was not working properly on SP1, and it was resolved by restarting the health service.
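
The behaviour described above can be pictured with a tiny Python model. The real logic lives in native code I cannot see, so this is only a sketch of my understanding: `ClusterNodeAgent`, `on_group_state_change` and the node and VS names are all hypothetical.

```python
class ClusterNodeAgent:
    """Hypothetical model of the agent on one cluster node."""

    def __init__(self, name):
        self.name = name
        self.loaded = set()  # virtual servers whose workflows are currently loaded

    def on_group_state_change(self, ownership):
        """Keep workflows loaded only for the virtual servers this node owns."""
        self.loaded = {vs for vs, owner in ownership.items() if owner == self.name}


def snapshot(nodes):
    """Return which virtual servers each node is monitoring right now."""
    return {node.name: sorted(node.loaded) for node in nodes}


# Every node receives the workflows for every VS; ownership decides what stays loaded.
ownership = {"SQLVS": "node1", "FILEVS": "node2"}
nodes = [ClusterNodeAgent("node1"), ClusterNodeAgent("node2")]

for node in nodes:
    node.on_group_state_change(ownership)
before = snapshot(nodes)

ownership["SQLVS"] = "node2"  # the SQLVS resource group fails over to node2
for node in nodes:
    node.on_group_state_change(ownership)
after = snapshot(nodes)

print(before)
print(after)
```

After the failover the workflows for SQLVS are unloaded on node1 and loaded on node2, which is the hand-over I measured at up to 60 seconds on a real cluster.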

One more thing to add: the VSs are managed by the health service as proxied systems. This has an important implication if you're an MP author: all the workflows you want to execute against a VS must be tagged as remotable="true"

image

If you miss this important requirement you’ll get event id 1207 after the agent reloads the workflows

Event Type: Warning

Event Source: HealthService

Event Category: Health Service

Event ID: 1207

Date: 10/17/2009

Time: 3:01:23 PM

User: N/A

Computer: ARES1

Description:

Rule/Monitor "QND.Test.Cluster.LogEvent" running for remote instance "XXXX.progel.org" with id:"{A16D2CDA-378D-E9AC-7913-404A9999BEEE}" will be disabled as it is not remotable. Management group "Progel Labs".

This is it, straightforward but useful if you need to debug issues on your clustered agents.

– Daniele

Written by Daniele Grandini

November 7, 2009 at 11:57 am

Posted in Failover cluster, SCOM

Windows Scheduled task MP on MP catalog

The Progel Windows Scheduled Task MP is now available from the MP catalog. We have received a single report that it has issues on extended character set locales (i.e. Traditional Chinese), but with too few details at the moment to assess whether it's our own bug or something external to us. If you try the MP and encounter any issue, drop an email to sst@progel.it.

– Daniele

Written by Daniele Grandini

October 31, 2009 at 12:17 pm

Posted in MP, SCOM