A rollup you don’t want to miss
If you’re still running OpsMgr 2007 SP1, you definitely need to apply the rollup package that was (finally) released yesterday: Update Rollup for Operations Manager 2007 Service Pack 1 (KB971541).
It should put an end to the patching blues I have complained about so many times.
From what I know, a similar rollup should be delivered for R2, hopefully sooner rather than later. Some issues fixed in SP1 are still present in R2.
In the end, my advice is that if you’re still on SP1 it is still a good idea to move to R2.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
Failover cluster monitoring – quick insight
While I was trying to understand why my cluster nodes could not be decommissioned, I dug a little deeper into cluster monitoring with OpsMgr. As usual I have no access to the source code, so some of my assumptions may be wrong.
The core logic behind cluster discovery and management is coded (natively) in mommodules.dll.
The dll exports:
- ClusterGroupStateChange: the name gives us some clues
- ClusterDiscovery: this one is used by the discovery workflow
Basically every cluster node discovers every resource group (Virtual Server, VS) in the cluster and establishes a relationship of type HealthServiceShouldManageEntity. This tells the OpsMgr infrastructure to route the workflows for any Virtual Server to every cluster node. In this scenario every cluster node receives all the workflows, even for a VS it does not own at the moment. Obviously we just want the owning node to monitor the proper VS. Without some custom logic we would have an issue here, so the agent has built-in logic to understand which VS it is supposed to manage (i.e. the ones it owns). On the passive nodes the workflows get unloaded.
The second issue to face is the management of VS failover (i.e. when a resource group changes its owning node). From what I understand, the agent uses ClusterGroupStateChange to detect when a VS changes ownership. I measured a maximum delay of 60 seconds from resource group failover to workflow reload on the proper node. So far so good: the agent (as we expect) is able to manage the VS wherever it is. I had a couple of cases where this was not working properly on SP1, and they were resolved by restarting the health service.
One more thing to add: the VSs are managed by the health service as proxied systems. This has an important implication if you’re an MP author: all the workflows you want to execute against a VS must be tagged as Remotable="true".
If you miss this important requirement you’ll get event id 1207 after the agent reloads the workflows:
Event Type: Warning
Event Source: HealthService
Event Category: Health Service
Event ID: 1207
Date: 10/17/2009
Time: 3:01:23 PM
User: N/A
Computer: ARES1
Description:
Rule/Monitor "QND.Test.Cluster.LogEvent" running for remote instance "XXXX.progel.org" with id:"{A16D2CDA-378D-E9AC-7913-404A9999BEEE}" will be disabled as it is not remotable. Management group "Progel Labs".
This is it, straightforward but useful if you need to debug issues on your clustered agents.
– Daniele
Windows Scheduled task MP on MP catalog
The Progel Windows Scheduled Task MP is now available from the MP catalog. We have received a single report that it has issues on extended character set locales (e.g. Traditional Chinese), but with too few details at the moment to assess whether it’s our own bug or something external to us. If you try the MP and encounter any issue, drop an email to sst@progel.it.
– Daniele
Disk performance reporting
Disk performance reporting and trending is probably the most difficult part of performance troubleshooting and capacity planning. Over the years I have used several counters, only to learn that none of them gives a concise and accurate answer on disk performance. Ever tried to use % Idle Time or % Disk Time? Or Avg. Disk Queue Length, for that matter? In the last few years I have standardized on Avg. Disk sec/Read, sec/Write and sec/Transfer as a single indicator of disk responsiveness. Generally I take 20 ms to 30 ms as a threshold that should not be exceeded on average.
However, even these counters are prone to errors (I don’t know when the OS team will change the perf counters architecture, but it will never be too early). First of all you must be aware of the following issues that arise with virtualized Windows 2003 servers with more than one core:
- http://blogs.technet.com/yongrhee/archive/2009/08/04/disk-performance-counters-are-high-are-you-running-into-the-usepmtimer-rtsc-time-drift-issue.aspx
- Programs that use the QueryPerformanceCounter function may perform poorly in Windows Server 2000, in Windows Server 2003, and in Windows XP
But issues are there to hit on Windows 2008 as well. Look at the following table that reports Avg Disk sec / Transfer on a Windows 2008 server:
As you can see, among “normal” values (0,xxx; the comma is the decimal separator in Italy) we have clearly bad ones. Obviously the server is not taking 93 or 3139 seconds (a little less than 1 hour) to execute an I/O on average. The presence of these bad values can wreak havoc on your reporting experience: the average of the above values is 285 seconds, which clearly cannot be the case.
I didn’t find a root cause for this behavior. I can observe it on several Windows 2008 servers, with a predominance of Hyper-V hosts. It can be hardware related or just a bug in the perf counter (IMHO both); in any case your reports are doomed.
The only thing I can advise is to change your SQL query to filter out obviously bad values. For example, filtering out response times above 2 seconds changes my average over the period to 0.029 seconds, or 29 ms, which denotes a fairly busy storage subsystem.
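The idea behind that filter can be sketched in a few lines of Python. This is only an illustration, not the SQL run against the data warehouse: the function name, the 2-second cut-off parameter and the sample list (apart from the 93 and 3139 second outliers quoted above) are assumptions made up for the example.

```python
def filtered_average(samples, max_valid_seconds=2.0):
    """Average the samples, ignoring values that cannot be real response times."""
    valid = [s for s in samples if 0.0 <= s <= max_valid_seconds]
    if not valid:
        return None  # nothing trustworthy left to average
    return sum(valid) / len(valid)


# Hypothetical Avg. Disk sec/Transfer samples, in seconds.
samples = [0.020, 0.030, 93.0, 0.040, 3139.0]

raw_average = sum(samples) / len(samples)   # dominated by the two bad samples
clean_average = filtered_average(samples)   # only the three plausible samples
```

With these made-up samples the raw average is in the hundreds of seconds, while the filtered one lands around 30 ms, which is the kind of difference described above.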
If I manage to find more info on this issue I’ll keep you posted, in the meantime take your reports on disk response time with a grain of salt.
– Daniele
R2 and cluster monitoring – issues and pitfalls
I think it’s time to wrap the various answers we can find on cluster monitoring and decommissioning in R2.
Failover clustering is a high availability solution and my motto is “if you want to keep it HA you need to monitor it”. So we all expect OpsMgr to do an excellent job in monitoring failover clusters. In some respects it does, but you must be aware of glitches that are still around.
I observed at least 4 issues with failover cluster monitoring:
- abnormal CPU usage on the cluster node and on the controlling management server
- simply put you cannot decommission a cluster in a supported way
- the agentless managed view in the administration space of the monitoring console is just useless
- cluster discovery is noisy
Abnormal CPU usage is caused by internal tasks taking the wrong route. In some cases the config service on the RMS thinks resources are on the wrong node (i.e. a node that does not own the resources) and sends health recalculation tasks to that node. In these cases you’ll see the following:
- the healthservice edb grows (we had examples of > 3GB); this growth is caused by tasks queuing up
- healthservice.exe on the affected node consumes up to one cpu core
- healthservice.exe on the primary management server registers an increased cpu usage
We’re waiting for a fix from PSS. At the moment the only workaround is a group failover, but hey, I’m supposed to run critical applications on my clusters, so a failover cannot be considered a workaround. Btw things can go bad again after a while, so I would call this a temporary patch that will break again sooner or later.
Decommissioning is another issue. The short story is that you can remove the agents from the cluster nodes, but you cannot remove them from your console. In other words you have zombies in the console.
There’s an *unsupported* workaround here http://ops-mgr.spaces.live.com/blog/cns!3D3B8489FCAA9B51!163.entry and a discussion thread here http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/10ffee08-b875-47af-b788-db07dbfa1b56.
See my own unsupported way later in this post; I’m not comfortable with the workaround I cited above.
I want to say that the following rapid publishing KB just doesn’t work in my case (and from what I found it should never work): OpsMgr 2007: How to decommission a cluster monitored by System Center Operations Manager 2007
I want to add that the release notes are not clear as well:
“You cannot uninstall or delete an agent from a node in a cluster
When you try to uninstall or delete an agent from a node in cluster, the following error message is displayed:
Agent is managing other devices and cannot be uninstalled. Please resolve this issue via Agentless managed view in Administration prior to attempting uninstall again.
Notice that the agent can be uninstalled from the node that is agentlessly managing the virtual servers. However, the agent cannot be uninstalled from the node that is managing the virtual servers.
Workaround: None at this time.” http://technet.microsoft.com/en-us/library/dd827187.aspx
The agentless managed view is useless in the sense that the results you see are unpredictable and the change proxy action won’t work. This is all related to how agentless management works for the UI.
An agentless managed Windows Computer (this is the class) is defined as a Windows Computer managed by a health service on another Windows Computer (using the HealthServiceShouldManageEntity relationship). This is the query the SDK runs against your live db, where the guid is the id for Microsoft.Windows.Computer:
exec sp_executesql N'-- AgentlessManagedDevicesByType <ManagedTypeId>
SELECT [T].[Id], [T].[Name], [T].[Path], [T].[FullName], [T].[DisplayName], [T].[IsManaged], [T].[IsDeleted], [T].[LastModified], [T].[TypedManagedEntityId], [T].[MonitoringClassId], [T].[TypedMonitoringObjectIsDeleted], [T].[HealthState], [T].[StateLastModified], [T].[IsAvailable], [T].[AvailabilityLastModified], [T].[InMaintenanceMode], [T].[MaintenanceModeLastModified], [PXH].[BaseManagedEntityId] AS [HealthServiceId], [PXH].[DisplayName] AS [ProxyAgentPrincipalName] FROM dbo.ManagedEntityGenericView AS T
INNER JOIN dbo.BaseManagedEntity AS BME
ON
BME.[BaseManagedEntityId] = T.[Id]
AND BME.[BaseManagedTypeId] = @ManagedTypeId
INNER JOIN dbo.Relationship AS R
ON R.[TargetEntityId] = T.[Id]
INNER JOIN dbo.BaseManagedEntity AS PXH
ON PXH.[BaseManagedEntityId] = R.[SourceEntityId]
WHERE ((
T.[IsDeleted] = 0 AND T.[TypedMonitoringObjectIsDeleted] = 0 AND R.[IsDeleted] = 0 AND
R.[RelationshipTypeId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
))',N'@ManagedTypeId uniqueidentifier',@ManagedTypeId='EA99500D-8D52-FC52-B5A5-10DCD1E9D2BD'
What happens for clusters is that we have two HealthServiceShouldManageEntity relationships discovered. One comes from an internal discovery source (in my case with guid “85AB926D-6E0F-4B36-A951-77CCD4399681”); I would call it the standard one, or the discovery source the SDK call SetProxyAgent works with. The other is discovered by the cluster management pack through the discovery rule “Microsoft.Windows.Cluster.Classes.Discovery”. This discovery is implemented as a native COM module in MOMModules.dll.
Then you must add that every cluster node discovers the entire hierarchy of cluster resources (bad, bad, bad, and a huge load on every node for complex cluster implementations), so every cluster virtual server is managed by every single node. Let’s take a simple example: a basic two node cluster (nodeA and nodeB) with just the cluster virtual server (CLUSTER). The Microsoft.Windows.Computer CLUSTER is discovered by both nodes, so nodeA and nodeB each have a HealthServiceShouldManageEntity relationship with CLUSTER with source Microsoft.Windows.Cluster.Classes.Discovery, plus another one with the standard discovery source.
So what happens here is that every cluster node ShouldManage every cluster Virtual Server, for this reason the proxy assignment you see in console is just unpredictable. The selection is based on HealthServiceShouldManageEntity relationship which returns one row for each cluster node for each cluster virtual server.
Using SetAgentProxy from the SDK (or change proxy in the UI) is useless because it resets just the standard discovery source and not the Microsoft.Windows.Cluster.Classes.Discovery one.
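To make the effect concrete, here is a small Python model of the two node example. It is only a sketch under stated assumptions: relationships are reduced to (node, virtual server, discovery source) tuples, "standard" is a placeholder name for the internal discovery source, and set_proxy is a hypothetical stand-in for what the proxy change appears to do, not the real SDK call.

```python
CLUSTER_DISCOVERY = "Microsoft.Windows.Cluster.Classes.Discovery"
STANDARD = "standard"  # placeholder name for the internal discovery source


def proxy_candidates(relationships, virtual_server):
    """Nodes that have any ShouldManage relationship to the virtual server."""
    return sorted({node for node, target, _ in relationships
                   if target == virtual_server})


def set_proxy(relationships, virtual_server, new_node):
    """Hypothetical proxy change: rewrites only the standard-source row."""
    kept = [r for r in relationships
            if not (r[1] == virtual_server and r[2] == STANDARD)]
    return kept + [(new_node, virtual_server, STANDARD)]


relationships = [
    ("nodeA", "CLUSTER", CLUSTER_DISCOVERY),
    ("nodeB", "CLUSTER", CLUSTER_DISCOVERY),
    ("nodeA", "CLUSTER", STANDARD),
]

before = proxy_candidates(relationships, "CLUSTER")
after = proxy_candidates(set_proxy(relationships, "CLUSTER", "nodeB"), "CLUSTER")
```

In this model both nodes remain candidates before and after the proxy change, because the rows that come from the cluster discovery are never touched.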
The root cause of your inability to delete the virtual servers first, and then the cluster nodes, is this discovered relationship. So if you delete the relationship you’ll find your way home.
Here is a SQL snippet that given the cluster node name (FQDN) will delete all the relationships to cluster virtual servers:
declare @nodeHS nvarchar(255)
-- replace FQDN with the fully qualified name of the cluster node
set @nodeHS = N'Microsoft.SystemCenter.HealthService:FQDN'
-- all live ShouldManage relationships whose source is the node's health service
DECLARE Rel_Cursor CURSOR FOR
(SELECT [RelationshipGenericView].[Id]
FROM dbo.RelationshipGenericView
WHERE ((RelationshipGenericView.[MonitoringRelationshipClassId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()) AND (((dbo.[RelationshipGenericView].[IsDeleted] = 0))))
AND ([RelationshipGenericView].[SourceMonitoringObjectId]
IN (select BaseManagedEntityId from BaseManagedEntity where FullName = @nodeHS)))
OPEN Rel_Cursor;
declare @relId uniqueidentifier
declare @discoSource uniqueidentifier
declare @now datetime
set @now = GETUTCDATE()
FETCH NEXT FROM Rel_Cursor INTO @relId;
WHILE @@FETCH_STATUS = 0
BEGIN
-- reset so a value from the previous iteration is never reused
set @discoSource = NULL
SELECT @discoSource=DSTR.[DiscoverySourceId]
--SELECT *
FROM dbo.[DiscoverySourceToRelationship] DSTR
inner join dbo.[DiscoverySource] DS on DS.DiscoverySourceId = DSTR.DiscoverySourceId
inner join Discovery on Discovery.DiscoveryId = DS.DiscoveryRuleId
WHERE [DiscoveryName] = 'Microsoft.Windows.Cluster.Classes.Discovery'
AND [RelationshipId]=@relId
AND DSTR.[IsDeleted] = 0
-- only touch relationships that were contributed by the cluster discovery
IF @discoSource IS NOT NULL
exec dbo.p_RemoveRelationshipFromDiscoverySourceScope
@RelationshipId=@relId,
@DiscoverySourceId=@discoSource,@TimeGenerated=@now
FETCH NEXT FROM Rel_Cursor INTO @relId;
END;
CLOSE Rel_Cursor;
DEALLOCATE Rel_Cursor;
After executing this SQL statement the cluster node and virtual servers can be deleted from the Operations Console. Obviously this is totally unsupported. So do this at your own risk, but if you followed my analysis it is clear it should be safe.
Cluster discovery is noisy: this means too many 21025 events (i.e. agent reloads), and the way it is performed on every node can have a serious impact on cluster nodes. While I was about to post the details of the noisy discovery (starting with the MAC address mismatch with OS discovery) a new cluster MP was released (6.0.6720.0), which promises to solve a lot of the issues I detected in previous versions. So this one is strongly recommended. Btw it adds support for Windows Server 2008 R2 failover clusters (yes, CSV included).
- Daniele
USMT4 + SCCM 2007 SP2 RC – Downlevel Manifests folder is not present.
I encountered this problem while preparing a demo consisting in a migration from Windows XP to Windows 7. I was using USMT 4, MDT 2010 and SCCM 2007 SP2 RC. In my task sequence I used hard-links to capture user data from a Windows XP installation and then to restore data back after applying a Windows 7 Image previously created. The migration process completed successfully but no system component settings were migrated. I looked at the scanstate.log created in %systemroot%\system32\ccm\logs\SMSTSLog to see if something went wrong and I found the following line :
2009-10-17 17:19:44, Info [0x000000] Downlevel Manifests folder is not present. System component settings will not be gathered.
The message points to a missing manifest folder, so I used FileMon to verify which location scanstate was looking in :
Scanstate looked for that folder in %Systemroot%\system32, same location as the “Current Directory” of the process (as shown by Process Explorer).
By doing a simple search with Google I’ve found that I’m not alone and that other people experienced the same issue (http://systemcenterideas.com/2009/09/usmt-issues-with-mdt-2010). The fix proposed in that blog does not apply to me because I use a “Capture Task” and not a script to execute scanstate.
So while waiting to see if it will be fixed in RTM (I hope so) I needed a workaround.
I decided to create a wrapper that :
- executes a renamed scanstate.exe as a child process.
- passes the same parameters received from the command line to the child process
- passes the Application path as the Current Directory of the child process
- returns the child process exit code.
I used :
- GetCommandLine : to retrieve parameters passed to the wrapper
- GetModuleFileName : to get the application path (after eliminating the application file name)
- CreateProcess : to call the renamed scanstate and to pass the application folder as the Default Directory
- GetExitCodeProcess : to retrieve the child process exit code and to pass it to the task sequence.
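The same idea can be sketched in Python. This is not the wrapper I built (that one calls the Win32 functions listed above directly); it is a minimal illustration, and the function name run_in_folder and the example path in the comment are assumptions. subprocess.run with the cwd argument plays the role of CreateProcess with an explicit current directory, and returncode plays the role of GetExitCodeProcess.

```python
import os
import subprocess
import sys


def run_in_folder(child_command, args, folder):
    """Run the child with 'folder' as its current directory and return its exit code.

    child_command is the list that starts the real tool (for example the full
    path of the renamed _scanstate.exe); args are the parameters the wrapper
    received and forwards unchanged.
    """
    completed = subprocess.run(list(child_command) + list(args), cwd=folder)
    return completed.returncode


# Typical use from a wrapper script that sits next to the renamed tool:
#   folder = os.path.dirname(os.path.abspath(sys.argv[0]))
#   child = os.path.join(folder, "_scanstate.exe")
#   sys.exit(run_in_folder([child], sys.argv[1:], folder))
```

The commented lines show how a wrapper would derive its own folder and hand the child's exit code back to the task sequence.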
I renamed scanstate.exe to _scanstate.exe and I put my wrapper in the USMT x86 folder, naming it scanstate.exe.
As can be seen in the previous picture, my wrapper calls the real scanstate application (named _scanstate.exe) as a child process and forces the “Current Directory” to be the folder where the wrapper is located. With the correct “Current Directory”, _scanstate.exe is able to access the manifest folder.
This is a sort of hack; it is not supported and it is probably not the best solution to this problem, but in this case I’m running an RC version of SCCM 2007 SP2 and I’m only interested in having the demo work properly. I really hope that this will be fixed in the RTM version of SP2 or that, if there is a supported alternative solution to this problem, it will be published soon (I know that I can copy the manifest folder to system32 before calling scanstate, but I don’t like to do that).
– Fabrizio
Strongly recommended non OpsMgr patches
I have read several posts about issues agents are facing even with R2. Yes, R2 still has issues (clustering anyone?), but before pointing your finger at OpsMgr you should consider that a monitoring agent uses interfaces that are not normally used, and this can expose “new” bugs in OS or application components. So it’s not an agent issue, but the bug surfaces only after agent installation. Bottom line: monitoring is never free, even when agentless.
Before pointing your finger at OpsMgr this is our recommended fix list.
On every OS
On Windows Server 2008
- Service Pack 2
On Windows Server 2003
- Windows Script Host 5.7 on Windows 2003 (and Windows 2000)
- KB 952523 on Windows 2003 to address a memory leak in WMI
- KB 931320 another issue with WMI on Windows 2003
- KB 943071 issue with event provider in managed code and WMI on Windows 2003
- KB 933061 it fixes several issues in WMI on Windows 2003, it is of great help with WMI issues even if it won’t resolve them all
On Biztalk 2007 servers (very noisy MP btw)
- you *must* follow what’s reported here http://msdn.microsoft.com/en-us/library/ee290753(BTS.10).aspx. You’ll find specific fixes to be applied other than the ones I report in this short list.
I’ll try to keep this post up to date with any new fix we’ll consider useful for agent health.
– Daniele
Publishing Operations Manager 2007 Web Console with ISA Server 2006 – Performance view problem
I came across this problem months ago but I didn’t post anything on this blog because I thought it wasn’t a common scenario. Today I found a post on a Microsoft newsgroup from someone looking for help with this, so I decided to post an article with the solution.
If you publish an OpsMgr Web console by using a publishing rule in Microsoft Internet Security and Acceleration (ISA) Server 2006 using Forms-based authentication the following error may appear if you try to visit a performance view :
and at the same time an error will appear in the Eventlog of the server holding the Web Console Role :
Log Name: Operations Manager
Source: Web Console
Date: 10/10/2009 4:01:27 PM
Event ID: 10
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: OpsMgr-RMS.domain.lab
Description:
Instance: heraycn1hpnzfe45a0lvit45.
View request processing error:
Microsoft.EnterpriseManagement.OperationsManager.WebConsole.Utility.WebRequestArgumentException: Invalid format of the list of selected performance counters.
Parameter name: Counters ---> System.FormatException: Guid should contain 32 digits with 4 dashes (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx).
at System.Guid..ctor(String g)
at ViewTypePerformance.GetSelectionsFromCookie()
at ViewTypePerformance.GetCountersList(String countersList)
--- End of inner exception stack trace ---
at ViewTypePerformance.GetCountersList(String countersList)
at ViewTypePerformance.ProcessViewRequest()
at ResultPaneBase.Base_Load(Object sender, EventArgs e)
It seems that the code is trying to get the list of counters to show from a cookie, and that an invalid or malformed GUID is found. The problem occurs only if you access the site through ISA, so I supposed that ISA makes some modification to the request. To verify this hypothesis I captured a Network Monitor trace on the requesting client side and another one on the server side (on the server holding the Web Console Role).
The following picture contains a network packet fragment captured on the client side. In this fragment we can see a part of the cookie contained in the request :
The following picture contains the same network packet fragment captured on the server side. In this fragment we can see the same part of the cookie contained in the request :
The two frames are different: we can see that the commas used to separate numbers in the color definitions are replaced in the second frame with semicolons. At this point it was clear that ISA replaced commas with semicolons in the cookie content, and I thought this could be the cause of the issue.
I did a little research with Google and I found the following KB that confirms my hypothesis:
“You publish a Web site by using a publishing rule in Microsoft Internet Security and Acceleration (ISA) Server 2006. When a user visits the Web site, the Web pages do not appear as expected. For example, the page layout may be incorrect, or parts of a Web page may not appear.
You experience this problem if the following conditions are true:
- You use Forms-based authentication (CookieAuth) in ISA Server 2006 to authenticate the users who visit the Web site.
- The Web site is running a Web application that uses one or more commas as part of the cookie content.”
I executed the script contained in the article on my ISA server to change this behavior and now I’m able to access the performance view without any issue.
– Fabrizio
Discoveries, multihoming and cookdown
We have a few customers that are using multihoming for OpsMgr agents. They all complain about slow discovery in the added MG. I’ve been asked about this delay online as well, so I’m going to wrap up my answers and give my view of the issue.
From my internal tests, the slow discovery in the added MG is due to cookdown. Cookdown applies to every workflow, discoveries included. Take for example a discovery of your own for component C that targets component B, which in turn targets Windows.Computer. Important: you’re discovering the components in both MGs. When you add a new MG to an agent, the install process does a very basic discovery and restarts the agent; when the agent is restarted, all the non-synced discoveries are run. After the restart Windows.Computer is discovered for MG2; this causes a reload (event 21025) for that MG, which in turn forces a download of all the workflows targeted at Windows.Computer. What we expect now is that the newly downloaded discoveries (in our example, the one for component B) run. But since the discovery workflow for component B is already there for MG1, and it already ran at agent startup after the new MG was added, cookdown steps in and says “this is the same workflow, with the same signature, so I can safely wait for the next scheduled time”. If the schedule is 24 hours you must wait 24 hours (actually a little less) for component B to appear in MG2. The same process then applies to component C, and so on.
So what can you do to speed up the discovery process for newly added MGs?
First, to check if this is really the issue you’re facing, restart the agent on one sample system and check if discovery data flows in; restarting the agent forces all non time-synced discoveries to run. Between one restart and the next, give the agent enough time (typically 5 to 10 minutes) to complete the discovery cycle.
Second, if you’re the MP author you should use the System.Discovery.Scheduler instead of the System.Scheduler datasource. This has been adopted by the latest OS MPs, so
Third, install the latest OS MPs at least version 6.0.6667.0.
For the curious among you, the difference between System.Scheduler and System.Discovery.Scheduler is only in the signature: in the latter the target managed entity Id has been added, so it won’t be cooked down (the signature will always be different).
<DataSourceModuleType ID="System.Scheduler" Accessibility="Public" Batching="false">
<Configuration>
<IncludeSchemaTypes>
<SchemaType>System.ExpressionEvaluatorSchema</SchemaType>
</IncludeSchemaTypes>
<xsd:element name="Scheduler" type="PublicSchedulerType" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
</Configuration>
<ModuleImplementation Isolation="Any">
<Native>
<ClassID>C3339855-80B3-4c06-B7AB-5C5D97B59A0D</ClassID>
</Native>
</ModuleImplementation>
<OutputType>System.TriggerData</OutputType>
</DataSourceModuleType>
<DataSourceModuleType ID="System.Discovery.Scheduler" Accessibility="Public" Batching="false">
<Configuration>
<IncludeSchemaTypes>
<SchemaType>System.ExpressionEvaluatorSchema</SchemaType>
</IncludeSchemaTypes>
<xsd:element name="Scheduler" type="PublicSchedulerType" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
</Configuration>
<ModuleImplementation Isolation="Any">
<Composite>
<MemberModules>
<DataSource ID="DS1" TypeID="System.Discovery.Scheduler.Internal">
<Scheduler>$Config/Scheduler$</Scheduler>
<ManagedEntityId>$Target/Id$</ManagedEntityId>
<RuleId>$MPElement$</RuleId>
</DataSource>
</MemberModules>
<Composition>
<Node ID="DS1" />
</Composition>
</Composite>
</ModuleImplementation>
<OutputType>System.TriggerData</OutputType>
</DataSourceModuleType>
<DataSourceModuleType ID="System.Discovery.Scheduler.Internal" Accessibility="Internal" Batching="false">
<Configuration>
<IncludeSchemaTypes>
<SchemaType>System.ExpressionEvaluatorSchema</SchemaType>
</IncludeSchemaTypes>
<xsd:element name="Scheduler" type="PublicSchedulerType" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
<xsd:element name="ManagedEntityId" type="xsd:string" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
<xsd:element name="RuleId" type="xsd:string" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
</Configuration>
<ModuleImplementation Isolation="Any">
<Native>
<ClassID>C3339855-80B3-4c06-B7AB-5C5D97B59A0D</ClassID>
</Native>
</ModuleImplementation>
<OutputType>System.TriggerData</OutputType>
</DataSourceModuleType>
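The effect of adding the managed entity Id to the configuration can be illustrated with a toy model in Python. This is not how the agent computes cookdown internally (I have no access to that code); it simply treats "module type plus configuration values" as the signature, and the interval value and the two entity ids are hypothetical.

```python
def signature(module_type, config):
    """Stand-in for the cookdown signature: module type plus its configuration."""
    return (module_type, tuple(sorted(config.items())))


def shared_instances(workflows):
    """Count how many distinct data source instances remain after cookdown."""
    return len({signature(module_type, config) for module_type, config in workflows})


schedule = {"Interval": "86400"}

# The same discovery loaded for two management groups with System.Scheduler:
# identical configuration, so both workflows share one data source instance.
plain = [
    ("System.Scheduler", dict(schedule)),
    ("System.Scheduler", dict(schedule)),
]

# With System.Discovery.Scheduler the target entity id is part of the
# configuration (the ids below are made up), so the signatures differ.
with_entity = [
    ("System.Discovery.Scheduler", dict(schedule, ManagedEntityId="id-in-MG1")),
    ("System.Discovery.Scheduler", dict(schedule, ManagedEntityId="id-in-MG2")),
]
```

In this model the two System.Scheduler workflows collapse into one instance, while the two System.Discovery.Scheduler workflows stay separate and can each fire on their own.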
– Daniele
AD MP – AD Trust Monitoring BUG
A few days ago a colleague of mine complained that after fixing a problem with a trust relationship between 2 domains, OpsMgr continued to show an alert in the console even though the trust had been repaired. I told him to check whether the alert was created by a monitor or by a rule: in the case of a rule I told him simply to close it, in the case of a monitor I told him to wait a couple of hours, because I know that many OpsMgr monitors are scheduled and the state change isn’t immediate. After waiting one day, my colleague came back to my office and told me that the alert comes from a monitor and that nothing had changed; the alert didn’t disappear from the console. I decided it was time to investigate.
First of all I retrieved the name of the monitor from the alert to check its configuration :
The monitor is named “AD Trust Monitoring” and generates an alert named “A problem with the inter-domain trusts has been detected”.
I verified that the monitor is configured to auto resolve the alert when the monitor returns to a healthy state and that no override exists to change this behavior.
So I supposed that even the Health State of the monitor should be unhealthy. Confirmed :
After those checks it was time to look inside the MP to verify how the monitor checks the status of the trust. The monitor uses a scheduled script datasource named “AD_Monitor_Trusts.DataSource” that connects to the WMI namespace “root\MicrosoftActiveDirectory” and queries instances of the class Microsoft_DomainTrustStatus :
Set oAllTrusts = GetObject("winmgmts:\\" & strComputer & "\root\MicrosoftActiveDirectory").InstancesOf("Microsoft_DomainTrustStatus")
If 0 <> Err Then
bLogSuccess = False
ScriptError "failed to get all the trusts for this DC." & GetErrorString(Err)
Else
For Each oTrust in oAllTrusts
If ((oTrust.TrustType = 1) Or (oTrust.TrustType = 2)) And (oTrust.TrustStatus <> 0) And ((oTrust.TrustStatus <> 1786) Or Not bIsRODC) Then
strTrustErrors = strTrustErrors & FormatTrust(oTrust) & ", the error is:" & vbCrLf & _
oTrust.TrustStatusString & " (0x" & Hex(oTrust.TrustStatus) & ")" & vbCrLf
End If
Next
End If
The script checks the TrustStatus returned by the query. If a trust of type 1 or 2 has a TrustStatus not equal to 0, the variable strTrustErrors is populated with the string representation of the error (the expression ((oTrust.TrustStatus <> 1786) Or Not bIsRODC) is always true in the AD 2003 MP because bIsRODC is always false).
If the length of the variable strTrustErrors is greater than 0 a BAD state is returned by the datasource :
If Len(strTrustErrors) > 0 Then
Set oBag = oAPI.CreateTypedPropertyBag(StateDataType)
oBag.AddValue "State", "BAD"
This is confirmed by the documentation published by Microsoft (Trust Monitoring).
I used the utility wbemtest.exe to connect to the same namespace and execute the same query. Every trust returned had a TrustStatus equal to 0, so the script should not return a BAD state. Since the script could not return a “BAD” state, I checked the UnitMonitorType AD_Monitor_Trusts.Monitortype to see when the monitor health is evaluated as Healthy. I found a condition detection that filters the output of the script: if the state returned by the script contains the substring “GOOD”, the monitor returns a healthy state.
<ConditionDetection ID="FilterOK" TypeID="System!System.ExpressionFilter">
<Expression>
<RegExExpression>
<ValueExpression>
<XPathQuery>Property[@Name='State']</XPathQuery>
</ValueExpression>
<Operator>ContainsSubstring</Operator>
<Pattern>GOOD</Pattern>
</RegExExpression>
</Expression>
</ConditionDetection>
<RegularDetection MonitorTypeStateID="TrustsOK">
<Node ID="FilterOK">
<Node ID="ScriptDS" />
</Node>
</RegularDetection>
The last thing to check was when the script returns a GOOD state :
If bLogSuccess Then
Set oBag = oAPI.CreateTypedPropertyBag(StateDataType)
oBag.AddValue "State", "GOOD"
Here I found the problem, the “GOOD” state is returned only if the variable bLogSuccess is true. This variable is passed to the script as the 3rd argument :
bLogSuccess = CBool(oParams(2))
and is taken from the monitor configuration :
<CommandLine>//nologo $file/AD_Monitor_Trusts.vbs$ $Config/TargetComputerName$ false $Config/LogSuccessEvent$</CommandLine>
As you can see in the previous image, the default value of LogSuccessEvent in this monitor is False. With the default configuration the script never returns a “GOOD” state, and the state of the monitor never returns to healthy unless you reset it manually.
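The behavior can be modeled with a few lines of Python. This is a simplified sketch, not the MP code: the function names are made up, and the assumption that the unhealthy state is triggered by an output containing “BAD” is mine (only the “GOOD” filter is shown above).

```python
def script_state(trust_errors, log_success):
    """Mimic the script: BAD on errors, GOOD only when bLogSuccess is true."""
    if len(trust_errors) > 0:
        return "BAD"
    if log_success:
        return "GOOD"
    return None  # no property bag is returned at all


def next_health(current_health, state):
    """Mimic the monitor type: the health changes only when a filter matches."""
    if state is None:
        return current_health
    if "GOOD" in state:
        return "Healthy"
    if "BAD" in state:  # assumed counterpart of the FilterOK condition
        return "Unhealthy"
    return current_health


# The trust was broken, then repaired: the script no longer finds errors.
health = next_health("Healthy", script_state(["trust to DOMAIN-B failed"], False))
after_fix_default = next_health(health, script_state([], False))
after_fix_override = next_health(health, script_state([], True))
```

With LogSuccessEvent left at False the modeled monitor stays unhealthy after the repair; with the value set to True it goes back to healthy on the next run.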
After I created an override that sets LogSuccessEvent to True, the alert disappeared from the console and the state returned to healthy. To fix the problem you can override the value as I did, so that the monitor behaves as expected.
– Fabrizio