Archive for June 2009
Troubleshooting 21025 events – part 2 the RMS case
As Daniele pointed out in the article (Troubleshooting 21025 events – Part 1 evidence) 21025 events on the RMS are bad. In one of our SCOM 2007 R2 environment we observed an high number of those events on the RSM and an impact on the scalability of that infrastructure. So I decided to try to investigate the cause of this unusual number of 21025 events on that specific RSM. I started looking at OpsMgrTrace but with more than 450 agents it was difficult to understand what was going on because of the size of the trace. Fortunately in that environment agents communicate with RMS trough 10 gateways, so i stopped 9 gateways having only 15 agents active during my tests. Looking inside the trace I found the line in which OpsMgr Config Service forces RMS Health Service to Update his configuration :
[ConfigService] [] [Information] :HealthService.HealthServiceState.OnDirty{healthservice_cs1253}( 0000000000C1324E )Logging event 29102 from source OpsMgr Config Service with severity Information and description Configuration state of OpsMgr Health Service {e365284c-48ac-b6d5-0138-0222a107ac46} running on myrms.mydomain.mycountry may be out of date. It should contact OpsMgr Config Service to synchronize its configuration state.
[ConfigService] [] [Information] :StateManager.CollectDirty{statemanager_cs39}( 0000000001F5D7C3 )State=e365284c-48ac-b6d5-0138-0222a107ac46 is dirty
some lines above, I found an entry that seems interesting :
[Bid2Etw_Mom_Dal_1_Trace] [] [Information] :{Information_TextW} Query definition name = ConfigServiceNotification command text = dbo.p_ConfigServiceNotifications
[ConfigService] [] [Information] :ConfigurationChangeSetProvider.UpdateQueryTimestampFromResults{configurationchangesetprovider_cs595}( 0000000003BFFBB0 )Timestamp = 6/28/2009 9:35:49 PM.
[ConfigService] [] [Information] :DatabaseAccessor.NotifyOnChanges{databasenotification_cs329}( 0000000000BF3DF8 )Firing change notification.
I stated looking at the stored procedure dbo.p_ConfigServiceNotifications and IMO it is used to find if something is changed in the environment (Discovery data, ,Relationship, MPs, Rules , Monitor targeted types, Credential Manager Secure Storage, Secure StorageSecure Reference) that need to be notified to the Config Service. I used SQL Profiler to trace the call to this SP and I found that is executed every 30 seconds with parameters similar to the followings :
declare @p3 bit
set @p3=1
exec dbo.p_ConfigServiceNotifications @IsBootStrap=0,@TimeOfLastNotification='2009-06-28 20:44:04:287',@IsDataReliable=@p3 output
select @p3
I used that stored procedure to check if I was able to understand what was changing in that environment that creates an event 21025 on the RMS.
A sample Output of the stored procedure looks like the following :
ManagedTypeId TopHostingEntityId KeyValue
------------------------------------ ------------------------------------ ----------------------
D447F983-507B-46C4-ED1B-D467E134A863 1678A4CA-3406-408E-D4E9-D60F0D77A139 pre.lab (DC02.pre.lab)
I checked with the following SQL statement who hosted the Class (it is confirmed that was not the RMS)
select FullName from dbo.BaseManagedEntity where BaseManagedEntityId = '1678A4CA-3406-408E-D4E9-D60F0D77A139'
FullName
—————————————
Microsoft.Windows.Computer:DC02.pre.lab
I checked with the following SQL statement what class was changed :
select TypeName from dbo.ManagedType where ManagedTypeId = 'D447F983-507B-46C4-ED1B-D467E134A863'
TypeName
———————————-
Microsoft.Windows.DNSServer.Library.Zone
What I found is that some specific classes (DNS Zones, IIS Web Sites, etc) hosted by some servers (not hosted by RSM) generate a 21025 on RMS every time a class property change occurs. I did tests in the past and I observed that a change in a class property generates an event 21025 on the HealtService that hosts the class not on the RMS, so why those specific classes are different from others?
One of the class that generates event 21025 on RSM is a class in a simple custom management pack, so I started to compare that MP with MP that I used in the past to troubleshoot this kind of behavior and I started to remove monitors, rules, etc.. in that MP to understand what is the source of the problem. After some test I removed from that MP a Group that contain al instance of that specific class, without It no 21025 events were generated when a property change of that class occurred.
To verify my hypothesis I created a group in the MP that never generated a 21025 on RSM, and it started to generate them on RMS at every property change on the class contained by the newly created group. I verified the existence of a group that contain DNS Zones and a group that contains IIS Web Sites, so now it is clear why a change on them creates a 21025 on RSM.
In conclusion it seems that if a property of a class changes and the instance of that class is member of group, an event 21025 is created on RMS.
I will update this post with a sample MP that contain a class, a discovery and a group to simulate in lab this behavior.
– Fabrizio
This posting is provided “AS IS” with no warranties, and confers no rights.
OpsMgr 2007 PropertyBags to type or not to type
As a part time MP developer I’m always struggling to make my rules as optimized as I can. I posted about cookdown and the way to leverage it to make monitoring as light as possible. In this respect the optimum data source runs just once regardless of the number of instances of a class monitored by a specific HealthService. A customer of ours asked me to build a MP to monitor mailbox directories (or drop directories if you prefer). It needs to know if the content is up to date and have a check on the size of each folder, it needs to report on the folder size grow over time. This translates into two monitors (one for age and the other for the size) and a performance collection rule for the folder size. On a specific server I can have several mailbox directories and I wanted to run my script just once regardless of the number of them. Long story short I wrote one script (read Probe) that does it all: it returns the data used by both monitors and the performance data points as well.
I though It was a good idea to differentiate my returned property bags, so I started returning two typed property bags: state and performance. The key to do this is to use the CreateTypedPropertyBag method documented in the OpsMgr SDK. The possible types can be:
| Value | Definition |
|---|---|
|
0 |
Alert Data |
|
1 |
Event Data |
|
2 |
Performance Data |
|
3 |
State Data |
Returning to my data source I coded something like the following:
Set oBag = g_API.CreateTypedPropertyBag(StateDataType)
call oBag.AddValue("Path", f.Path) ‘key property we filter on
call oBag.AddValue("Name", f.Name)
call oBag.AddValue("SizeMB", Round(f.Size/1024/1024,2))
call oBag.AddValue("AgeHR", datediff("h", LastDate, now))
g_API.AddItem oBag
Set oBag = g_API.CreateTypedPropertyBag(PerformanceDataType)
call oBag.AddValue("Name", f.Name)
call oBag.AddValue("SizeMB", Round(f.Size/1024/1024,2))
g_API.AddItem oBag
Then I coded my monitor and the performance collecting rule, using the appropriate condition detections. Analyzing the MP behavior I noticed that both property bags (state and performance) are processed by each and every condition detection. I should have think of it from the start, anyway I thought this was a suboptimal solution, and though by myself that since I’m using exactly the same data I should have returned just one property bag. I recoded the data source as follows:
Set oBag = g_API.CreateTypedPropertyBag(StateDataType)
call oBag.AddValue("Path", f.Path) ‘key property we filter on
call oBag.AddValue("Name", f.Name)
call oBag.AddValue("SizeMB", Round(f.Size/1024/1024,2))
call oBag.AddValue("AgeHR", datediff("h", LastDate, now))
g_API.AddItem oBag
with my surprise the performance collecting rule stopped to work. It took me sometime to get to the culprit, but at that time I didn’t have the latest version of the SDK, in fact it states:
“With the current release of Operations Manager 2007, the only modules that use the conversion type check are the mapping modules that perform conversion to event and performance data.”
Since I’m returning a State Data Type the performance mapper discarded my property bag. Eventually things started to be clear, so I have two choices:
- Return two typed property bags, and accept the fact that the condition detections will process each of them. The performance mapper will get instantiated and then just discard the state bag, the monitor condition detections will need to take into account the fact they will get multiple bags for the same monitored instance (i.e. you need to tailor the xpath query accordingly).
- Return just one untyped property bag
To return an untyped property bag it is sufficient to change the CreateTypedPropertyBag method call with CreatePropertyBag
Set oBag = g_API.CreatePropertyBag()
call oBag.AddValue("Path", f.Path) ‘key property we filter on
call oBag.AddValue("Name", f.Name)
call oBag.AddValue("SizeMB", Round(f.Size/1024/1024,2))
call oBag.AddValue("AgeHR", datediff("h", LastDate, now))
g_API.AddItem oBag
In conclusion, it is really up to you the decision about using typed or untyped property bags, just be aware of the pre-defined filtering on event and performance data mappers.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
Tired of WMI Probe Failed Execution?
If so, just apply KB 933061 (http://support.microsoft.com/kb/933061) in my environment I had a 100% success rate. Strongly recommended. btw these are real issues that can impair your monitor.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
Troubleshooting 21025 events – Part 1 evidence
I’m still struggling with what I consider abnormal CPU usage for my agents and RMS, I found out I’m not alone (is-your-rms-updating-configuration-too-frequently.aspx) so I think it’s time to share what I discovered from my investigations in OpsMgr 2007 R2 RTM code. This first post will focus on 21025 events we talked about them in the past (Class properties that get updated frequently is a WORST PRACTICE not only for RMS) and recently (How to get noisy discovery rules). A word of caution the following are my observations, I don’t have access to the source code nor to debugging symbols, even if I double checked every sentence I can be wrong and surely I cannot cover every possible scenario. So what are OpsMgr Connector 21025 events?
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
21025 events are logged every time the HealthService needs to update its configuration, possible causes:
- a new MP is imported and has workflows that target one of the instances discovered on the HealthService
- an override has been defined on a workflow that targets one of the *classes* discovered on the HealthService
- a new instance has been discovered or an already discovered property changes its value (it takes about 50” after the config change to have a 21024 event see below)
- other, I’m still investigating 21025 causes on the RMS that seems not to be related to one of the above causes
With instance I mean a specific discovered object, say Logical Disk C:. With class I mean a generic object type, say Logical Disk. So from the above causes we can infer, among the others, that:
- If I import a new MP with discovery workflows targeted to Windows.Computer all my agents will get this and fire a 21025 *even if* the discovery rules are disabled. We can assume every agent has a Windows.Computer instance in its config space.
- If I enable a discovery rule for one of my Windows.Computer (a specific instance) never the less all the Windows.Computer will get the override and fire a 21025
** room for optimizations Marius :-) **
Remember RMS “hosts” non hosted classes and watchers so any change to those will fire a 21025 on the RMS, but this is just a particular case of cause 3.
With R2 I had no evidence that a changed discovered property on a single agent, nor a new object instance discovered, would cause a 21025 on the RMS. I run a complete test suite on the topic that lead me to the above conclusions, and *never* a changed property or a newly discovered *hosted* instance has caused a 21025 on the RMS. Once again I don’t pretend to have covered every single scenario.
But why are 21025 bad? As Steve addresses in its blog post 21025 have the bad habits of taxing CPU. From my observations 21025 on agents can be of different types in terms of resource utilization:
- 21025 caused by a new MP or Override that loads just the affected MP (since these load just one or a few MPs they’re not loading the system, obviously if do not import and override every minute)
- 21025 caused by a new instance or a property change, in R2 these reload the config without reparsing the MPs, they just update the OpsMgrConnector.Config file (once again very low impact, obviously if your discovery is not running every minute)
- 21025 caused by ??(@!@@!*!), these reparse a lot of MPs causing prolonged CPU spikes, single core machines or old one are especially affected by these
With MP reparsing I’m referring to the process of reparsing and reloading MPs from the local cache (%programfiles%\System Center Operations Manager 2007\Health Service State\Management Packs).
21025 on the RMS are worse, in addition to the above they fire an internal task towards all agents involved in distributed roll up monitors for which the RMS thinks it hasn’t up to date state. This internal task will not log a 21025 event on the agents, but will cause a complete MP cache reload. This behavior has been introduced by KB 958490 (Patching blues – QFE 958490 pumps up HealthService CPU Usage on agents) and confirmed with some notable fixing in R2 (with R2 it hasn’t the bad and diffuse effect described in my post). So if, for any reason, you have agents that are not able to calculate a state involved in a distributed workflow, they are impacted at every 21025 on the RMS. I have 10 of these on about 300 agents deployed.
My investigation still misses a key part, the extra causes (cause 4) for 21025 on RMS, I’m working on it with the invaluable help of the product team (that still keeps too many secrets :-)) and I’ll keep you up to date on any development and hopefully with a complete list of causes and fixes. If anyone is reading and has more info to add I’m here to listen. :-)
Finally I want to document the expected sequence of events that lead to a 21025 on the agent side:
- 21024 event asking for up to date config to the MS
- 21025 event if the agent is not up to date, 21026 if it is up to date (and here we stop)
- a 7023, 7025, 7024, 7028 events sequence for every management group (typically just one but possibly up to 4) These are related to runas accounts
- if the config change involves new MPs then we have a 1200 followed by one 1201 for every MP downloaded
- finally we have a 1210 that states that the new configuration has been loaded
Watch out for 21025 not followed by 1210 this can indicate corrupted configuration has been downloaded.
This is all for this first post, stay tuned for more info on the topic.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
Are OpsMgr 2007 R2 ACS reports broken?
In Italy we have a new law that imposes auditing of administrative logons. ACS is the right solution and it can be integrated with log from heterogeneous systems directly (H2 2009 on the xplat supported platforms) or via Secure Vantage syslog gateway (if the target platform is able to deliver security logs through syslog). We have several customers that are using ACS but every one of them developed custom reports so they’re not using out of the box report. Since we had no issue report we assumed everything worked fine. One of the out of the box reports in ACS addresses one of the law requirements: “Usage_-_User_Logon”. Alas when I tried to use the report for Windows Server 2008 servers it didn’t return any data, even if eventid 4624 (logon on in Windows 2008) is included in the report filter. It turned out the developer added the eventid 4624 but completely missed the fact that the logging in user is registered in the TargetUser property rather than the Primary User property as it was in 2003.
Once again, guys: if it is not tested it doesn’t work. :-)
I don’t know if other reports are affected by the same issue, I just attach the corrected report in case anyone has hit the same issue. Usage%7C_-%7C_User%7C_Logon.rdl
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
OpsMgr 2007 R2 – OleDB module bug
The OleDB module in OpsMgr 2007 R2 still have issues dealing with datetime or nvarchar(max) fields. Even if I reported the issue during the RC evaluation it’s still there. Since I spent a lot of time troubleshooting it I’d like to share the workaround with the community.
Issue: OleDB module returns datetime and nvarchar(max) fields as NULL (or empty I didn’t check the exact value returned).
Let’s take the following query, it gets the last event collected by every HealthService:
select ME.Path As [Path], LoggingComputer, MAX(TimeGenerated)As ‘LastEvent’,
CASE WHEN Isnull(MAX(TimeGenerated),’01-01-80′) < DateAdd(hh,-2,getutcdate()) Then ‘KO’ Else ‘OK’ END
from dbo.ManagedEntityGenericView ME WITH(NOLOCK)
inner join dbo.ManagedTypeView MT WITH(NOLOCK) on ME.MonitoringClassId=MT.Id AND MT.Name = ‘Microsoft.SystemCenter.HealthService’
left join dbo.EventView WITH (NOLOCK) on LoggingComputer = Path and TimeGenerated > dateadd(hh,-8,getutcdate()) or LoggingComputer=SUBSTRING(Path,1,CHARINDEX(‘.’, Path)-1)
where IsDeleted=0
group by Path, LoggingComputer
We can build a task who executes the query using an System.OleDbTriggerProbe. The result will be like this (notice the first and third columns are missing values):
The only workaround I found is to cast the fields to a limited nvarchar, say nvarchar(255). Changing the query like this:
select CAST(ME.Path as nvarchar(255)) As [Path], LoggingComputer, CAST(MAX(TimeGenerated) as nvarchar(50)) As ‘LastEvent’,
CASE WHEN Isnull(MAX(TimeGenerated),’01-01-80′) < DateAdd(hh,-2,getutcdate()) Then ‘KO’ Else ‘OK’ END
from dbo.ManagedEntityGenericView ME WITH(NOLOCK)
inner join dbo.ManagedTypeView MT WITH(NOLOCK) on ME.MonitoringClassId=MT.Id AND MT.Name = ‘Microsoft.SystemCenter.HealthService’
left join dbo.EventView WITH (NOLOCK) on LoggingComputer = Path and TimeGenerated > dateadd(hh,-8,getutcdate()) or LoggingComputer=SUBSTRING(Path,1,CHARINDEX(‘.’, Path)-1)
where IsDeleted=0
group by Path, LoggingComputer
We get the complete output:
Obviously this is far from perfect, for example a datetime must remain a datetime to be able to manage it in monitors. Alas this is it at this time, hopefully someone in the OpsMgr team will fix this soon or later.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
IsVirtualMachine property not discovered?
One customer of ours reported me what seemed a weird issue. The IsVirtualMachine property of the Windows Computer class seemed not to be discovered.
To a further analysis we got evidence of the following:
- all virtual machines are correctly identified
- physical hosts sometimes discover the property (set to false) sometimes not
The property is discovered by two distinct discovery workflows defined in Microsoft.SystemCenter.Internal. Both are based on a WmiProviderWithClassSnapshotDataMapper type that uses a WMI query similar to the following: select * from Win32_Baseboard where manufacturer = “Microsoft Corporation”.
It turned out that physical servers often don’t have any instance for this WMI class, so the WMI discovery doesn’t return any match and in turn the IsVirtualMachine property it is not discovered.
Two issues here:
- WMI is far from being reliable, it is the first failure cause for SCCM deployments as well. Microsoft needs to get it right in reliability and performance.
- The discovery is poor since doesn’t take into account the case of Win32_BaseBoard returning no instances
To address the issue we have just two ways:
- Base any logic on IsVirtualMachine <> true to select physical hosts
- Rewrite the discovery
At this time the former is good enough.
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.