Archive for May 2009
Standard Module Composition for monitors
Hi all, this is our standard architecture for MP development. Without any clear direction in the official documentation, we built this standard to have a common schema that achieves:
- code reusability: we want custom code in just one place, so we need to support it only once
- on demand detections for monitors
With this schema we have covered every single development scenario we got into with the notable exception of on demand detections for performance based monitors. This is due to the fact that standard libraries are missing a probe type for performance collection.
So, for any given data source we build:
- A probe type with input data; this is where custom code, if any, resides
- A trigger probe type without any input data based on the previous probe type
- A data source type based on the probe type and typically a scheduler
- A monitor type that uses the data source type and the trigger probe
- A unit monitor that uses the monitor type
- If needed a task that uses the trigger probe type
The sequence will become like this:
And the following is a snippet with the most important parts:
…
<TypeDefinitions>
  <ModuleTypes>
    <DataSourceModuleType ID="DSType" Accessibility="Internal" Batching="false">
      …
      <ModuleImplementation Isolation="Any">
        <Composite>
          <MemberModules>
            <DataSource ID="Sched" TypeID="System!System.Scheduler"> …
            <ProbeAction ID="Probe" TypeID="ProbeType"> …
          <Composition>
            <Node ID="Probe">
              <Node ID="Sched" />
            </Node>
          </Composition> …
      <OutputType>System!System.BaseData</OutputType> <!-- here the appropriate data type must be specified -->
    </DataSourceModuleType>
    <ProbeActionModuleType ID="ProbeType" Accessibility="Internal" Batching="false" PassThrough="false"> …
      <ModuleImplementation Isolation="Any">
        <Composite>
          <MemberModules>
            <ProbeAction ID="OLEDB" TypeID="System!System.OleDbProbe"> <!-- here a valid probe action type id must be used --> …
          <Composition>
            <Node ID="OLEDB" />
          </Composition>
        </Composite>
      </ModuleImplementation>
      <OutputType>System!System.OleDbData</OutputType>
      <InputType>System!System.BaseData</InputType>
    </ProbeActionModuleType>
    <ProbeActionModuleType ID="TriggerProbeType" Accessibility="Internal" Batching="false" PassThrough="false"> …
      <ModuleImplementation Isolation="Any">
        <Composite>
          <MemberModules>
            <ProbeAction ID="Probe" TypeID="ProbeType"> …
            <ProbeAction ID="PassThrough" TypeID="System!System.PassThroughProbe" />
          </MemberModules>
          <Composition>
            <Node ID="Probe">
              <Node ID="PassThrough" />
            </Node>
          </Composition>
        </Composite>
      </ModuleImplementation>
      <OutputType>System!System.OleDbData</OutputType> <!-- here the appropriate data type must be specified -->
      <TriggerOnly>true</TriggerOnly>
    </ProbeActionModuleType>
  </ModuleTypes>
  <MonitorTypes>
    <UnitMonitorType ID="MonitorType" Accessibility="Internal"> …
      <MonitorImplementation>
        <MemberModules>
          <DataSource ID="DS" TypeID="DSType"> …
          <ProbeAction ID="Probe" TypeID="TriggerProbeType"> …
          <ConditionDetection ID="CD1" TypeID="…"> …
          <ConditionDetection ID="CD2" TypeID="…"> …
        <RegularDetections>
          <RegularDetection MonitorTypeStateID="StateID">
            <Node ID="CD1">
              <Node ID="DS" />
            </Node>
          </RegularDetection> …
        <OnDemandDetections>
          <OnDemandDetection MonitorTypeStateID="StateID">
            <Node ID="CD1">
              <Node ID="Probe" />
            </Node>
          </OnDemandDetection> …
    </UnitMonitorType>
Once all the appropriate types are defined the UnitMonitor and the Task (if needed) are straightforward.
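As a rough illustration of that last step, a minimal sketch of a unit monitor and a task built on the types above might look like the following. Only MonitorType, TriggerProbeType and StateID come from the snippet; every other ID, the target class, the parent monitor, the Health! alias and the health state are hypothetical placeholders, and the configuration elements depend on what your own types declare.

```xml
<!-- Sketch only: Example.* IDs, the target class and the Health! alias are assumptions -->
<Monitoring>
  <Tasks>
    <!-- On demand execution of the same probe the monitor uses -->
    <Task ID="Example.OnDemandTask" Accessibility="Public" Enabled="true" Target="Example.Class" Timeout="300" Remotable="true">
      <Category>Custom</Category>
      <ProbeAction ID="Probe" TypeID="TriggerProbeType">
        <!-- configuration declared by TriggerProbeType goes here -->
      </ProbeAction>
    </Task>
  </Tasks>
  <Monitors>
    <UnitMonitor ID="Example.UnitMonitor" Accessibility="Public" Enabled="true" Target="Example.Class" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="MonitorType" ConfirmDelivery="false">
      <Category>AvailabilityHealth</Category>
      <OperationalStates>
        <!-- one OperationalState per state declared by the monitor type -->
        <OperationalState ID="Example.State" MonitorTypeStateID="StateID" HealthState="Error" />
      </OperationalStates>
      <Configuration>
        <!-- configuration declared by MonitorType goes here -->
      </Configuration>
    </UnitMonitor>
  </Monitors>
</Monitoring>
```

Because the types in the snippet are declared with Accessibility="Internal", the monitor and the task would have to live in the same management pack as the types.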
– Daniele
This posting is provided "AS IS" with no warranties, and confers no rights.
How to get noisy discovery rules
Frequently changing properties and objects can be one of the causes of poor OpsMgr performance. In the RC bits we saw optimizations in this area, but you must still be aware that every configuration reload on the agents (21025 events, but internal tasks as well) is going to tax your CPU. How much? It depends on the number of MPs, their complexity and the available CPU power. I must admit I’m a little paranoid about agent health and resource utilization, but I do not want the monitoring infrastructure to have a negative impact on business processes. Monitoring is great if it prevents downtime, not if it causes it, isn’t it?
So, how can I check for bad MP behaviors? In a steady system, newly discovered objects should be zero or just a few, and discovered properties should register few changes. So the first question we must ask is which newly discovered objects we are getting and from which discovery rules. The second question is which properties are changing frequently and which discovery rules are related to them. Initially I thought of developing a PowerShell script, but I quickly changed my mind:
- PowerShell uses the SDK, so it can access only the live DB, which typically contains a few hours or at most a few days of data
- I’d like to report on these changes so that I can have a weekly report in my inbox
So I turned to SQL queries against the data warehouse. I got a fairly accurate query for both questions. The one caveat I must highlight is that the link between a discovery rule and the objects or properties it discovers is derived from the configuration part of the discovery rule:
<Discovery ID="Microsoft.SQLServer.2008.DBEngineDiscoveryRule.Server" Enabled="true" Target="Windows!Microsoft.Windows.Server.Computer" ConfirmDelivery="false" Remotable="true" Priority="Normal">
<Category>Discovery</Category>
<DiscoveryTypes>
<DiscoveryClass TypeID="Microsoft.SQLServer.2008.DBEngine">
<Property TypeID="SQL!Microsoft.SQLServer.ServerRole" PropertyID="InstanceName" />
<Property TypeID="SQL!Microsoft.SQLServer.DBEngine" PropertyID="ConnectionString" />
<Property TypeID="SQL!Microsoft.SQLServer.DBEngine" PropertyID="ServiceName" />
<Property TypeID="SQL!Microsoft.SQLServer.DBEngine" PropertyID="ServiceClusterName" />
<Property TypeID="SQL!Microsoft.SQLServer.DBEngine" PropertyID="FullTextSearchServiceName" />
<Property TypeID="SQL!Microsoft.SQLServer.DBEngine" PropertyID="FullTextSearchServiceClusterName" />
<Property TypeID="SQL!Microsoft.SQLServer.DBEngine" PropertyID="Version" />
but the <DiscoveryTypes> fragment of the discovery rules is not enforced, so some discovery rules can be missing from the results. On the other hand, the same class and property can be discovered by multiple rules, so we can get a few extra hits. In any case, this level of approximation is negligible.
Discovered objects in the last 4 hours:
select distinct
  MP.ManagementPackSystemName,
  MET.ManagedEntityTypeSystemName,
  D.DiscoverySystemName, D.DiscoveryDefaultName,
  MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName As 'TargetTypeDefaultName',
  ME.Path, ME.Name,
  ME.DWCreatedDateTime
from dbo.vManagedEntity ME
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
  AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ME.DWCreatedDateTime > dateadd(hh,-4,getutcdate())
Modified properties in the last 4 hours:
select distinct
  MP.ManagementPackSystemName,
  MET.ManagedEntityTypeSystemName,
  PropertySystemName,
  D.DiscoverySystemName, D.DiscoveryDefaultName,
  MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName As 'TargetTypeDefaultName',
  ME.Path, ME.Name,
  C.OldValue, C.NewValue, C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
  AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-4,getutcdate())
Top discovery rules in the last 4 hours:
select ManagedEntityTypeSystemName, DiscoverySystemName, count(*) As 'Changes'
from
(select distinct
  MP.ManagementPackSystemName,
  MET.ManagedEntityTypeSystemName,
  PropertySystemName,
  D.DiscoverySystemName, D.DiscoveryDefaultName,
  MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName As 'TargetTypeDefaultName',
  ME.Path, ME.Name,
  C.OldValue, C.NewValue, C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
  AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-4,getutcdate())
) As #T
group by ManagedEntityTypeSystemName, DiscoverySystemName
order by count(*) DESC
and this is a sample output from my environment… I must work on the DPM MP
– Daniele
OpsMgr 2007 R2 – lessons learned
Now that OpsMgr 2007 R2 has reached RTM, we can share some of the lessons learned as early adopters. As many of you know, our environment is not huge: we’re monitoring a few hundred servers that use gateways as their primary point of contact. Nevertheless, we need to monitor three generations of software: Windows 2000 through 2008, SQL 2000 through 2008 and so on. We monitor VMware and Oracle, and soon we will be called to monitor *nix systems. Adding specific and self-developed management packs to these, we reach about 200 MPs deployed in production. This number of MPs brings challenges of its own.
As soon as we evaluated R2 in our pre-production environment, it became clear it was a must-have: much more stable than SP1, even in RC code. So we moved it into production thanks to our RDP participation. This decision had the net effect of putting us in a very busy period that started with MMS and is still running. Anyway, it was worth it, so you must not be scared if I share some attention points. Other bloggers will tell you how cool R2 is, and the fact that we moved to R2 testifies that it is a huge step forward.
So, this is what we learned:
- The R2 upgrade is a one-way ticket. Side-by-side migration is an option, but if your OpsMgr infrastructure is mature and you’re using the data warehouse, this is not a way you want to go. An in-place upgrade is risky and one-way: even if you can technically step back (from backup copies), after a few hours you’ll find yourself in a situation where you cannot afford to lose the monitoring data you collected. So you’d better be prepared for it.
- The upgrade guide is good enough to drive you through the preparation steps, but I would advise putting a good agent monitoring plan in place before upgrading. These are our standard checks (I will return to them in future posts):
- HealthService CPU Usage
- MonitoringHost CPU Usage
- Agent restarts (you can collect event ID 102 from the OpsMgr event log). These can be caused by agent crashes or by the standard monitor on agent resources (private bytes and handles)
- Agent configuration reloads (i.e. 21025 events)
- Actual agent communication: a heartbeat is not enough, as we had heartbeating agents that were not uploading performance and event data
- Frequent discovery changes
- After the upgrade we measured a noticeable increase in HealthService CPU usage on the RMS, gateways and agents alike. It is still not clear why, given that we didn’t change our monitoring baseline; probably rollup and dependency monitors now work in a more reliable way. On the other hand, we see about twenty 21025 events per hour on the RMS; each 21025 event forces a reparse of all the MPs and makes the RMS call out for “unknown” monitor states, and this has an impact on the agents. On the agent side, average CPU usage moved from 3-5% to 5-8%. Most affected agents:
- Domain Controllers
- Cluster nodes
- VMs hosted by Virtual Server
- We had a couple of runaway agents (CPU usage above 50% on average) on cluster nodes multihomed with an SP1 Management Group. To get rid of the problem we had to reset the HealthService cache (i.e. delete the Health Service State directory)
- MPs continued to work as expected
- R2 new features are generally working ok
– Daniele
OpsMgr R2 Authoring Console – Business Hours monitor
The new Authoring Console that will be available with R2 has pros and cons (see my previous post), but it can certainly simplify the creation of basic monitors that must run only during business hours.
The process is not intuitive and you must know how to build such a monitor, but it can nevertheless be of great help when XML is seen as too awkward.
I want to follow up on my previous posts about running a specific monitor during business hours by showing how to achieve this via the Authoring Console. So, let’s consider the scenario in which I want to monitor the CPU utilization of a LOB app only during business hours.
A word of warning: the MP we’re going to build is far from complete. For example, our monitor will not allow overrides, nor are we going to implement OnDemandDetections. Based on feedback and time I can build on that in future posts, but I’m not promising anything.
From 1,000 feet, this is what needs to be accomplished:
- Create a new MP using the Registry App from the Authoring Console. This will create for us a new class and an appropriate discovery
- Create a composite monitor type. This is the trickiest part.
- Create a monitor based on the previous type.
So first step, in the AC let’s create our new MP:
Now remember to:
- change the default discovery interval to at least 1 hour (the default is just crazy, 15 sec). As a best practice, consider using an interval of at least 4 hours in discoveries.
- define an appropriate registry path to discover the app, in my example I will just test for a registry key existence for my mock application TestApp.
Finally let’s add the proper filter
Now your MP is able to discover the LOB application based on registry key existence. You can import it and check whether the application is being discovered; the easiest way to do this is to use the Discovered Inventory view in the monitoring workspace of the Ops Console.
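For readers who prefer to check the result in XML, the discovery produced by these steps should look roughly like the sketch below. This is an illustration written by hand, not the console's actual output: the TestApp.* IDs, the SOFTWARE\TestApp registry path, the attribute name and the Windows! alias are hypothetical, and the PathType and AttributeType values (0 meaning "key" and "check existence" respectively) reflect my understanding of the filtered registry discovery provider and should be verified against your own generated MP.

```xml
<!-- Sketch only: TestApp.* IDs, the registry path and the attribute name are assumptions -->
<Discovery ID="TestApp.Discovery" Enabled="true" Target="Windows!Microsoft.Windows.Computer" ConfirmDelivery="false" Remotable="true" Priority="Normal">
  <Category>Discovery</Category>
  <DiscoveryTypes>
    <DiscoveryClass TypeID="TestApp.Class" />
  </DiscoveryTypes>
  <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.FilteredRegistryDiscoveryProvider">
    <ComputerName>$Target/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
    <RegistryAttributeDefinitions>
      <RegistryAttributeDefinition>
        <AttributeName>TestAppExists</AttributeName>
        <Path>SOFTWARE\TestApp</Path>
        <!-- 0 = registry key (assumed) -->
        <PathType>0</PathType>
        <!-- 0 = check for existence, returns a boolean (assumed) -->
        <AttributeType>0</AttributeType>
      </RegistryAttributeDefinition>
    </RegistryAttributeDefinitions>
    <!-- 14400 seconds = 4 hours, instead of the 15 second default -->
    <Frequency>14400</Frequency>
    <ClassId>$MPElement[Name="TestApp.Class"]$</ClassId>
    <InstanceSettings>
      <Settings>
        <Setting>
          <Name>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Name>
          <Value>$Target/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Value>
        </Setting>
      </Settings>
    </InstanceSettings>
    <!-- The filter: create the instance only when the key exists -->
    <Expression>
      <SimpleExpression>
        <ValueExpression>
          <XPathQuery Type="Boolean">Values/TestAppExists</XPathQuery>
        </ValueExpression>
        <Operator>Equal</Operator>
        <ValueExpression>
          <Value Type="Boolean">true</Value>
        </ValueExpression>
      </SimpleExpression>
    </Expression>
  </DataSource>
</Discovery>
```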
Now comes the trickiest part. To build a monitor that runs only during a specific time window, we must build a composite monitor type with a workflow like this: a scheduler (our time window) –> a probe action –> at least one condition detection for every state we want (2 or 3 actually; strictly speaking we can have a state for the undetected condition, but this would just add complexity to our example). Since we’re using a perf counter we have to change our workflow slightly: there are no performance probes, just performance data sources. One way to overcome this is to change the composition like this: performance data source –> scheduler filter condition detection (our time window) –> at least one condition detection for every state we want.
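In XML terms, the second composition would end up as regular detections shaped roughly like this sketch. The module IDs (PerfDS, BusinessHours, CDUnder, CDOver) and the state IDs are hypothetical names chosen for illustration; the nesting simply mirrors the chain described above, with the innermost node being the first module to run.

```xml
<!-- Sketch only: all module and state IDs are hypothetical placeholders -->
<RegularDetections>
  <!-- Healthy state: PerfDS feeds BusinessHours, which feeds CDUnder -->
  <RegularDetection MonitorTypeStateID="UnderThreshold">
    <Node ID="CDUnder">
      <Node ID="BusinessHours">
        <Node ID="PerfDS" />
      </Node>
    </Node>
  </RegularDetection>
  <!-- Unhealthy state: same source and filter, different condition detection -->
  <RegularDetection MonitorTypeStateID="OverThreshold">
    <Node ID="CDOver">
      <Node ID="BusinessHours">
        <Node ID="PerfDS" />
      </Node>
    </Node>
  </RegularDetection>
</RegularDetections>
```

Placing the scheduler filter between the data source and the condition detections means samples collected outside the time window never reach the thresholds, so the monitor state simply does not change outside business hours.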
So let’s go to work.
1) create a new composite monitor type and give it a name
2) define the monitor states, just 2 in our example
3) Add and configure the performance data source
4) Add two condition detections, one for each monitor state. Since the monitor is based on performance, it’s a good idea to use an Average Condition over a given number of samples. Let’s say our TestApp is ok if it uses less than 20% CPU time.
5) Add the Scheduler Filter; this is where we’re going to define our time window, or business hours. Here we must overcome a limitation in the current build of the Authoring Console. Since we’re talking about business hours, it is highly probable that we need to consider the server local time; to do this we need to replace the TimeXPathQuery property (useless in our scenario) with UseCurrentTime. We cannot do it directly from the Authoring Console, so we must hit the “edit” button and change the resulting XML from <TimeXPathQuery>TimeXPathQuery</TimeXPathQuery> to <UseCurrentTime>true</UseCurrentTime>.
6) Use the configure feature to define the business hour period via the UI, in this example every day from 7:00 AM to 8:00 PM.
7) Now we have all the modules that we need to build our monitor type. The picture shows all the modules you must have at this point. Only the workflow is missing.
8) Define the regular detections for both the monitor states
9) finally define the unit monitor in the health model
10) define the monitor type
11) associate the monitor type states to the monitor state and that’s it
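For reference, after steps 5 and 6 the scheduler filter module should contain a configuration along the lines of the sketch below. The module ID is hypothetical, and the element names and their order reflect my recollection of the System.SchedulerFilter schema rather than output copied from the console, so treat them as assumptions and compare with what the “edit” button shows you.

```xml
<!-- Sketch only: the module ID is hypothetical; verify element names and order in your MP -->
<ConditionDetection ID="BusinessHours" TypeID="System!System.SchedulerFilter">
  <SchedulerFilter>
    <!-- Let data through only inside the schedule -->
    <ProcessDataMode>OnSchedule</ProcessDataMode>
    <!-- Replaces <TimeXPathQuery>: evaluate against the current server time -->
    <UseCurrentTime>true</UseCurrentTime>
    <Schedule>
      <WeeklySchedule>
        <Windows>
          <Daily>
            <Start>07:00</Start>
            <End>20:00</End>
            <!-- 127 = all seven days (Sunday=1 through Saturday=64, summed) -->
            <DaysOfWeekMask>127</DaysOfWeekMask>
          </Daily>
        </Windows>
      </WeeklySchedule>
      <ExcludeDates />
    </Schedule>
  </SchedulerFilter>
</ConditionDetection>
```

To restrict the window to working days only, the mask would become 62 (Monday through Friday) under the same bit assignment.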
– Daniele