In the first two installments Fabrizio and I dug into 21025 events. This final post wraps up what we discovered and the actions we advise.
You’ll find all the details in the previous posts; here I’m just going to share a quick recap and the actions we put in place.
First of all, what does a 21025 mean?
A 21025 gets recorded every time the health service needs to load a new configuration. By “new configuration” I mean anything from a newly deployed management pack or a new override to a newly discovered entity or a change in a discovered property value. In Part 1 I listed all the causes I could find; in Part 2 Fabrizio shed light on 21025s on the RMS. On the RMS, a 21025 is fired for every property change on non-hosted classes and for every property change on any first-level group member. An example helps here: if I have a group that contains logical disks and the size of a logical disk changes, I will get a 21025 on the RMS. If the group contains Windows Computers that in turn host logical disks, and a logical disk property changes, I will *not* get a 21025 on the RMS, because in this second case the logical disks are not “first level” members of my group.
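To make the first-level rule concrete, here is a minimal Python sketch of it. It is only a toy model of the behaviour described above, not how the health service is implemented; the `Group` type, the `rms_gets_21025` function and the entity names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    # Entities listed directly in the group (its "first level" members).
    direct_members: set = field(default_factory=set)


def rms_gets_21025(entity, hosted, groups):
    """Return True if a property change on `entity` reaches the RMS.

    Simplified rule from the text: non-hosted classes always hit the RMS,
    hosted ones only when they are first-level members of some group.
    """
    if not hosted:
        return True
    return any(entity in g.direct_members for g in groups)


# Case 1: the group contains the logical disk itself.
disk_group = Group("All Logical Disks", {"srv01/C:"})
# Case 2: the group contains the computer that hosts the disk.
computer_group = Group("All Windows Computers", {"srv01"})

print(rms_gets_21025("srv01/C:", hosted=True, groups=[disk_group]))      # True
print(rms_gets_21025("srv01/C:", hosted=True, groups=[computer_group]))  # False
```

The second call returns False because the disk is reachable only through its hosting computer, which is exactly the case where the RMS is left alone.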
Why should I care about 21025?
When an agent reloads an MP it needs to reparse it from XML. This is CPU intensive and has an impact on your monitored servers. Then, if any workflow is associated with the modified properties, it will re-run, which can put a further burden on your agents (see case-study-fixing-a-discovery.aspx).
21025s on the RMS are especially bad. The RMS has a lot to do, and a 21025 every few minutes seriously limits its scalability and reliability. Worse, every 21025 on the RMS causes a recalculation of every “unknown state” contributing monitor for rollups. This fires an internal task whose effect is very similar to a 21025 (even though no 21025 is recorded) on every agent that hosts such a monitor.
What can I do to limit 21025 frequency?
Make modifications in batches:
- import new MPs together (remember to try them first in your lab)
- plan your override strategy and apply overrides in batches if possible. If you use groups, make sure the target class has no bad discoveries (i.e. discoveries that frequently change properties)
Do not import MPs you’re not going to use or care about just because they are available (a common error I see in many deployments).
Identify bad discoveries (see my previous post “How to get noisy discovery rules”) and either:
- rewrite them to be quieter (this is expensive)
- or tune them to run just once a day or so
Check whether bad discoveries are hitting non-hosted classes or classes that have first-level membership in groups. These are especially bad because they cause a 21025 on the RMS. In this case, if possible, tune the discovery to run once or twice a day and in sync. If you created the groups yourself, try to recode them to target a quieter hosting class; alas, this is not always possible or viable.
By applying these simple rules we’ve been able to reduce the 21025s per hour on our RMS from 25-30 to 10-15, and we plan to reduce them even more. It would be nice if MS delivered a throttling mechanism for the health service so that we could tune how often a given agent can reload. This could probably be a simple change to the code while we wait for more optimizations in vNext.
I must add that running discoveries in sync is an option that must be thoroughly evaluated. Running in sync means that you will have just one 21025 on your RMS, while running them once a day but not in sync means that you can have as many 21025s per day as there are discovered instances of the class. On the other hand, running them in sync means the RMS and the DB will be hit with data from all the agents at once. So there’s no clear winner here, although I prefer to deal with a one-shot overload of the RMS rather than a repetition of 21025s.
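The trade-off can be illustrated with a small Python sketch. It rests on one loud assumption of mine: property changes that reach the RMS within the same short window are served by a single configuration reload. The 5-minute window, the function name and the schedule below are hypothetical and only there to show the counting.

```python
from datetime import datetime, timedelta


def count_rms_reloads(change_times, window=timedelta(minutes=5)):
    """Count reloads, assuming changes inside one window share a reload."""
    reloads = 0
    window_end = None
    for t in sorted(change_times):
        if window_end is None or t >= window_end:
            reloads += 1
            window_end = t + window
    return reloads


midnight = datetime(2024, 1, 1)
instances = 20

# In sync: every agent submits its discovery data at 02:00.
in_sync = [midnight + timedelta(hours=2) for _ in range(instances)]
# Not in sync: once a day each, but spread 70 minutes apart.
spread = [midnight + timedelta(minutes=70 * i) for i in range(instances)]

print(count_rms_reloads(in_sync))  # 1
print(count_rms_reloads(spread))   # 20
```

What the sketch does not show is the other side of the coin: in the synced case all 20 submissions land on the RMS and the DB at the same moment.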
Finally, to help identify non-hosted classes and groups populated with noisy classes, here are two SQL queries. The first is a slight modification of the one you can find in “How to get noisy discovery rules”, with the hosted property added so you can check whether the class is going to hit the RMS. The second, given a noisy class, returns all the groups populated with it; it takes class hierarchies into account to return a good estimate. Remember that all the queries target the data warehouse, since the live DB doesn’t keep a history of changed properties.
A word of caution: by their very nature these queries are an approximation. They are designed to give clues and guidance. In the search for groups with first-level entities I had to make assumptions, so some false positives can be returned among the groups. This is because I must follow the class hierarchy. So use the results with common sense.
For more information on where a class is managed, I suggest Marius’s post “Where is thy instance monitored and how that affects dependency monitor state?”
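Here is a rough Python sketch of why following the class hierarchy widens the result. The class tree, the group names and the helper functions are all invented; the depth limit of 5 mirrors the second argument passed to the hierarchy functions in the query, which I am assuming is a depth.

```python
# Hypothetical class tree: child -> parent.
PARENT = {
    "Contoso.DnsZone": "Contoso.DnsObject",
    "Contoso.DnsRecord": "Contoso.DnsObject",
    "Contoso.DnsObject": "System.Entity",
    "Contoso.SecondaryZone": "Contoso.DnsZone",
}

# Hypothetical groups and the class named in their membership rule.
GROUP_RULE_CLASS = {
    "Zones Group": "Contoso.DnsZone",
    "DNS Objects Group": "Contoso.DnsObject",
    "Records Group": "Contoso.DnsRecord",
    "Secondary Zones Group": "Contoso.SecondaryZone",
}


def base_types(cls, depth=5):
    """Walk up the tree, at most `depth` levels."""
    found = set()
    while depth > 0 and cls in PARENT:
        cls = PARENT[cls]
        found.add(cls)
        depth -= 1
    return found


def derived_types(cls, depth=5):
    """Walk down the tree, at most `depth` levels."""
    found, frontier = set(), {cls}
    for _ in range(depth):
        frontier = {c for c, p in PARENT.items() if p in frontier}
        if not frontier:
            break
        found |= frontier
    return found


def candidate_groups(noisy_class):
    """Map each matching group to 'exact' or 'related' (check by hand)."""
    related = base_types(noisy_class) | derived_types(noisy_class)
    result = {}
    for group, rule_class in GROUP_RULE_CLASS.items():
        if rule_class == noisy_class:
            result[group] = "exact"
        elif rule_class in related:
            result[group] = "related"
    return result


print(candidate_groups("Contoso.DnsZone"))
```

For the zone class this returns the zones group as an exact match, plus the DNS objects and secondary zones groups as related matches. The related ones are the candidates that may turn out to be false positives and deserve a manual check.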
/* Top Noisy Rules in the last 24 hours with hosted property */
select ManagedEntityTypeSystemName, DiscoverySystemName, Hosted
, count(*) As 'Changes'
from
(select distinct
MET.ManagedEntityTypeSystemName, D.DiscoverySystemName,
MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
C.OldValue, C.NewValue, C.ChangeDateTime
,METMP.HostedInd As Hosted
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
inner join dbo.vManagedEntityTypeManagementPackVersion METMP on METMP.ManagedEntityTypeRowId=MET.ManagedEntityTypeRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())
) As #T
group by ManagedEntityTypeSystemName, DiscoverySystemName, Hosted
order by count(*) DESC
/* Group directly based on a specific class */
declare @ClassRowId int
Select @ClassRowId = ManagedEntityTypeRowId from dbo.
where ManagedEntityTypeSystemName = 'Microsoft.Windows.DNSServer.Library.Zone'
select distinct MET.ManagedEntityTypeSystemName,
MET2.ManagedEntityTypeSystemName As 'GroupSystemName', MET2.ManagedEntityTypeDefaultName 'GroupDefaultName'
, MP.ManagementPackSystemName 'MPSystemName', MP.ManagementPackDefaultName 'MPDefaultName', DS.Name 'GroupName'
(select distinct ManagedEntityTypeRowId from dbo.ManagedEntityDerivedTypeHierarchy(@ClassRowId, 5)
select distinct ManagedEntityTypeRowId from dbo.ManagedEntityBaseTypeHierarchy(@ClassRowId, 5)
inner join dbo.vManagedEntityType MET on
inner join dbo.vDiscoveryManagementPackVersion MPV on CAST(DefinitionXml.query('data(/Discovery/DataSource/MembershipRules/MembershipRule/MonitoringClass)') AS nvarchar(max)) like '%' + MET.ManagedEntityTypeSystemName + '%'
inner join dbo.vManagedEntityType MET2 on MET2.ManagedEntityTypeSystemName = CAST(DefinitionXml.query('data(/Discovery/@Target)') as nvarchar(max))
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET2.
left join dbo.DisplayString DS on DS.ElementGuid=MET2.
- Daniele
This posting is provided “AS IS” with no warranties, and confers no rights.