On Demand Detections caveats (a cookdown story)


I’m a big fan of On Demand Detections, but in the past few weeks a note from Raphael Burri and then this thread (http://www.systemcentercentral.com/tabid/60/indexId/21097/tag/Forums+MP_Development/Default.aspx#vindex21461) on System Center Central made me wonder if I’m right on this topic.

In this post I’m going to explain On Demand Detections (ODDs) and their relationship with cookdown, which in turn drives the overall agent performance impact on monitored systems.

For those who are new to ODDs, they are the way to ask a monitor to recalculate its state. As you know, many monitors are schedule based, in the sense that they check the state every x minutes/hours; for these monitors, what you see in the console is not a real time picture of what’s going on, it’s just the picture of the last monitor run. ODDs are the way you can ask for a refresh of the picture without waiting for the next polling cycle. If a monitor has ODDs it will respond to “Recalculate Health” in Health Explorer; if not, the button is useless (yes, I’ve already asked the team to gray out the button if the monitor has no ODDs).


If present, ODDs are called when a managed entity exits maintenance mode or is initialized at agent startup. This is why I consider ODDs a strategic part of every scheduler based monitor. Finally, I must add that you need one ODD for every monitor state, so typically two or three ODDs per monitor.
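
To make this concrete, here is a minimal sketch of where ODDs live in MP XML: a two-state unit monitor type whose regular detections run off a scheduled data source, while the on-demand detections run off a trigger-only probe. All the MyMP.* IDs are hypothetical and the filter configurations are elided; only System.ExpressionFilter is a real library module.

<UnitMonitorType ID="MyMP.TwoStateMonitorType" Accessibility="Internal">
  <MonitorTypeStates>
    <MonitorTypeState ID="Healthy" NoDetection="false"/>
    <MonitorTypeState ID="Error" NoDetection="false"/>
  </MonitorTypeStates>
  <Configuration/>
  <MonitorImplementation>
    <MemberModules>
      <!-- scheduled data source (scheduler + probe), used by the regular detections -->
      <DataSource ID="DS" TypeID="MyMP.DS"/>
      <!-- trigger-only probe, used by the on demand detections -->
      <ProbeAction ID="T1" TypeID="MyMP.T1"/>
      <!-- one filter per monitor state (expressions elided) -->
      <ConditionDetection ID="CDHealthy" TypeID="System!System.ExpressionFilter"/>
      <ConditionDetection ID="CDError" TypeID="System!System.ExpressionFilter"/>
    </MemberModules>
    <RegularDetections>
      <RegularDetection MonitorTypeStateID="Healthy">
        <Node ID="CDHealthy"><Node ID="DS"/></Node>
      </RegularDetection>
      <RegularDetection MonitorTypeStateID="Error">
        <Node ID="CDError"><Node ID="DS"/></Node>
      </RegularDetection>
    </RegularDetections>
    <!-- one OnDemandDetection per state: this is what "Recalculate Health",
         agent startup and maintenance mode exit invoke -->
    <OnDemandDetections>
      <OnDemandDetection MonitorTypeStateID="Healthy">
        <Node ID="CDHealthy"><Node ID="T1"/></Node>
      </OnDemandDetection>
      <OnDemandDetection MonitorTypeStateID="Error">
        <Node ID="CDError"><Node ID="T1"/></Node>
      </OnDemandDetection>
    </OnDemandDetections>
  </MonitorImplementation>
</UnitMonitorType>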

As usual, when developing an MP and defining the monitoring strategy, it’s fundamental to consider what will happen on the agents and whether the load we’re going to add is acceptable or not. This is especially true if we’re going to monitor multiple instances of our monitored class on a single agent. For example, if I write a monitor for a SQL database I know I will have many of them on a SQL box, or if I write a monitor to check folder size or age I know I can potentially have several of them on a single agent. Let’s think of a script based monitor that targets a SQL database: if I have 50 dbs I will launch 50 scripts all together, right? Not exactly, here “cookdown” comes to the rescue. Cookdown is an agent functionality that consolidates identical data sources into one. So if I properly write my monitoring script I can have it run just once for all my 50 dbs. In this case the script needs to calculate the state for all dbs and return a property bag for each of them, and the monitor will have a filter to grab just the state for the db it is targeted at. Given the level of my English, it is harder to explain than to do. Anyway, what we need to remember is: with properly written data sources, I can limit the load on my agents by executing the script just once, regardless of the number of instances monitored.
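
As an illustration, here is a minimal sketch of a cookdown-friendly script probe, assuming the standard Microsoft.Windows.ScriptPropertyBagProbe module type; the script name and the GetDatabaseNames/ComputeState helpers are hypothetical. The script enumerates all databases in a single run and returns one property bag per database:

<ProbeAction ID="P1" TypeID="Windows!Microsoft.Windows.ScriptPropertyBagProbe">
  <ScriptName>GetDbStates.vbs</ScriptName>
  <!-- no $Target$-specific arguments: identical configuration across
       instances is what allows cookdown to consolidate the runs -->
  <Arguments></Arguments>
  <ScriptBody><![CDATA[
Dim oAPI, oBag, sDbName
Set oAPI = CreateObject("MOM.ScriptAPI")
' GetDatabaseNames and ComputeState are hypothetical helpers
For Each sDbName In GetDatabaseNames()
    Set oBag = oAPI.CreatePropertyBag()
    Call oBag.AddValue("DatabaseName", sDbName)
    Call oBag.AddValue("State", ComputeState(sDbName))
    Call oAPI.AddItem(oBag)
Next
oAPI.ReturnItems
]]></ScriptBody>
  <TimeoutSeconds>120</TimeoutSeconds>
</ProbeAction>

Each monitor then keeps only the bag belonging to its own instance with an expression filter along these lines (the target class and property are again hypothetical):

<ConditionDetection ID="Filter" TypeID="System!System.ExpressionFilter">
  <Expression>
    <SimpleExpression>
      <ValueExpression>
        <XPathQuery Type="String">Property[@Name='DatabaseName']</XPathQuery>
      </ValueExpression>
      <Operator>Equal</Operator>
      <ValueExpression>
        <Value Type="String">$Target/Property[Type="MyMP.Database"]/DatabaseName$</Value>
      </ValueExpression>
    </SimpleExpression>
  </Expression>
</ConditionDetection>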

What Raphael told me basically was “cookdown doesn’t work for ODDs”, but I was sure I had tested them, so something was not clear to me. Marius, in his post, confirmed that cookdown for ODDs works just per instance. So if I have a monitor with 3 ODDs, according to Marius, it will run just once; but if I have a monitor with 3 ODDs targeted at 10 instances, it will run 10 times, once for each instance. If cookdown weren’t in place (as was the case with SP1) I would have 3 x 10 = 30 runs.

Since I have many MPs with ODDs, it was time to clear things up; I’m going to share what I’ve found with you.

 

The testing case:

  • one script based probe (P1); the script will be referenced as S1
  • one trigger probe composed with P1 (T1)
  • one data source (DS) composed of a scheduler and P1
  • two monitors using DS for standard detections and T1 for on demand detections (two detections defined)
  • one collection rule composed of DS and a perf mapper condition

The monitors are targeted at class C; the environment has 3 instances of class C managed by one health service (HS).

The two monitors and the collection rule are time synced and have the same recurring schedule.
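
For reference, DS in this testing case might be composed along these lines (a sketch with hypothetical MyMP.* IDs; System.SimpleScheduler is the real library scheduler module):

<DataSourceModuleType ID="MyMP.DS" Accessibility="Internal">
  <Configuration>
    <xsd:element name="IntervalSeconds" type="xsd:integer"/>
    <xsd:element name="SyncTime" type="xsd:string"/>
  </Configuration>
  <ModuleImplementation>
    <Composite>
      <MemberModules>
        <DataSource ID="Scheduler" TypeID="System!System.SimpleScheduler">
          <IntervalSeconds>$Config/IntervalSeconds$</IntervalSeconds>
          <SyncTime>$Config/SyncTime$</SyncTime>
        </DataSource>
        <!-- P1 is the script probe from the sketch above -->
        <ProbeAction ID="P1" TypeID="MyMP.P1"/>
      </MemberModules>
      <Composition>
        <!-- the scheduler output triggers the probe -->
        <Node ID="P1">
          <Node ID="Scheduler"/>
        </Node>
      </Composition>
    </Composite>
  </ModuleImplementation>
  <OutputType>System!System.PropertyBagData</OutputType>
</DataSourceModuleType>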

 

What I was expecting:

  • One run of S1 at synced time for the two monitors and the rule
  • One run of S1 at agent startup caused by ODDs
  • One run of S1 when exiting maintenance mode

 

Observed behavior:

  • at HS initialization one instance of S1 is run for every monitor and every instance, so we have 6 runs (2 monitors x 3 instances); HS initialization means service startup and the moment the instance exits maintenance mode
  • 8 copies of S1 are present under the Health Service State directory
  • at sync time two instances of S1 are run; if I disable the collection rule, just one instance of S1 is run
  • removing the sync time from the monitors and the rule has no effect on cookdown: when a new instance of class C is discovered it triggers a config reload (event id 21025), and this puts all instances on par
  • removing the sync time means that at agent startup I’ll have 8 runs of S1: 6 runs from on demand detections, plus 1 run for the two monitors and 1 run for the rule

I repeated the same tests with an OleDB module as the probe, with the same results, so in this respect the behavior is consistent.

 

Notes and conclusions:

  • cookdown for On Demand Detections works just per entity and per monitor, so having On Demand Detections for multiple instance classes can lead to serious performance issues.
  • since cookdown for ODDs is per instance and per monitor, we can assume it is a mistake to use the same data source / provider used for standard detections; in fact we can have race conditions between the detections. Remember: since you’ll have one run per instance, if the provider calculates the state for every single instance you’re going to recalculate the state of every instance n times, where n is the number of instances. Suppose, for example, you take a lock (directly or indirectly) on a resource to get its state.
  • cookdown doesn’t work between workflows with different ConfirmDelivery properties. So if you have workflows that use the same data source but with a ConfirmDelivery mismatch you will have two runs of the data source (and no more than two, since ConfirmDelivery is a boolean and accepts just true or false). This is why in my testing case I had two runs of S1 (see the sketch after this list).
  • the statement above implies we can have race conditions between workflows when they use the same data source / provider but a different ConfirmDelivery property.
  • as things stand today, I must discourage ODDs for multiple instance monitoring.
  • at the same time I raise a call to the product team: ODDs are so useful that, if it’s too difficult to cook down across entities and monitors, you should just disable them at startup and after maintenance mode; after all, they’re called On Demand. I speculate this shouldn’t be a huge QFE to implement. In any case, please spend a final word on how you’re going to implement ODDs in future releases, so that we can understand in which cases to use them and in which not. In particular, will they cook down across entities and monitors?
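
To show where ConfirmDelivery lives: it is an attribute on the Rule element (monitors have no such attribute). A sketch of the testing case’s collection rule follows, with hypothetical MyMP.* IDs and assuming the usual Perf/SC aliases for System.Performance.Library and Microsoft.SystemCenter.Library:

<!-- ConfirmDelivery must match the other workflows sharing MyMP.DS,
     otherwise the data source runs once per distinct value -->
<Rule ID="MyMP.CollectStateRule" Enabled="true" Target="MyMP.ClassC" ConfirmDelivery="false">
  <Category>PerformanceCollection</Category>
  <DataSources>
    <DataSource ID="DS" TypeID="MyMP.DS">
      <IntervalSeconds>300</IntervalSeconds>
      <SyncTime></SyncTime>
    </DataSource>
  </DataSources>
  <ConditionDetection ID="PerfMapper" TypeID="Perf!System.Performance.DataGenericMapper">
    <ObjectName>ClassC</ObjectName>
    <CounterName>State</CounterName>
    <InstanceName>$Data/Property[@Name='DatabaseName']$</InstanceName>
    <Value>$Data/Property[@Name='State']$</Value>
  </ConditionDetection>
  <WriteActions>
    <WriteAction ID="WriteToDB" TypeID="SC!Microsoft.SystemCenter.CollectPerformanceData"/>
  </WriteActions>
</Rule>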

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  #1 by Pete Zerger on August 10, 2009 - 8:18 am

    Daniele, Nice article, but I think I must misunderstand your intent. Are you suggesting that output from a single data source module feeding a rule and a monitor will not cook down?

    I have working examples of a data source passing script output to a performance collection rule and a two-state monitor.

    Would you clarify your statements above for me please?

    • #2 by Daniele Grandini on August 10, 2009 - 4:38 pm

      Hi Pete,
      what I’m trying to explain is that cookdown works just inside the same workflow type, at least from my observations. So if you have the same data source for a monitor and a rule, regardless of the number of instances, it will run twice: once for the monitor and once for the rule. If you have, as in my example, two monitors and one rule, it will run twice once again: once for the two monitors and once for the rule. Obviously it will cook down inside monitors, so if you have 100 instances it will run just once for your monitors, and it will cook down inside rules: if you have the same 100 instances it will run just once for your rule.
      Hope to have clarified what I’ve found.
      Ciao
      Daniele

