OpsMgr 2007 R2 – lessons learned


Now that OpsMgr 2007 R2 has reached RTM we can share some of the lessons we learned as early adopters. As many of you know, our environment is not huge: we’re monitoring a few hundred servers that use gateways as their primary point of contact. Nevertheless we need to monitor three generations of software: Windows 2000 through 2008, SQL 2000 through 2008 and so on. We monitor VMware and Oracle, and soon we will be asked to monitor *nix systems. Add to these some specific and self-developed management packs and we reach about 200 MPs deployed in production. That number of MPs brings challenges of its own.

As soon as we evaluated R2 in our pre-production environment it became clear that it was a must-have: much more stable than SP1, even in RC code. So we moved it into production thanks to our RDP participation. This decision had the net effect of putting us into a very busy period that started with MMS and is still going on. Anyway, it was worth it, so don’t be scared if I share some points that need attention here. Other bloggers will tell you how cool R2 is, and the fact that we moved to R2 testifies that it is a huge step forward.

So, this is what we learned:

  1. The R2 upgrade is a one-way ticket. Side-by-side migration is an option, but if your OpsMgr infrastructure is mature and you’re using the data warehouse it is not a route you want to take. The in-place upgrade is risky and one-way: even if you can technically step back (from backup copies), after a few hours you’ll find yourself in a situation where you cannot afford to lose the monitoring data you have collected since the upgrade. So you had better be prepared for it.
  2. The upgrade guide is good enough to drive you through the preparation steps, but I would advise putting a good agent monitoring plan in place before upgrading. These are our standard checks (I will return to them in future posts):
    1. HealthService CPU Usage
    2. MonitoringHost CPU Usage
    3. Agent restarts (you can collect event ID 102 from the OpsMgr event log). These can be caused by agent crashes or by the standard monitors on agent resources (private bytes and handles)
    4. Agent configuration reloads (i.e. 21025 events)
    5. Actual agent communication. Heartbeat is not enough: we had heartbeating agents that were not uploading performance and event data
    6. Frequent discovery changes
  3. After the upgrade we measured a noticeable increase in HealthService CPU usage on the RMS, gateways and agents. It is still not clear why, given that we didn’t change our monitoring baseline; probably it is because rollup and dependency monitors now work more reliably. On the other hand, we have about twenty 21025 events per hour on the RMS. 21025 events force the reparsing of all the MPs and make the RMS call out for the state of “unknown” monitors, and this has an impact on the agents. On the agent side the average CPU usage moved from 3-5% to 5-8%. Most affected agents:
    1. Domain Controllers
    2. Cluster nodes
    3. VMs hosted by Virtual Server
  4. We had a couple of runaway agents (CPU usage above 50% on average) on cluster nodes multihomed with an SP1 Management Group. To fix them we had to reset the HealthService cache (i.e. delete the Health Service State directory)
  5. MPs continued to work as expected
  6. The new R2 features are generally working OK
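
As a minimal sketch of how checks such as agent restarts (event ID 102) and configuration reloads (event ID 21025) could be evaluated offline, assume the events have already been exported as (computer, event ID, timestamp) records. The record layout, the function names, the sample data and the threshold are all hypothetical; none of this is part of OpsMgr or of the management pack mentioned in this post.

```python
from collections import Counter
from datetime import datetime

# Event IDs named in the post; everything else here is an illustrative assumption.
AGENT_RESTART = 102
CONFIG_RELOAD = 21025


def hourly_counts(events, event_id):
    """Count events with the given ID per (computer, hour) bucket."""
    counts = Counter()
    for computer, eid, timestamp in events:
        if eid == event_id:
            # Truncate the timestamp to the start of its hour.
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            counts[(computer, hour)] += 1
    return counts


def noisy_computers(events, event_id, max_per_hour):
    """Return the sorted names of computers exceeding max_per_hour in any hour."""
    counts = hourly_counts(events, event_id)
    return sorted({computer for (computer, _), n in counts.items() if n > max_per_hour})


# Hypothetical sample: (computer, event ID, timestamp)
sample_events = [
    ("dc01", 21025, datetime(2009, 5, 26, 10, 5)),
    ("dc01", 21025, datetime(2009, 5, 26, 10, 20)),
    ("dc01", 21025, datetime(2009, 5, 26, 10, 40)),
    ("node1", 102, datetime(2009, 5, 26, 10, 15)),
    ("node1", 21025, datetime(2009, 5, 26, 11, 0)),
]

print(noisy_computers(sample_events, CONFIG_RELOAD, max_per_hour=2))  # ['dc01']
```

The threshold is passed in as a parameter because a sensible value depends on the environment; the same function can be called with `AGENT_RESTART` to list agents that restart too often.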

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  1. #1 by SCOMUser on May 26, 2009 - 9:22 pm

    Great post. What method do you use to troubleshoot agents that are communicating but not sending Performance and event data?

    • #2 by Daniele Grandini on May 27, 2009 - 5:44 pm

      Hi Marc,
      we developed a complex MP to assess agent health so that we can see in a single view which agents are in trouble. I cannot share the MP, but I can share the SQL queries we’re using. As soon as I am through the R2 deployment I will post them.
