Now that OpsMgr 2007 R2 has reached RTM, we can share some of the lessons we learned as early adopters. As many of you know, our environment is not huge: we're monitoring a few hundred servers using gateways as their primary point of contact. Nevertheless, we need to monitor three generations of software, Windows 2000 through 2008, SQL 2000 through 2008, and so on. We monitor VMware and Oracle, and soon we will be called to monitor *nix systems. Add to these some specific, self-developed management packs and we reach about 200 MPs deployed in production. This number of MPs brings challenges of its own.
As soon as we evaluated R2 in our pre-production environment it became clear it was a must-have: much more stable than SP1, even in RC code. So we moved it into production thanks to our RDP participation. This decision had the net effect of plunging us into a very busy period that started with MMS and is still running. Anyway, it was worth it, so don't be scared that I'm going to share attention points; other bloggers will tell you how cool R2 is, and the fact that we moved to R2 testifies that it is a huge step forward.
So, this is what we learned:
- R2 upgrade is a one-way ticket. Side-by-side migration is an option, but if your OpsMgr infrastructure is mature and you're using the data warehouse, it is not a way you want to go. In-place upgrade is risky and one way: even if you can technically step back (from backup copies), after a few hours you'll find yourself in a situation where you cannot afford to lose the monitoring data you collected. So you'd better be prepared for it.
- the upgrade guide is good enough to drive you through the preparation steps, but I would advise putting a good agent monitoring plan in place before upgrading. These are our standard checks (I will return to them in future posts):
- HealthService CPU Usage
- MonitoringHost CPU Usage
- Agent restarts (you can collect event ID 102 from the Operations Manager event log). These can be caused by agent crashes or by the standard monitors on agent resources (private bytes and handle count)
- Agent configuration reloads (i.e. 21025 events)
- Actual agent communication; heartbeat alone is not enough, we had heartbeating agents that were not uploading performance and event data
- Frequent discovery changes
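For the event-based checks above, a minimal sketch of how we tally restarts (102) and configuration reloads (21025) per hour from an exported event log. This assumes you have dumped the Operations Manager log to timestamp/event-ID pairs (for example via wevtutil or a script); the function name and data shape are hypothetical, not an OpsMgr API:

```python
from collections import Counter
from datetime import datetime

# Event IDs we watch for in the Operations Manager event log
AGENT_RESTART = 102    # agent (health service) restarted
CONFIG_RELOAD = 21025  # new configuration received / reloaded

def count_per_hour(events):
    """events: iterable of (iso_timestamp, event_id) tuples.

    Returns a Counter keyed by (hour_string, event_id), so spikes in
    restarts or config reloads stand out per hour.
    """
    counts = Counter()
    for ts, event_id in events:
        if event_id in (AGENT_RESTART, CONFIG_RELOAD):
            hour = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00")
            counts[(hour, event_id)] += 1
    return counts

# Made-up log entries for illustration:
sample = [
    ("2009-06-01T10:05:00", 21025),
    ("2009-06-01T10:20:00", 21025),
    ("2009-06-01T10:45:00", 102),
    ("2009-06-01T11:10:00", 21025),
]
counts = count_per_hour(sample)
# counts[("2009-06-01 10:00", 21025)] == 2
```

Anything well above your pre-upgrade baseline in either counter is worth a look before you blame the upgrade itself.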
- after the upgrade we measured a noticeable increase in HealthService CPU usage on the RMS, gateways and agents alike. It is still not clear why, given that we didn't change our monitoring baseline; probably it is because rollup and dependency monitors now work in a more reliable way. On top of that, we see about twenty 21025 events per hour on the RMS; each 21025 event forces a reparse of all the MPs and makes the RMS call out for "unknown" monitor states, which has an impact on agents. On the agent side, average CPU usage moved from 3-5% to 5-8%. Most affected agents:
- Domain Controllers
- Cluster nodes
- VMs hosted by Virtual Server
- We had a couple of runaway agents (CPU usage above 50% on average) on cluster nodes multihomed with an SP1 management group. To get rid of them we had to reset the HealthService cache (i.e. delete the "Health Service State" directory)
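A minimal sketch of how we spot runaway agents from collected CPU samples; the 50% threshold matches what we saw, while the function name and sample data are made up for illustration:

```python
# Hypothetical sketch: flag agents whose average HealthService CPU
# usage exceeds a runaway threshold (50% was our red flag).
RUNAWAY_THRESHOLD = 50.0  # percent, averaged over the sample window

def runaway_agents(samples, threshold=RUNAWAY_THRESHOLD):
    """samples: dict mapping agent name -> list of CPU % samples."""
    return sorted(
        name for name, values in samples.items()
        if values and sum(values) / len(values) > threshold
    )

# Made-up data: normal agents sit around 5-8% after R2, while the
# multihomed cluster nodes averaged well above the threshold.
samples = {
    "node1.cluster.example.com": [55.0, 62.0, 48.0],
    "web01.example.com": [5.0, 7.0, 6.0],
}
flagged = runaway_agents(samples)
# flagged == ["node1.cluster.example.com"]
```

For the agents this flags, the remediation that worked for us was stopping the health service, deleting the "Health Service State" directory, and restarting the service so the cache is rebuilt.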
- MPs continued to work as expected
- R2 new features are generally working ok
This posting is provided "AS IS" with no warranties, and confers no rights.