Agent management from the right Management Server

As many of you know, apart from deploying OpsMgr on customer sites, we deliver an outsourced monitoring service based on OpsMgr. To do this we deploy one or more gateways at customer sites, and these gateways send data to an OpsMgr infrastructure located in our datacenter. As you can guess, we need to keep our agents running smoothly, and this is a challenge on its own.

OpsMgr promises to take care of agent health and defines diagnostics and recoveries to restart or even reinstall broken agents. I’m not optimistic by nature, but I tend to assume that if something is there, it has been tested and will, more or less, work. Right? Wrong.

All the diagnostics and recoveries for agent health are targeted at the HealthServiceWatcher class. This makes sense, since it is the class responsible for remotely checking agent health, but the HealthServiceWatcher workflows are executed by the RMS.

This breaks the OpsMgr architecture, where you’re supposed to have MSs/Gateways managing agents. Having diagnostics and recoveries executed by the RMS and not by the proper MS has some implications:

  • The agents must be reachable from the RMS. If they are not, as soon as an agent stops heartbeating the computer unreachable monitor turns red (it is based on a ping diagnostic run from the RMS), and you have no clue whether it is just an agent issue or the entire box is down.
  • Agent restart recoveries won’t work either: you need RMS-to-agent RPC connectivity *and* the automatic agent management run-as account must be able to log on to the RMS rather than to the MSs/Gateways.
  • Agent reinstall is probably worse: it always tries to reinstall the agent and assign it to the RMS. If you don’t have AD assignment in place and the RMS can reach the agent, you’ll find your broken agents moving from their assigned MS to the RMS.

Obviously you’ll be affected by this only if you have a distributed OpsMgr architecture. In a simple, all-in-one deployment, everything will work.

In our environment, with different forests managed across the Internet, this means we get false positives and cannot use the built-in diagnostics and recoveries. Too bad, considering that whenever we need to restart an agent we have to set up a VPN connection and remote into the customer site with the appropriate account.

To lower our TCO, this needed to be addressed. We split the development into two steps:

  1. We must give our operators the ability to remotely execute actions from the proper gateway to the agent. First actions needed: restart the HealthService, ping the managed system.
  2. We must rewrite the diagnostics and recoveries associated with the heartbeat failure monitor so that they use the correct MS/GW.

The first step is preparatory to the second one. In this blog post I will address the solution we implemented for step 1.

Side note: we did not dare to change the HealthServiceWatcher "managed by" relationship using a discovery (leveraging the relationships Microsoft.SystemCenter.HealthServiceShouldManageEntity and Microsoft.SystemCenter.HealthServiceManagesEntity), since it is very difficult to understand the side effects of this change.

If you don’t want to read the details of how we achieved this, just download the attached MP. It loads two tasks targeted at the HealthServiceWatcher class, which perform a remote restart and a remote ping. As usual the MP is raw, with little error checking and no bells and whistles, but it just works. Just remember to deploy the correct run-as account to your MSs or Gateways.

The first question we had to answer was: since we target the task at the HealthServiceWatcher, and the workflows associated with it are run by the RMS, how can we execute our task on the proper MS? There are several problems here:

  1. How can we get the proper MS starting from the HealthServiceWatcher instance?
  2. How can we reach the MS and make it execute the proper task?

The first question is easily answered via some SDK calls: GetComputerHealthServiceByHealthServiceId and GetPrimaryManagementServer. Since there’s no documented way to deploy a managed code assembly to agents, or in this case to MSs, we decided to use PowerShell:

$mg = (Get-Item .).ManagementGroup;
$admin = $mg.GetAdministration()
$hsa = $admin.GetComputerHealthServiceByHealthServiceId($wid)
$msa = $hsa.GetPrimaryManagementServer()

where $wid is the HealthServiceId property of the HealthServiceWatcher class.

The second question can be reformulated as follows: how can we instruct an MS to execute a given task?

And once again the answer is via the SDK: ExecuteMonitoringTask or, since we decided to use PowerShell, the Start-Task cmdlet.

In summary:

  • In the management pack we have the “real” restart and ping tasks, with the logic of restarting and pinging, targeted to the ManagementServer Class. These are internal tasks not supposed to be executed by the operators.
  • In the management pack we have the “proxy” restart and ping tasks, with the logic of identifying the MS and starting the “real” tasks on the proper MS, targeted to the HealthServiceWatcher Class.
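The post does not list the code of the "real" tasks, so the following is only a minimal sketch of what restart and ping logic run from an MS or Gateway could look like, assuming plain WMI (Win32_PingStatus and Win32_Service) over RPC. The function names and parameters are hypothetical and are not taken from the MP.

```powershell
# Sketch only: function names and parameters are hypothetical, not taken from the MP.

function Test-AgentPing {
    param([string]$ComputerName)
    # Win32_PingStatus sends one ICMP echo from the machine running this code.
    $ping = Get-WmiObject -Class Win32_PingStatus -Filter "Address='$ComputerName'"
    # StatusCode 0 means an echo reply was received.
    return ($ping.StatusCode -eq 0)
}

function Restart-AgentHealthService {
    param([string]$ComputerName, [int]$TimeoutSeconds = 60)
    $filter = "Name='HealthService'"
    $svc = Get-WmiObject -Class Win32_Service -ComputerName $ComputerName -Filter $filter
    if ($svc -eq $null) { throw "HealthService not found on $ComputerName" }

    [void]$svc.StopService()

    # Poll until the service reports Stopped or the timeout expires.
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        Start-Sleep -Seconds 2
        $svc = Get-WmiObject -Class Win32_Service -ComputerName $ComputerName -Filter $filter
    } while ($svc.State -ne 'Stopped' -and (Get-Date) -lt $deadline)

    if ($svc.State -ne 'Stopped') {
        throw "HealthService did not stop within $TimeoutSeconds seconds"
    }

    # ReturnValue 0 means the start request was accepted.
    $result = $svc.StartService()
    return ($result.ReturnValue -eq 0)
}
```

Because this code runs on the MS or Gateway, the ping and the RPC connection originate from inside the customer network, and the account running it needs administrative rights on the agent. That is why the correct run-as account has to be deployed to the MSs or Gateways.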

The “proxy” task code is the following:

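# $wid is the HealthServiceId of the HealthServiceWatcher instance;
# $taskName and $overrides are assumed to be passed in as script parameters.
$mg = (Get-Item .).ManagementGroup
$admin = $mg.GetAdministration()
$hsa = $admin.GetComputerHealthServiceByHealthServiceId($wid)
$msa = $hsa.GetPrimaryManagementServer()
$ms = $msa.HostedHealthService
$TaskCrit = New-Object -TypeName Microsoft.EnterpriseManagement.Configuration.MonitoringTaskCriteria -ArgumentList ("Name='" + $taskName + "'")
# Not in the original listing, where $task was never assigned: presumably the "real" task is looked up with the criteria built above.
$task = $mg.GetMonitoringTasks($TaskCrit)
$TaskOverrides = Invoke-Expression $overrides
$result = Start-Task -Task $task[0] -TargetMonitoringObject $ms -Overrides $TaskOverrides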
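The listing never shows what $overrides contains. The Invoke-Expression call suggests it is a string holding a PowerShell hashtable literal, which is then handed to Start-Task. A minimal sketch under that assumption follows. The override names and values are hypothetical; the real ones depend on the parameters the "real" tasks expose.

```powershell
# Hypothetical value: in the MP this string would arrive as a task parameter.
$overrides = '@{ ComputerName = "agent01.contoso.local"; TimeoutSeconds = 30 }'

# Invoke-Expression evaluates the literal and returns a real Hashtable.
$TaskOverrides = Invoke-Expression $overrides

$TaskOverrides.GetType().Name      # Hashtable
$TaskOverrides['ComputerName']     # agent01.contoso.local
```

Keep in mind that Invoke-Expression executes whatever the string contains, so this pattern is only reasonable when the string comes from the management pack itself and not from untrusted input.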

The net result for the ping task is the following (screenshot not reproduced here).


Using this same technique you can execute any task from the proper MS.

I hope this helps lower your administrative burden in distributed OpsMgr architectures.

You can find the MP here: QuaueNocentDocent.AgentManagement.Tasks.xml.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

  1. #1 by Vratix on September 21, 2009 - 6:55 pm

    This is a great blog entry and answers the very issues I’ve been troubleshooting for weeks. Thank you again for this.
