Extending #MSOMS Agent health solution


I usually lecture my customers that there is only one thing worse than an unmonitored critical service: a critical service you think is monitored when in reality it is not. After all, it is exactly the “who audits the auditor” question.

Every monitoring system is prone to bugs that diminish its reliability. The problem with collecting monitoring data is that, more often than not, what is lost is lost forever, and the missing information is exactly what you will be asked for (one of the many corollaries of Murphy’s law).
OMS is no exception, and this is why the team released the Agent Health & Heartbeat solution.
This solution sends a heartbeat with a data payload to Log Analytics (LA) every few minutes (configurable); if you don’t get a heartbeat, you can assume your agent is down or experiencing issues. So far so good, so what more do I need, you might think.
Well, agents on both Windows and Linux are fairly complex get & send robots. They get configuration items from the controller, execute those items by querying specific internal and external providers, and send the collected data back to the controller over a common channel. What the heartbeat solution tests is the agent’s ability to execute one rule (the heartbeat provider) and send the results back to LA.
Long story short, you can have heartbeating agents that are not returning all the data they’re supposed to. For example, you may discover that you’re missing perf data for certain agents or that some Linux machines are not returning syslog data. The problem is that by the time you realize this, it’s too late. This is exactly what happened to one of our customers who uses LA to monitor production systems.
Too bad, I thought, let’s build a few more alerts to be notified if an agent goes nuts (by the way, it is not always the agent’s fault; often it is just the external provider not working).
For example, to check for all the agents not returning Performance data, you can query with:

* Computer NOT IN {Type:Perf | measure count() by Computer} | measure count() by Computer

Easy, isn’t it? It would indeed have been easy if the Computer field were case insensitive, but alas it is not.
So the query returns false positives, because server1 is different from Server1 and the various Type providers in OMS/LA return the computer name in different cases. This is going to change, but for now this is the state of the art.
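
To make the false positives concrete, here is a minimal Python sketch of the same comparison done outside LA. The function name and the sample computer names are made up for illustration; the only point is how a case-sensitive comparison differs from a case-insensitive one.

```python
def missing_computers(heartbeat_names, perf_names, normalize=False):
    """Return the heartbeating computers that have no Perf records."""
    # With normalize=True names are compared case-insensitively,
    # otherwise they are compared exactly as reported.
    key = str.casefold if normalize else (lambda name: name)
    reporting = {key(name) for name in perf_names}
    return sorted(name for name in set(heartbeat_names) if key(name) not in reporting)


# Hypothetical sample data: the same machine reported with different casing
heartbeat = ["Server1", "server2", "SERVER3"]
perf = ["server1", "server2"]

case_sensitive = missing_computers(heartbeat, perf)
case_insensitive = missing_computers(heartbeat, perf, normalize=True)
print(case_sensitive)    # Server1 shows up here although it does return Perf data
print(case_insensitive)  # only the machine that is really missing Perf data
```

With exact matching, Server1 is flagged even though server1 is returning Perf data, which is the kind of false positive the query above produces.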
So let’s try another way. Let’s say I want to know all the computers that are heartbeating but not returning Perf data, something like:

* Computer IN {Type:Heartbeat| measure count() by Computer} AND Computer NOT IN {Type:Perf | measure count() by Computer} | measure count() by Computer

or

* Computer NOT IN {Type:Perf Computer IN {Type:Heartbeat| measure count() by Computer} | measure count() by Computer} | measure count() by Computer

Alas, it seems that combining IN / NOT IN clauses is not allowed yet.

So what can you do if you need this check today?
Fortunately, LA is an open system where you can not only query data but also ingest your own. See for example this post or the official documentation.
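
As a rough idea of what ingesting your own data involves, the sketch below builds the authorization header for a POST, assuming the ingestion path is the Log Analytics HTTP Data Collector API and that my recollection of its SharedKey signing scheme is accurate; check it against the official documentation before relying on it. The workspace ID, key and record are placeholders, and nothing is actually sent.

```python
import base64
import hashlib
import hmac
import json


def build_signature(workspace_id, shared_key, date_rfc1123, content_length):
    """Build the SharedKey Authorization header value for a POST to /api/logs."""
    # The string to sign lists method, body length, content type,
    # the x-ms-date header and the resource path, one per line.
    string_to_sign = "\n".join([
        "POST",
        str(content_length),
        "application/json",
        "x-ms-date:" + date_rfc1123,
        "/api/logs",
    ])
    key_bytes = base64.b64decode(shared_key)
    digest = hmac.new(key_bytes, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return "SharedKey {}:{}".format(workspace_id, base64.b64encode(digest).decode("ascii"))


# Placeholder values for illustration only, not real credentials
workspace_id = "00000000-0000-0000-0000-000000000000"
shared_key = base64.b64encode(b"not-a-real-key").decode("ascii")
body = json.dumps([{"Computer": "server1", "Type": "Perf", "Points": 0}]).encode("utf-8")

authorization = build_signature(
    workspace_id, shared_key, "Mon, 01 Jan 2018 00:00:00 GMT", len(body)
)
print(authorization.split(":")[0])
```

The resulting value would go into the Authorization header of the request, next to the x-ms-date header and a Log-Type header that names the custom type. As far as I know, LA then appends _CL to the type name and a type suffix such as _s or _d to each field.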

Important: this is a temporary solution. I’m pretty sure the Computer field will become case insensitive and the search language will evolve so that this tiny solution is no longer needed. But if you need it today, feel free to use it; just remember that it is a quick and dirty solution that lacks documentation and error management. I will evolve it only if I learn that case insensitivity for the Computer field will take long to implement.

What I did was create a PowerShell script that implements this pseudocode:

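# All the data types present in the workspace
$types = Get-QueryResults '* | measure count() by Type'
# Last timestamp and number of points for each Computer/Type pair that returned data
$collectedTypes = Get-QueryResults '* | measure max(TimeGenerated) As LastData, count() As Points by Computer, Type | sort Computer'
# Fill in the Computer/Type pairs that returned no data
$dataset = Invoke-PopuleMissingData $types $collectedTypes
# Send the completed dataset to LA as a custom type
postto-OMS $dataset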

The complete source code can be found here.
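
The step that fills in the missing data is the heart of the pseudocode, so here is a minimal Python illustration of the idea. It is not the script itself: the function name, the record layout, the sample rows and the choice to lower-case the computer name are all assumptions made for the example.

```python
def populate_missing_data(types, collected):
    """Return one record per (computer, type) pair, with zero points for absent pairs.

    types     -- iterable of type names seen in the workspace
    collected -- iterable of dicts with Computer, Type, LastData and Points keys
    """
    # Aggregate what was collected, folding computer names to one case so
    # that Server1 and server1 end up in the same bucket.
    totals = {}
    for row in collected:
        key = (row["Computer"].casefold(), row["Type"])
        last, points = totals.get(key, (None, 0))
        if last is None or row["LastData"] > last:
            last = row["LastData"]
        totals[key] = (last, points + row["Points"])

    computers = sorted({computer for computer, _ in totals})
    dataset = []
    for computer in computers:
        for type_name in sorted(types):
            # Pairs that never reported get an explicit zero-point record.
            last, points = totals.get((computer, type_name), (None, 0))
            dataset.append({"Computer": computer, "Type": type_name,
                            "LastData": last, "Points": points})
    return dataset


# Hypothetical sample rows; timestamps are ISO 8601 strings in a single format
types = ["Heartbeat", "Perf", "Syslog"]
collected = [
    {"Computer": "Server1", "Type": "Heartbeat", "LastData": "2017-01-10T10:00:00Z", "Points": 12},
    {"Computer": "server1", "Type": "Perf", "LastData": "2017-01-10T09:59:00Z", "Points": 340},
    {"Computer": "server2", "Type": "Heartbeat", "LastData": "2017-01-10T10:00:00Z", "Points": 12},
]
dataset = populate_missing_data(types, collected)
silent = [(r["Computer"], r["Type"]) for r in dataset if r["Points"] == 0]
print(silent)
```

Writing an explicit zero-point record for every pair that returned nothing is what lets a plain search find the silent agents later, without needing NOT IN at all.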

I then imported the script into Azure Automation as a runbook and scheduled it to run once per hour. What I got is a new Type in LA that I can query:
[Image: omshertabeat-type]

With this quick solution I can now query what I need. For example, “give me all the Linux computers not returning Syslog or Perf data” becomes:

Type:QNDHeartbeatEx_CL (Type_s=Syslog OR Type_s=Perf) Computer IN {Type:Heartbeat OSType=Linux| measure count() by Computer}| measure sum(Points_d) by Computer, Type_s | where AggregatedValue=0 | sort Computer

Similarly for Windows: let’s say all my Windows systems should be collecting Events, Security Events and Performance points:

Type:QNDHeartbeatEx_CL (Type_s=Event OR Type_s=SecurityEvent OR Type_s=Perf) Computer IN {Type:Heartbeat OSType=Windows | measure count() by Computer}| measure sum(Points_d) by Computer, Type_s | where AggregatedValue=0 | sort Computer

As you can see, I do indeed have agents that are not performing the way I’d like:

[Image: omshertabeat-missing]
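
For anyone who wants to reason about what the two queries above compute, the sketch below mirrors them in Python over a handful of made-up records. The field names follow the custom type (Type_s, Points_d); the function name and the data are assumptions for illustration.

```python
from collections import defaultdict


def silent_agents(records, heartbeating, expected_types):
    """Return sorted (computer, type) pairs whose summed points are zero."""
    alive = {name.casefold() for name in heartbeating}
    totals = defaultdict(float)
    for rec in records:
        computer = rec["Computer"].casefold()
        # Keep only heartbeating computers and the types we expect from them,
        # like the Computer IN {...} and Type_s filters in the query.
        if computer in alive and rec["Type_s"] in expected_types:
            totals[(computer, rec["Type_s"])] += rec["Points_d"]
    # Equivalent of: measure sum(Points_d) ... | where AggregatedValue=0
    return sorted(pair for pair, total in totals.items() if total == 0)


# Hypothetical records as they could come back from the custom type
records = [
    {"Computer": "linux1", "Type_s": "Syslog", "Points_d": 0.0},
    {"Computer": "linux1", "Type_s": "Perf", "Points_d": 120.0},
    {"Computer": "linux2", "Type_s": "Syslog", "Points_d": 15.0},
    {"Computer": "linux2", "Type_s": "Perf", "Points_d": 0.0},
    {"Computer": "win1", "Type_s": "Perf", "Points_d": 0.0},
]
result = silent_agents(records, heartbeating=["Linux1", "linux2"],
                       expected_types={"Syslog", "Perf"})
print(result)
```

In this example win1 is left out because it is not in the heartbeating list, just as the Computer IN {...} clause restricts the query to agents of one OS type that are still alive.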

Now it’s easy to create alerts for agents exhibiting odd behavior and/or to create a standard OMS/LA solution for an extended Agent Health view.

[Image: omshertabeat-solution]

I hope this is useful to someone else while waiting for LA to make it unnecessary.

-Daniele
This posting is provided “AS IS” with no warranties, and confers no rights
