Hyper-V Monitoring with #scom #sysctr

History

May 9, 2015 – Ignite version 1.1.0.15. Moved the repository to GitHub, pulled in @MrTaoYang’s contributions.

April 14, 2015 – Added tasks to manage virtual machines. MP version 1.0.0.109 released.

April 9, 2015 – Added more options for Hyper-V monitoring, both community-based and commercial. MP version 1.0.0.104 released.

April 3, 2015 – Original post.

Recently I’ve been involved in an issue where a set of virtual machines stopped replicating and OpsMgr didn’t raise any alert. Obviously the customer only became aware of it when they needed the replica VMs. Too bad.

This situation prompted me to check how we were monitoring Hyper-V, and I didn’t like what I found.

Hyper-V Monitoring State of the Art

In terms of Hyper-V monitoring with OpsMgr this is what we can do today:

Use the OpsMgr to Virtual Machine Manager (VMM) integration (basically the VMM management packs plus some tuning). This level of monitoring relies on VMM for all discovery data collection. It does very light VM monitoring: basically just state and integration services, plus a couple of performance counters (CPU count and runtime in seconds). This MP relies on SDK calls, so it’s not applicable to every architecture and scenario: whenever VMM and OpsMgr are in untrusted domains, or the VMM hosts cannot access the OpsMgr SDK, this integration is not going to work.
Use the Microsoft-provided Hyper-V management pack. In practice you should deploy both the VMM management pack and this one to get anything like “complete” monitoring; for example, the VMM one doesn’t monitor Hyper-V Replica. This MP performs basic Hyper-V host monitoring, probably enough in most cases but certainly not as accurate as it should be, and light VM monitoring with no performance data collected and only the replica scenario covered (alas, it is broken).
Buy the Veeam Hyper-V management pack. This is a fairly complete management pack that covers a lot of scenarios. We couldn’t expect less from Veeam after the excellent VMware MP. Obviously this is a commercial product and you have to pay for it. My customers are not fond of that: they expect Microsoft to be able to monitor its own hypervisor platform with its own monitoring solution at no additional cost.
Use one of the community management packs you can find on CodePlex; alas, they’re far from optimal. To my knowledge they all use a hosted model for VMs, and some scenarios are missing.
Buy the Savision CloudReporter management pack, less expensive than Veeam but probably less exhaustive as well. You still have to pay to monitor Hyper-V. (By the way, CloudReporter monitors VMware, too.)

If we exclude the good Veeam solution, what we have is very poor. The monitoring is not only poor, it is actually broken for virtual machines:

The replica scenario doesn’t work or, better said, it doesn’t cover every case: the monitoring is tied to very specific replica failure codes, and not every possible combination is covered.
The VM health model and discovery are basically broken because VMs are hosted objects. This is a very bad design: VMs are, by definition, moving objects. Hosting a VM object means that every time a VM is migrated the current object is deleted and a *new* one is created, so there is no chance of complete life cycle monitoring with history, performance collection and so on. Moreover, since the discovery runs every 12 hours, every time a VM is moved there’s a gap in monitoring of up to 12 hours.
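
To make the hosted-object problem concrete, here is a minimal Python sketch. It is not management pack code, and every name in it is hypothetical. It only models object identity: in a hosted model the host is part of the key, so the same VM seen on two hosts becomes two objects, while a key based only on the VM's own ID (assumed here to stay the same across migrations) keeps a single object and therefore a single history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmObservation:
    """One discovery result: which VM was seen on which host (illustrative only)."""
    vm_id: str  # the VM's own ID, assumed stable across migrations
    host: str   # the Hyper-V host the VM was running on at discovery time


def hosted_key(obs: VmObservation) -> tuple[str, str]:
    """Hosted model: the host is part of the object's identity."""
    return (obs.host, obs.vm_id)


def unhosted_key(obs: VmObservation) -> str:
    """Unhosted model: identity depends only on the VM itself."""
    return obs.vm_id


def count_objects(observations, key) -> int:
    """Number of distinct monitored objects a series of discoveries produces."""
    return len({key(o) for o in observations})


# The same VM discovered before and after a migration (made-up names).
history = [
    VmObservation(vm_id="vm-42", host="HV01"),
    VmObservation(vm_id="vm-42", host="HV02"),
]

hosted_objects = count_objects(history, hosted_key)      # history split in two
unhosted_objects = count_objects(history, unhosted_key)  # history kept together
print(hosted_objects, unhosted_objects)
```

With the hosted key, the migration produces a second object and the first one's history is orphaned; with the unhosted key, the object survives the move.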

The community effort

All this said, it was inevitable that I would end up building a community management pack. The management pack tries to address the above limitations, but it’s still very far from being as comprehensive as Veeam’s, and I really think I will never have the time to get there. After all, if anyone needs that kind of comprehensive monitoring, a commercial product is probably worth the cost. On the other hand, the management pack is far from finished; it will be a work in progress for at least the next 6 months. So let’s see what we have with the current version (1.0.0.84):

Comprehensive discovery with at least the same information you can get from the VMM integration. Some of the properties discovered:
- OS and platform
- Farm (very useful to create dashboards and groups)
- Hardware configuration
- Connected VHD(x)
- Connected NICs
An optimized discovery that is triggered when VMs are moved, started or stopped
A complete replica monitoring scenario with diagnostics, recoveries (disabled by default) and tasks to resume replica
A couple of monitors related to GPUs and RemoteFX inherited from the MS-provided MP, disabled by default since I don’t consider them useful. In any case, I didn’t want to lose monitoring capabilities.
Integration services obsolescence (only for Windows VMs, since I don’t know how to update *nix VMs now that MS no longer releases ISs for platforms that have the services included in the kernel)
VHD(x) fragmentation level monitoring. This one can be noisy, but if you have fragmented VHDs you had better know it, since the performance impact can be significant
VM uptime in % over the observation period (as you wished @Stas); this can be useful to bill on uptime
VM measures as performance counters for virtual machines with resource metering enabled (Enable-VMResourceMetering): CPU usage, memory usage, normalized IOPS (super useful for capacity planning), network traffic inbound and outbound, and more.
VM performance from perfmon counters (CPU, memory, IOPS, network traffic) *starting from version 1.0.0.104*
Hyper-V host performance from perfmon counters (CPU, memory pressure, running VMs, virtual and logical processors allocated) *starting from version 1.0.0.104*
VM dynamic memory pressure monitor *starting from version 1.0.0.104*
VM management tasks *from version 1.0.0.109*
- Stop / Start / Turn Off / Restart
- Save / Resume
- Live Migrate
- Create / Delete / Restore / List checkpoints
VM preferred host detection. When a clustered/HA VM has preferred owners, the MP checks that the VM is actually running on one of the preferred nodes *from version 1.1.0.15*
VM name mismatch. Checks whether the VM name matches the name reported by the Hyper-V console *from version 1.1.0.15*
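
Two of the checks in the list above, the uptime percentage and the preferred host detection, are simple enough to sketch. The Python below is only an illustration of the logic as described in the list, not the management pack's actual implementation; the function names, the rounding to two decimals, and the choice to treat a VM with no preferred owners as healthy are all assumptions.

```python
def uptime_percent(up_seconds: float, period_seconds: float) -> float:
    """Share of the observation period the VM was up, as a percentage.

    Assumption: uptime is capped at the period length and the result is
    rounded to two decimals.
    """
    if period_seconds <= 0:
        raise ValueError("observation period must be positive")
    capped = min(max(up_seconds, 0.0), period_seconds)
    return round(100.0 * capped / period_seconds, 2)


def on_preferred_owner(current_node: str, preferred_owners: list[str]) -> bool:
    """True when the VM runs on one of its preferred cluster nodes.

    Assumption: an empty preferred-owner list means there is nothing to
    violate, so the check passes. Node names are compared case-insensitively.
    """
    if not preferred_owners:
        return True
    wanted = {node.casefold() for node in preferred_owners}
    return current_node.casefold() in wanted


# Hypothetical usage: a VM that was up 6 hours out of a 24-hour period,
# currently running on node "hv02" with preferred owners HV01 and HV02.
daily_uptime = uptime_percent(6 * 3600, 24 * 3600)
placement_ok = on_preferred_owner("hv02", ["HV01", "HV02"])
print(daily_uptime, placement_ok)
```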

Known issues

The MeteringDuration value returned by the Measure-VM cmdlet becomes null after 35”. Because of this behavior the data written and read per second counters don’t work. Hopefully Microsoft will fix this one.
The measured statistics get reset at every script iteration; this has an impact in a multihomed scenario, or in general whenever cookdown is broken for whatever reason. Thinking about this issue, I will probably stop resetting the measures at every iteration and reset them, say, every 30 days instead, turning all the performance collection rules into delta collections. I do need to reset the statistics at some point, since I don’t know what can happen to those counters after an extended period of time. Working on it, no due date yet. *Fixed in version 1.0.0.102: measures now get reset every 30 days by default, and this can be disabled. Measures are now collected as deltas and are multihoming friendly.*
The VM uptime percentage sometimes isn’t accurate to the second decimal place, due to rounding and to VM stats not being updated frequently enough. Working on this issue; it will be fixed in the next version. I need to find a workaround for the limited precision Hyper-V has on uptime.
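
The move from "reset at every iteration" to delta collection can be illustrated with a short Python sketch. This is not the management pack's code: the function name and the sample values are made up, and the rule that a reading lower than the previous one means the counter was reset is an assumption of the sketch. The idea is that each collector keeps its own previous reading and computes the difference, so nobody has to reset the shared counters on every run.

```python
def delta_samples(readings):
    """Turn cumulative counter readings into per-interval deltas.

    Assumption: if a reading is lower than the previous one, the counter was
    reset in between (for example by the periodic reset), so the new reading
    itself is taken as the delta for that interval.
    """
    deltas = []
    previous = None
    for value in readings:
        if previous is not None:
            if value >= previous:
                deltas.append(value - previous)
            else:
                deltas.append(value)  # counter restarted from zero
        previous = value
    return deltas


# Made-up cumulative readings with one reset between the third and fourth.
samples = [100, 150, 210, 30, 90]
deltas = delta_samples(samples)
print(deltas)
```

Because the deltas depend only on each collector's own previous reading, two independent collectors reading the same counters would not disturb each other, which is one plausible reading of why the delta approach is described as multihoming friendly.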

What’s missing

A lot
It only covers Windows Server 2012 R2 Hyper-V
Knowledge base articles at all levels
Documentation
Better presentation through SquaredUp dashboards – this is what I’m currently working on
Some performance collection for the Hyper-V host, the actual CPU usage (parent + children partitions) is already in the next version I’m currently testing. *Added in version 1.0.0.104*
Tasks to enable VM measures
More thought on which VM performance counters to collect, making them multihoming compatible, and finding a way to work around some bugs in the current Hyper-V implementation of VM measures *Fixed in version 1.0.0.104*
Should I alert on down VMs? After how long? How should down VMs be handled by default, or should there be no down VMs on production systems at all? (Apart from replica VMs, but that’s manageable.)

What’s next

For this to be a truly community-based effort I need your help with real-world monitoring scenarios not currently covered, how-tos (for example, is there a way to keep Integration Services up to date for *nix VMs?), KB articles, cool icons, ideas for SquaredUp dashboards and so on.

On my part, I used a few tricky MP development techniques that I want to document once and for all:

Dynamic event based discovery
Dynamic monitoring retargeting
Cookdown for performance data mapped from property bags

I’m going to write a specific article for each of these, primarily for my own memory, hoping they will be useful to some other MP developer out there.

Where can I find it?

On GitHub: https://github.com/brandubh/HypervMP

On TechNet Gallery: https://gallery.technet.microsoft.com/Hyper-V-2012-R2-management-2e067735

Daniele

This posting is provided “AS IS” with no warranties, and confers no rights.

This entry was posted on April 3, 2015, 6:22 pm and is filed under Hyper-v, MP, SCOM, System Center.

Quae Nocent Docent