Things never come easy. I should have known that this Cluster Shared Volume (CSV) monitoring story would lead to more work. But, after all, this is what Quae Nocent Docent (what hurts teaches) is here for.
In my last post and project on CSV monitoring I started from a wrong assumption based on faulty observations: that the cluster group keeps count of everything that happens on CSVs. This is not the case. It is (now) obvious that CSV access is distributed across nodes: every node knows only its own share of IOs and its own view of responsiveness for the shared volume. As you can see in the following screenshot, the total CSV IO is the sum of the IOs on the nodes (2A, 2B), while the cluster group network name collects only the IOs from the node it is currently hosted on.
Standard cluster disks, on the contrary, aren’t shared by multiple nodes, so it’s correct to use the owning group network name to collect performance data and monitor status.
The net result is that I had to rewrite the management pack from scratch, modifying the health model to take this distributed model into account. The first consequence is that the new management pack is not backward compatible with the previous one: the old one must be removed before installing version 18.104.22.168 and above. The second consequence is that I got in touch with Sergei Jemeljanov, who added some views to the original management pack and suggested disabling all monitors and rules by default (with the exception of discovery rules) and then delivering a couple of override management packs to enable the monitoring. As a result, the management pack is now available not only on TechNet Gallery but also as a CodePlex project that you are encouraged to contribute to.
I had the following requirements:
– integrate the new health model with the existing one, where the CSV is monitored by the cluster group in terms of availability and space
– CSV state must contribute to Windows Cluster state
– Since CSV access is distributed, it’s handy to have a container object that sums up all the contributing nodes for a specific CSV. This way performance views, dashboards and reports for each and every CSV can easily be set up starting from this container.
– By default, I don’t want to be alerted by every single node when a CSV has performance issues. Rather, I need a single alert, and then I must be able to use Health Explorer to drill into the issue.
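The last two requirements can be illustrated with a minimal sketch. It assumes a worst-of rollup, where the per-CSV container takes the state of its unhealthiest node and only the container raises an alert. The rollup policy, the `Health` enum, the function names and the CSV name `Volume1` are all hypothetical and are not taken from the management pack itself; the node names 2A and 2B are the ones from the screenshot above.

```python
from enum import IntEnum


class Health(IntEnum):
    """Hypothetical health states, ordered from best to worst."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def rollup(node_states):
    """Worst-of rollup: the container is as unhealthy as its worst node."""
    return max(node_states.values(), default=Health.HEALTHY)


def alerts_for(csv_name, node_states):
    """Return at most one alert per CSV, naming the nodes in the worst state."""
    state = rollup(node_states)
    if state == Health.HEALTHY:
        return []
    culprits = sorted(n for n, s in node_states.items() if s == state)
    return [f"{csv_name}: {state.name} (nodes: {', '.join(culprits)})"]


# One CSV seen from two nodes; only node 2B reports a latency problem.
states = {"2A": Health.HEALTHY, "2B": Health.WARNING}
alerts = alerts_for("Volume1", states)
print(alerts)
```

However many nodes are degraded, the CSV produces a single alert, and the list of nodes in it plays the role that Health Explorer plays in the console.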
In the following diagram the green boxes are the classes added by the management pack and the red arrows are the added relationships.
In the resulting health model the availability and performance nodes roll up to the cluster object. Availability is based on the free space monitoring delivered by the Microsoft Management Pack, while the performance node is built from the QND performance monitors. Both contain a group for every single CSV; under the performance node, each CSV group also holds the performance perspective of every contributing node.
The management pack implements the following collections and monitors:
- Cluster Disk – Monitor – Disk latency
- Cluster Disk – Collection – Overall disk response time (sec/transfer)
- Cluster Disk – Collection – Disk reads per second
- Cluster Disk – Collection – Disk writes per second
- Cluster Disk – Collection – Free MB
- Cluster Disk – Collection – Free space %
- CSV – Monitor – Disk Read Latency
- CSV – Monitor – Disk Write Latency
- CSV – Monitor – Disk Redirected Read Latency
- CSV – Monitor – Disk Redirected Write Latency
- CSV – Collection – Disk Read Latency
- CSV – Collection – Disk Write Latency
- CSV – Collection – Disk Reads per second
- CSV – Collection – Disk writes per second
- CSV – Collection – Disk Redirected Read Latency
- CSV – Collection – Disk Redirected Write Latency
- CSV – Collection – Disk Redirected Reads per second
- CSV – Collection – Disk Redirected writes per second
The management pack adds a few views under the Microsoft Windows Cluster folder tree.
Under the “CSV State” view all the CSVs are listed with the contributing nodes and specific monitoring scope:
The management pack monitors CSVs on Windows Server 2012 and above. I don’t have a Windows Server 2008 R2 cluster to test the MP on, so I set the discovery to consider only Windows Server 2012 and above.
The management pack badly lacks reporting. Given the distributed nature of CSVs, the standard performance reporting falls short. We need a report that sums IOs across nodes and computes a weighted average of response times across nodes. More work to be done.
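To make the arithmetic concrete, here is a minimal sketch of what such a report would have to compute for one CSV. It assumes that each node's response time is weighted by that node's IO rate, which is one reasonable reading of "weighted average" but not something the management pack implements today. The `NodeSample` shape, the function name and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """One node's view of a CSV over a reporting interval (hypothetical shape)."""
    node: str
    ios_per_sec: float       # IOs per second seen by this node
    sec_per_transfer: float  # average response time seen by this node


def summarize_csv(samples):
    """Return (total IOs/sec, IO-weighted average response time) for one CSV."""
    total_ios = sum(s.ios_per_sec for s in samples)
    if total_ios == 0:
        # No IO on any node: the weighted average is undefined.
        return 0.0, None
    weighted = sum(s.ios_per_sec * s.sec_per_transfer for s in samples) / total_ios
    return total_ios, weighted


# Made-up figures: node 2A does most of the IO and is fast, node 2B is slower.
samples = [
    NodeSample("2A", ios_per_sec=300.0, sec_per_transfer=0.004),
    NodeSample("2B", ios_per_sec=100.0, sec_per_transfer=0.012),
]
total, latency = summarize_csv(samples)
print(total, latency)
```

With these figures the weighted response time comes out at about 0.006 sec/transfer, whereas a plain average of the two nodes would give 0.008 and overstate the latency that most IOs actually experience.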
This posting is provided “AS IS” with no warranties, and confers no rights.