Recently I was engaged by my support engineers, who were stuck on several critical alerts raised by the SharePoint management pack. The most cryptic of them all relates to the "Ping Web Application failed" monitor. The alert message states "A critical incident has occurred where the ping to the Web Application <url> has failed." Not very friendly indeed, and the linked knowledge base article is fairly generic and doesn’t help either.
I had to dissect the monitor and write a quick guide to troubleshooting such an alert, since it is raised for many different error conditions. In this post I’m going to share this guide.
First of all some basic info on the monitor:
- the monitor simply uses the standard URL monitoring probe to try an HTTP GET against the discovered web apps
- if the HTTP status code is > 400 an error is raised; an error is raised for any communication issue as well
- the monitor runs every 1,800 seconds (30 minutes)
- since the WebApplication class is non-hosted, the monitor is executed from the RMS; this basically invalidates the monitor when the web application is not reachable from the RMS
<ProbeAction ID="Probe" TypeID="SCWebApp!Microsoft.SystemCenter.WebApplication.UrlProbe">
<Value>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</Value>
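To reproduce by hand what the list above describes, a minimal sketch in Python can help. It is not the UrlProbe module itself, just an approximation built on Python's standard `urllib`: the function names `probe` and `evaluate` are made up for illustration, the user agent is the one from the probe configuration, the 20 second timeout is the default mentioned later in this post, and the "> 400" threshold is taken as stated above.

```python
import urllib.error
import urllib.request

# User agent string taken from the probe configuration shown in this post.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
# Default timeout of the monitor as reported in this post (not overridable there).
TIMEOUT_SECONDS = 20


def probe(url):
    """Try an HTTP GET; return (status_code, transport_error).

    Exactly one of the two values is None. This only approximates the
    OpsMgr probe, it does not call it.
    """
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            return response.status, None
    except urllib.error.HTTPError as exc:
        # The server answered, but with an HTTP error status (401, 404, 500...).
        return exc.code, None
    except (urllib.error.URLError, OSError) as exc:
        # Name resolution failure, refused connection, timeout and similar.
        return None, str(exc)


def evaluate(status_code, transport_error):
    """Apply the rule described above: any communication issue or status > 400."""
    if transport_error is not None:
        return "error"
    return "error" if status_code > 400 else "healthy"
```

Calling `evaluate(*probe(url))` with the URL of one of your web apps, from the RMS itself, should give a rough idea of what the monitor sees from there. Note that `urllib` follows redirects and does not send Windows credentials, so results can differ from the real probe.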
First of all, if the SharePoint web application is not reachable from the RMS, just disable the monitor: it’s useless. I want to spend some more time on this specific topic, since you can find wrong information in forums. In this case you’ll probably get error code 2147954407, ERROR_WINHTTP_NAME_NOT_RESOLVED (see below on how to check for this specific error): typically, but not necessarily, your RMS won’t be able to resolve the web app name. You can also get a timeout error, which happens when the RMS is somehow able to resolve the name but has no way to establish an HTTP/S connection to the SharePoint server.
The most interesting part of this scenario is that if you override the URL in the monitor, specifying the same URL reported in the error, the monitor turns green and you are led to think that your web app is now monitored, right? Wrong! I have read this answer many times in forums, but in this scenario it is dead wrong: you aren’t monitoring anything, you just turned the monitor green thanks to a little bug in the monitor implementation. The monitor doesn’t check for all the error codes the URL probe can return, so it turns green even if the URL probe failed:
26F8.08D4::02/01/2012-09:39:24.192 [ModulesLibrary][URLProbe] [OnNewDataItems] New Batch of 1 DataItems delivered on port 0, completion callback 0000000000000000 context 0000000000000000
26F8.08D4::02/01/2012-09:39:24.192 [ModulesLibrary][URLProbe] [OnNewDataItems] Recieved DataItem <DataItem type="System.TriggerData" time="2012-02-01T09:39:24.0073161+01:00" sourceHealthServiceId="77A47D52-ABB9-C21C-FC0A-96E2E3611946"/>
26F8.08D4::02/01/2012-09:39:24.192 [ModulesLibrary][URLProbe] [IModuleHost->NotifyError] Module reported Fatal Error Message: <<Message cannot be displayed>> Source: Health Service Modules Id: -1073731322 Type: Error Category: 0
26F8.08D4::02/01/2012-09:39:24.192 [ModulesLibrary][URLProbe] [IModuleHost->NotifyError] succeeded
26F8.08D4::02/01/2012-09:39:24.192 [ModulesLibrary][URLProbe] [BatchedMonitoringModuleBase::OnNewDataItems] CallWorkerDoWorkItems ( portId, ppDataType, dwDataTypeCount, isBatchASet, pCompletionCallback, uCompletionContext ) completed successfully
Give it a try: just override the monitor URL parameter with anything you want (wrong or correct doesn’t make any difference) and you’ll get two results:
- the monitor will turn green
- and event id 10502 will be logged on the RMS
Credit for this analysis goes to my colleague Marco Adamo.
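To tell which of the two unreachable cases described earlier applies (the name does not resolve, or it resolves but no connection can be established), a quick check can be run from the RMS. The following is only a sketch using Python's standard `socket` module; the function name `diagnose` and the 5 second timeout are my own choices for illustration, not something taken from the management pack.

```python
import socket


def diagnose(host, port=80, timeout=5.0):
    """Return 'name_not_resolved', 'cannot_connect' or 'ok' for host:port."""
    try:
        # Step 1: can this machine resolve the web app name at all?
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "name_not_resolved"

    # Step 2: try a plain TCP connection to each resolved address.
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Refused, timed out or unroutable: try the next address, if any.
            continue
    return "cannot_connect"
```

A result of `name_not_resolved` corresponds to the 2147954407 case, while `cannot_connect` corresponds to the situation where the name resolves but the RMS has no path to the SharePoint server. An `ok` here only proves TCP reachability; it says nothing about the HTTP response.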
The really useful info, as is often the case, is in the state change context. This means Health Explorer needs to be invoked from the alert; you won’t find any useful clue in the alert context itself. In the state change context you can find all the info returned by the HTTP GET. The most important items are:
- the URL that’s actually been probed (so that you can try by yourself)
- the HTTP status Code
- the Error Code
In the screenshot, error 2147954402 means the GET has timed out (the default timeout is 20 seconds and cannot be changed by an override, sigh). You can check this table (http://technet.microsoft.com/en-us/library/dd348508(WS.10).aspx) for the meaning of the error codes. A timeout can be considered a transient error (if it resolves), while other errors, for example 2147954429 (cannot connect), are more concerning.
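The decimal error codes are easier to look up once you notice they are HRESULT values wrapping a WinHTTP error number: 2147954402 is 0x80072EE2, and the low 16 bits (0x2EE2) are 12002, the WinHTTP timeout. A small Python sketch of that decoding follows; the function name `decode_error` is hypothetical and the lookup table holds only the three codes mentioned in this post, not the full list.

```python
# Only the three WinHTTP errors discussed in this post; not a complete table.
WINHTTP_ERRORS = {
    12002: "ERROR_WINHTTP_TIMEOUT",
    12007: "ERROR_WINHTTP_NAME_NOT_RESOLVED",
    12029: "ERROR_WINHTTP_CANNOT_CONNECT",
}

FACILITY_WIN32 = 7  # HRESULT facility used for wrapped Win32/WinHTTP errors


def decode_error(error_code):
    """Split a decimal error code into (hex HRESULT, low 16-bit code, name)."""
    hresult = error_code & 0xFFFFFFFF      # also copes with negative signed input
    facility = (hresult >> 16) & 0x1FFF    # bits 16-28 of the HRESULT
    low_code = hresult & 0xFFFF            # bits 0-15: the wrapped error number
    if facility == FACILITY_WIN32:
        name = WINHTTP_ERRORS.get(low_code, "unknown (not in this small table)")
    else:
        name = "not a Win32-facility HRESULT"
    return f"0x{hresult:08X}", low_code, name


print(decode_error(2147954402))  # ('0x80072EE2', 12002, 'ERROR_WINHTTP_TIMEOUT')
```

The same call with 2147954407 and 2147954429 yields 12007 and 12029, which are the name resolution and cannot connect errors respectively.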
- this monitor tries a GET, so it needs to have the proper rights to connect to the web app and it needs to resolve the web app name correctly
- SharePoint strictly checks the host header, and a site can have more than one name: if the monitor checks a decommissioned URL, it can go into error even if users are able to connect to the same site using a different URL
- sometimes SharePoint hosts under heavy load can take a long time to initialize web apps, especially apps that are scarcely used; this can cause a lot of noise. The monitor doesn’t expose the timeout parameter, so you have just two choices: disable the monitor for such an app or write your own monitoring using the standard web monitoring capabilities of OpsMgr.
This posting is provided "AS IS" with no warranties, and confers no rights.