OpsMgr 2012 – agents across slow WAN links are unable to communicate


WARNING. This critical issue can affect new deployments as well as upgrades from Operations Manager 2007 R2. I can confirm it is present in System Center 2012 Operations Manager UR2; I have not yet had a chance to test it with the SP1 beta. This is a rapid-publishing article, so I will cover only the essential information needed to understand the issue.

How the issue manifests itself

When you’re affected by the issue you’ll have the following symptoms:

·         Newly installed agents stay grey in the console; they may or may not collect some data every now and then, but they generally stay grey

·         In the agent’s Operations Manager event log you get event ID 20070 with source OpsMgr Connector

·         If you change the agent authentication to use certificates, nothing changes

The first key piece of information is contained in the log entry: “the connection was closed immediately after authentication occurred”. This means Kerberos authentication is actually working, but the Management Server is closing the connection anyway. This is also why certificate-based authentication doesn’t help (see the section on agent authentication for more information).

The OpsMgr Connector connected to XXXXXXXXX, but the connection was closed immediately after authentication occurred.  The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.  Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

From my research, this specific error relates to a general difficulty the Management Server has in accessing the connecting agent’s properties in Active Directory.

This can affect regular Management Servers as well as Gateways.

When the issue manifests itself

This specific issue manifests itself when the Active Directory lookup takes more than about 1000 msec; this is typical of remote domain controllers across a slow WAN link. The typical scenario is:

·         Management Server installed in domain A in a central site

·         Agents to be managed in domain B (same forest) in a remote location

·         No DCs for domain B are “near” in network terms to the Management Server, so they must be accessed across the WAN link

·         The ICMP latency between the Management Server and the Domain Controller is above 150 msec (this is not a fixed rule)

Background information on agent authentication

To better understand what’s going on here, some high-level information on how an agent authenticates is needed:

·         First, an agent tries Kerberos authentication towards the assigned Management Server.

·         If and only if Kerberos authentication fails (or times out after 1 sec), certificate authentication is tried, provided a certificate has been associated with the agent. [In our case, since Kerberos authentication is successful, the certificate is never tried.]

·         Once Kerberos authentication is successful, the Management Server looks up the Active Directory properties for the agent using the standard Active Directory mechanisms.

o   It binds to the closest Domain Controller

o   It searches for a base object using the agent SID returned by Kerberos authentication [the network trace shows it actually performs a couple of similar searches]

·         If the LDAP lookup is successful, the channel is kept open; otherwise the channel is reset
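
The flow above can be condensed into a small decision function. This is only an illustrative model, not OpsMgr code: the function name, its parameters and the 1000 ms limit are assumptions based on the behaviour described in this post.

```powershell
# Illustrative model only. Get-AgentChannelOutcome and the 1000 ms default
# are assumptions drawn from the observations in this post, not product code.
function Get-AgentChannelOutcome {
    param(
        [bool]$KerberosSucceeded,
        [bool]$HasCertificate,
        [int]$LdapLookupMs,
        [int]$LookupLimitMs = 1000
    )

    if (-not $KerberosSucceeded) {
        # Certificates are tried only when Kerberos fails or times out.
        if ($HasCertificate) { return 'CertificateAuthentication' }
        return 'AuthenticationFailed'
    }

    # Kerberos succeeded: the Management Server now looks up the agent in AD.
    if ($LdapLookupMs -lt $LookupLimitMs) { return 'ChannelKeptOpen' }

    # Slow lookup: the channel is reset and the agent logs event 20070.
    return 'ChannelReset'
}

# An agent with a certificate whose AD lookup takes 1400 ms
Get-AgentChannelOutcome -KerberosSucceeded $true -HasCertificate $true -LdapLookupMs 1400
```

The model makes the key point visible: once Kerberos succeeds, the certificate branch is never reached, so adding a certificate cannot work around a slow lookup.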

How to check

To check whether you may run into the problem, you can first test which domain controller the Management Server is going to use for the agent domain:

Nltest /dsgetdc:<agentdomain>
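
If you need to run this check on several Management Servers, the nltest output can be turned into an object. The sketch below is a hypothetical helper (the name ConvertFrom-NltestDsGetDc is mine). It assumes the output consists of “Label: value” lines including DC, Dc Site Name, Our Site Name and Flags, and that a CLOSE_SITE flag marks a DC in the closest site; verify both assumptions against the output in your environment.

```powershell
# Sketch: ConvertFrom-NltestDsGetDc is a hypothetical helper name.
# Assumes nltest prints "Label: value" lines (DC, Dc Site Name, Our Site Name, Flags).
function ConvertFrom-NltestDsGetDc {
    param([string[]]$Lines)

    $info = @{}
    foreach ($line in $Lines) {
        # Split each detail line at its first colon into label and value.
        if ($line -match '^\s*([^:]+):\s*(.*)$') {
            $info[$Matches[1].Trim()] = $Matches[2].Trim()
        }
    }

    New-Object PSObject -Property @{
        # Strip the leading \\ from the DC name
        DomainController = ($info['DC'] -replace '^\\\\', '')
        DcSite           = $info['Dc Site Name']
        OurSite          = $info['Our Site Name']
        # Assumption: CLOSE_SITE in Flags means the DC is in the closest site
        IsCloseSite      = (($info['Flags'] -split '\s+') -contains 'CLOSE_SITE')
    }
}
```

It would be called as ConvertFrom-NltestDsGetDc -Lines (nltest /dsgetdc:<agentdomain>). A result where IsCloseSite is false, or where DcSite differs from OurSite, suggests the lookup will cross the WAN.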

Then you can mimic the MS query using this PowerShell script:

·         First, it gets the agent SID (this information is returned to the MS by Kerberos, so this query is not actually performed by the MS)

·         Second, it gets the agent properties using ADSI and the agent SID; this is exactly the call made by the MS. If this takes close to or more than 1 second, you will probably encounter the error

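# LDAP path of the Domain Controller to query, e.g. LDAP://dc.child.parent.local
$domain = 'LDAP://<DCFQDN>'
# FQDN of the agent computer as stored in Active Directory (dNSHostName)
$agent = 'FQDN'

# Step 1: find the agent computer object to obtain its SID
$start = [DateTime]::Now

$objSearcher = New-Object System.DirectoryServices.DirectorySearcher
$objSearcher.SearchRoot = [ADSI]$domain
$objSearcher.PageSize = 1000
$objSearcher.Filter = "(&(dNSHostName=$agent)(objectClass=computer))"
$objSearcher.SearchScope = "SubTree"

$colProplist = "name","objectSID"
foreach ($i in $colPropList)
{
    $objSearcher.PropertiesToLoad.Add($i) | Out-Null
}

$objSearcher.CacheResults = $false

$result = $objSearcher.FindOne()

# TotalMilliseconds is used because Milliseconds only holds the 0-999 component
$ts = New-TimeSpan -Start $start -End ([DateTime]::Now)
Write-Host "First LDAP Elapsed milliseconds $([int]$ts.TotalMilliseconds)" -ForegroundColor Yellow

# No result means the agent FQDN was not found through the specified DC
if ($result -eq $null) { throw "Agent $agent not found in Active Directory via $domain" }

$prop = $result.Properties
$sid = $prop.Item('objectSID')
# Convert the binary SID to the hex string expected by the <SID=...> bind syntax
$txtSid = ([System.BitConverter]::ToString($sid[0])).Replace('-', '')

# Step 2: bind to the object by SID, the same lookup the MS performs
$agentLookup = $domain + '/<SID=' + $txtSid + '>'
$start = [DateTime]::Now
$res = [ADSI] $agentLookup
# Force the bind so the lookup really happens before the timer stops
$res | Get-Member | Out-Null
Write-Host 'Got: ' $res.dNSHostName
$ts = New-TimeSpan -Start $start -End ([DateTime]::Now)

Write-Host "LDAP Elapsed milliseconds $([int]$ts.TotalMilliseconds)" -ForegroundColor Red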
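
A single timing can be misleading on a WAN link, so it may help to repeat the lookup a few times and look at the spread. The helper below is a minimal sketch. The name Measure-LookupLatency and its parameters are hypothetical, and the 1000 ms default is the threshold observed in this post rather than a documented limit.

```powershell
# Sketch: Measure-LookupLatency is a hypothetical helper. The 1000 ms default
# reflects the threshold observed in this post, not a documented limit.
function Measure-LookupLatency {
    param(
        [scriptblock]$Lookup,
        [int]$Count = 5,
        [int]$LimitMs = 1000
    )

    $samples = @()
    for ($i = 0; $i -lt $Count; $i++) {
        # Time one execution of the lookup script block
        $watch = [System.Diagnostics.Stopwatch]::StartNew()
        & $Lookup | Out-Null
        $watch.Stop()
        $samples += $watch.Elapsed.TotalMilliseconds
    }

    $stats = $samples | Measure-Object -Minimum -Maximum -Average
    New-Object PSObject -Property @{
        MinMs     = [math]::Round($stats.Minimum)
        MaxMs     = [math]::Round($stats.Maximum)
        AverageMs = [math]::Round($stats.Average)
        # Number of samples at or above the limit
        OverLimit = @($samples | Where-Object { $_ -ge $LimitMs }).Count
    }
}
```

With the variables from the script above it could be used as Measure-LookupLatency -Lookup { ([ADSI]$agentLookup).dNSHostName }. Later samples may be faster than the first one if the connection to the DC is reused, so MaxMs is probably the most telling figure.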

While I was about to publish the script, Stefan commented on this post pointing to a more evolved script with logging, bells and whistles, so I won’t publish mine, but will instead reference Stefan’s on his own blog: http://blog.scomfaq.ch/2012/12/09/scom-2012-event-id-20070-agent-across-slow-wan-links/.

How to solve the issue

As of today, the only way I have found to solve the issue is to install a Domain Controller for the remote domains in the same Active Directory site as the Management Server.

– Daniele

This posting is provided “AS IS” with no warranties, and confers no rights.

  1. #1 by Troy on May 14, 2014 - 7:34 am

    Hi
    had the same case with grayed agents.
    opened port TCP 389 from MP to DC and the problem resolved.

    hope this helps someone
    http://silentcrash.com/

  2. #2 by bloops on December 6, 2012 - 3:30 pm

    Hi Daniele, i am getting an error with your script, i am passing through
    $domain = ‘LDAP://DC=child;DC=parent;DC=local’
    $agent = ‘scomagent.child.parent.local’

    I am getting an error on line $sid = $prop.Item(‘objectSID’) the error is “you cannot call a method in null value expression”. I am assuming i have missed a parameter somewhere but not sure where?. can you help out?

    • #3 by Daniele Grandini on December 7, 2012 - 4:12 pm

      this error means the agent fqdn cannot be found in active directory. The $domain parameter must be the target DC fqdn such as: dc.child.parent.local.

      • #4 by bloops on December 7, 2012 - 6:03 pm

        Hi Daniele, thanks I am passing parameters like so :
        $domain = ‘LDAP://DC.child.domain.local’
        $agent = ‘hostname_of_server’

        but am still getting that error message. Also I had to amend your string replace from .Replace(‘-‘, “) to .Replace(‘-‘,”`””). (put in an escape character).

        when i step through $colresults has data but as soon as i step to a new line the data gets nulled..

      • #5 by Daniele Grandini on December 7, 2012 - 6:13 pm

        the ‘ issue is related to the wordpress publishing, I will post the script on TechNet gallery soon. If you get that error and the DC bind is correct it means the agent computer is not in Active Directory with the specified FQDN.
        – Daniele

    • #6 by scomfaq on December 9, 2012 - 9:55 pm

      Hi Daniele

      I have also written a post about this and modified your script and added some logging functionality. On my blog you can download the script. See here http://blog.scomfaq.ch/2012/12/09/scom-2012-event-id-20070-agent-across-slow-wan-links/

      Currently I am Troubleshooting this issue with Microsoft Support :)

      Cheers,

      Stefan

  3. #7 by George Varakis on December 4, 2012 - 1:50 pm

    Hi Daniele, does the same issue apply with certificate authentication? I am having issues with a gateway server connected through a high latency network, and never really establishing connectivity.

    • #8 by Daniele Grandini on December 5, 2012 - 3:35 pm

      Which is the gateways forest? In a pure certificate based authentication the issue I describe should not occur, but you must remember kerberos authentication is always tried first, so if your gateways are in the same or in a kerberos trusted forest of your MSs then you’re probably hit by this issue.
      Let me know.

      -Daniele

      • #9 by George Varakis on December 5, 2012 - 10:38 pm

        The gateway is a workgroup computer, not related in any other way to the MS. Pure certificate authentication, was working fine while it was well connected (40-50ms round trip) but wont connect on the MS when placed on a high latency location.
        thanks

        George

      • #10 by Daniele Grandini on December 7, 2012 - 4:14 pm

        Hi George,
        I think you’re facing a different issue. Which error gets logged on the gateway server? Did you take a network trace just to be sure there isn’t any firewall blocking the gateway communication?

  4. #11 by scomfaq on November 15, 2012 - 7:26 pm

    Hi Daniele

    Thanks a lot for your post.

    I am currently facing exactly this problem. I have several times run the second query on different Systems and I cannot say for sure what amount of miliseconds are bad or good. Do you have any update on this issue or any new hints you could provide? Did you call Microsoft about this behavior?

    Regards,

    Stefan

    • #12 by Daniele Grandini on November 15, 2012 - 7:56 pm

      Hi Stefan,
      I have only bad news. I tried to escalate the issue to Microsoft via CSS, via Connect and privately, but they insist it is a borderline issue not worth a fix. I solved the issue by creating the appropriate DCs “near” the MSs. I would suggest opening a CSS ticket; if they get enough calls they may reconsider their decision. They should have my record, so it should not be too difficult to get through CSS (they will start with all kinds of basic questions and testing)… good luck

  5. #13 by Jonathan Almquist on October 26, 2012 - 10:42 pm

    Thanks for the information. Hopefully I remember this if I see this problem elsewhere.
