This article originally started off life as a record of how I managed to get this working, as a lot of my posts do, but this time it appears I am foiled.
Last week, I had 3 vCenter Servers that appeared to be happily talking to each other in Linked Mode, sharing a single Multi-site SSO domain without any real issues. I had a single-pane-of-glass view of all 3 and I could manage them all from the one client. The reason for the 3 vCenter servers was segregation of LAN and DMZ networks: vCenter001 was in the LAN, vCenter002 sat in DMZ1 and vCenter003 sat in DMZ2.
At the weekend I rebuilt vCenter003 as a scheduled upgrade from Server 2008 to 2008R2 and from vCenter 5.1a to 5.1b. However, when it came to joining the Linked Mode group, the join failed each and every time with the error:
—————- Operation “Join instance VMwareVCMSDS” failed:
Action: Join Instance
Action: Join Instance
Action: Create replica instance
Action: Create Instance
Problem: Creation of instance VMwareVCMSDS failed: Active Directory Lightweight Directory Services could not create the NTDS Settings object for this Active Directory Lightweight Directory Services instance CN=NTDS Settings,CN=vCenter003$VMwareVCMSDS,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,CN={411111112-F11C-4113-9117-91111111111113} on the remote AD LDS instance vCenter001.definit.local:389. Ensure the provided network credentials have sufficient permissions.
Error code: 0x800706ec
The list of RPC servers available for the binding of auto handles has been exhausted.
—————- Recovering from failed Operation “Join instance VMwareVCMSDS”
—————- Recovery successful
—————- Execution error.
I carefully checked the vCenter 5.1 Linked Mode prerequisites document but did not find any problem with my setup. I had in fact completed this process successfully twice before when connecting vCenter002 and the old vCenter003. After a while of troubleshooting I involved my friendly neighbourhood network support analyst who went through the ports listed by VMware with me again, and we found that the RPC Endpoint Mapper (port 135) and the high range dynamic ports were being blocked. We agreed to open 135 and a range for the dynamic ports.
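A quick way to confirm from the joining server that the fixed ports are reachable is a plain TCP connect test. The sketch below is only an illustration and not part of VMware's checklist: the function name `check_tcp_port` is my own, and it assumes Python is available on the server. A successful connect only shows that the port is open through the firewall; it says nothing about the dynamic ports that RPC negotiates afterwards.

```python
import socket


def check_tcp_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests that the port
    accepts a connection, not that the service behind it is healthy.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False
```

In this environment it would be called as `check_tcp_port("vCenter001.definit.local", 135)` for the RPC Endpoint Mapper and `check_tcp_port("vCenter001.definit.local", 389)` for the AD LDS instance.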
Initially I limited this to a range of 1000 ports, however it quickly became clear that the rather large vCenter001 server had consumed these in seconds (each outbound connection uses a port from the dynamic pool, so each connection to a host would consume a port - multiply that by each destination port and add in the database connections and the pool needed to be considerably larger). In the end, 20,000 ports were opened for the use of dynamic ports.
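The back-of-envelope sum behind that decision can be sketched as follows. The host count, connections per host and database connection figures are invented for illustration and are not measurements from vCenter001; the function names and the 25% headroom are likewise my own assumptions.

```python
def estimate_dynamic_ports(hosts, connections_per_host, database_connections):
    """Rough count of dynamic (ephemeral) ports in use at once.

    Each outbound connection takes one port from the dynamic pool, so the
    estimate is simply host connections plus database connections.
    """
    return hosts * connections_per_host + database_connections


def pool_is_sufficient(required, pool_size, headroom=0.25):
    """True if the pool covers the requirement plus a safety margin."""
    return required * (1 + headroom) <= pool_size


# Hypothetical figures: 200 hosts, 6 connections each, 50 database connections
required = estimate_dynamic_ports(200, 6, 50)
print(required)                             # 1250
print(pool_is_sufficient(required, 1000))   # False
print(pool_is_sufficient(required, 20000))  # True
```

Even with modest made-up numbers like these, a pool of 1000 ports is used up straight away, which matches what we saw.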
At this point I was still unable to join the Linked Mode group, and the network traces were not showing any blocked network traffic. It did not appear to be a network issue!
We also tested using an IPSec tunnel between the two vCenter servers to allow all traffic without inspection at the firewall end; this is in fact how vCenter002 and vCenter001 are configured.
I engaged with VMware Tech Support and brought the technician up to speed. He asked me to go through the Troubleshooting vCenter Linked Mode document, which is a great “checklist” for verifying the requirements for Linked Mode.
In running the checks we came across lots of errors similar to:
1772 The list of RPC servers available for the binding of auto handles has been exhausted
and
DsBindWithCred to vCenter001:389 failed with status 1753 (0x6d9): There are no more endpoints available from the endpoint mapper.
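It is worth noting that these are the same errors in different clothes. The 0x800706ec from the original join failure is an HRESULT wrapping Win32 error 1772, and 0x6d9 is simply 1753 in hex. The sketch below shows the decoding; the function name is my own, and it assumes the standard HRESULT layout (severity in the top bit, facility in bits 16 to 26, code in the low 16 bits).

```python
FACILITY_WIN32 = 7  # facility value used when a Win32 error is wrapped in an HRESULT


def win32_code_from_hresult(hresult):
    """Return the Win32 error code inside a failure HRESULT.

    Returns None if the value is not a failure HRESULT with FACILITY_WIN32.
    Hypothetical helper for illustration only.
    """
    is_failure = bool(hresult & 0x80000000)  # severity bit
    facility = (hresult >> 16) & 0x7FF       # 11-bit facility field
    if is_failure and facility == FACILITY_WIN32:
        return hresult & 0xFFFF              # low 16 bits hold the code
    return None


print(win32_code_from_hresult(0x800706EC))  # 1772
print(hex(1753))                            # 0x6d9
```

So the join failure and the checklist errors all point at the same RPC problem rather than at several separate faults.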
His initial diagnosis was that the domain trust between the DMZ domain and the LAN domain was broken; however, the trust was verified as working when we ran through the document. The problem appeared to be related to RPC and the AD LDS (ADAM) setup. He went away to escalate the issue to the Escalation Engineers and came back to me today. Hopefully he won’t mind me quoting his response:
I have looked for any Kb which states officially that nesting the VC in the DMZ and using ipsec tunneling is unsupported and I cannot locate any such document , I have fully discussed this also with Escalation Engineers whom also confirm that what I am saying is correct , VMware have never certified or tested the type of deployment
VMware’s guidance is to either use a LAN side vCenter server to manage the hosts through a DMZ, or to continue with the 3 vCenter servers as stand-alone servers.
At this point, troubleshooting the RPC error and the join problem becomes moot: I can’t run an unsupported configuration in a live environment! I don’t know why the new server would not join the Linked Mode group. Certainly the default range for dynamic ports changed in 2008R2, and there is definitely a problem with the AD LDS installation on vCenter001.
Testing this setup through a firewall is definitely in the plan for my lab at some point and will definitely fall in the “unsupported” category, but it now leaves me with the task of pulling apart vCenter001 and 002 and ensuring they are in a stand-alone configuration.