I rebuilt an ESX host in my HA/DRS cluster today, following my build procedure to configure it as per VMware best practices and internal guidelines. When the host was fully configured and up-to-date, I added it to the cluster and enabled HA and DRS. Then I went to generate some DRS recommendations to balance the load and ease off my overstretched host, but no recommendations were made.
I couldn’t manually migrate any VMs either – it was odd, because both hosts were added into the cluster, and could ping and vmkping each other from the console.
I also received email alerts:
[VMware vCenter - Alarm Host error] Error detected on [HOST] in [Data Center]: Agent can’t send heartbeats.msg size: 1266, sendto() returned: Operation not permitted
It turned out that the default VMKernel port was named slightly differently on each host: one was named “VMKernel” and the other “VMKernel 2”. That mismatch stopped the migrations, and hence DRS. The hosts would add into the cluster OK, and DRS actually showed as “imbalanced” on the Cluster summary screen; it was just the DRS recommendations and vMotion which wouldn’t work.
With the VMKernels renamed to exactly the same thing, DRS kicked off no problem, as did a manual migration.
So the moral of the story is this: name ALL networks in the same cluster identically. It makes sense when you think that the VM needs to see its Virtual Machine Network on each host – why should the Service Console and VMKernel be any different?
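A quick way to catch this kind of mismatch before adding a host to the cluster is to compare the network names on each host. The sketch below is only an illustration: the host names (esx01, esx02) and the find_mismatched_names helper are made up, and the name lists are typed in by hand rather than pulled from vCenter. It assumes names must match exactly, character for character.

```python
from typing import Dict, Set


def find_mismatched_names(hosts: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each host, return the network names that exist elsewhere in the
    cluster but are missing on that host. Matching is exact and case-sensitive.
    Hosts that have every name are left out of the result."""
    # Every network name seen on any host in the cluster.
    all_names: Set[str] = set().union(*hosts.values()) if hosts else set()
    result: Dict[str, Set[str]] = {}
    for host, names in hosts.items():
        missing = all_names - names
        if missing:
            result[host] = missing
    return result


# Hypothetical inventory, typed in by hand for illustration.
cluster = {
    "esx01": {"Service Console", "VMKernel", "Virtual Machine Network"},
    "esx02": {"Service Console", "VMKernel 2", "Virtual Machine Network"},
}

for host, names in sorted(find_mismatched_names(cluster).items()):
    print(f"{host} is missing: {sorted(names)}")
```

With the inventory above, each host is reported as missing the other's VMKernel name, which is the situation I hit. An empty result means every name is present on every host.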