Best Practices for vSphere HA

Most VMware admins already know what vSphere High Availability (HA) is and how useful it is. But not everyone knows the best way to configure HA for their environment. A majority of admins just enable HA and leave all the default settings. The default settings are not bad, but since you know how your environment is designed, you should be able to tune the parameters to minimize your downtime. Duncan Epping recently published a GitBook, the HA Deepdive, and it has loads of really useful information. You can find the book here:

I just finished the book and thought it would be a good idea to list some of the key highlights and best practices mentioned in it.

  1. Although HA is not dependent on DNS, it is still recommended to register the hosts with their FQDNs for ease of operations/management.
  2. Ensure syslog is correctly configured and log files are offloaded to a safe location to offer the possibility of performing a root cause analysis in case disaster strikes.
  3. In stateless environments, ensure vCenter and Auto Deploy are highly available as recovery time of your virtual machines might be dependent on them.
  4. Understand the impact of virtualizing vCenter. Ensure it has high priority for restarts and ensure that services which vCenter Server depends on are available: DNS, AD and database.
  5. To maximize the chance of restarting virtual machines after a failure we recommend masking datastores on a cluster basis. Although sharing of datastores across clusters will work, it will increase complexity from an administrative perspective.
  6. The master election process is simple but robust. The host that is participating in the election with the greatest number of connected datastores will be elected master. If two or more hosts have the same number of datastores connected, the one with the highest Managed Object Id will be chosen. This, however, is done lexically; meaning that 99 beats 100 as 9 is larger than 1. One thing to stress here though is that slaves do not communicate with each other after the master has been elected unless a re-election of the master needs to take place.
  7. Like the master to slave communication, all slave to master communication is point to point. HA does not use multicast.
  8. Network heartbeat is key for determining the state of a host. Ensure the management network is highly resilient to enable proper state determination.
  9. In a metro-cluster / geographically dispersed cluster we recommend setting the minimum number of heartbeat datastores to four. It is recommended to manually select site local datastores, two for each site.
  10. Datastore heartbeat adds a new level of resiliency but is not the be-all end-all. In converged networking environments, the use of datastore heartbeat adds little value due to the fact that a NIC failure may result in both the network and storage becoming unavailable.
  11. Virtual machines can be dependent on the availability of agent virtual machines or other virtual machines. Although HA will do its best to ensure all virtual machines are started in the correct order, this is not guaranteed. Document the proper recovery process.
  12. Configuring restart priority of a virtual machine is not a guarantee that virtual machines will actually be restarted in this order. Ensure proper operational procedures are in place for restarting services or virtual machines in the appropriate order in the event of a failure.
  13. Before upgrading an environment to later versions, ensure you validate the best practices and default settings. Document them, including justification, to ensure all people involved understand your reasons.
  14. Select a reliable secondary isolation address. Try to minimize the number of “hops” between the host and this address.
  15. Without access to shared storage, a virtual machine becomes useless. It is highly recommended to configure VM Component Protection (VMCP) to act on Permanent Device Loss (PDL) and All Paths Down (APD) scenarios. We recommend setting both to “power off and restart VMs” but leaving the “response for APD recovery after APD timeout” disabled so that VMs are not rebooted unnecessarily.
  16. 10GbE is highly recommended for vSAN, as vSphere HA also leverages the vSAN network. The availability of VMs is dependent on network connectivity, so ensure that at a minimum two 10GbE ports and two physical switches are used for resiliency.
  17. Take advantage of some of the basic features vSphere has to offer like NIC teaming. Combining different physical NICs will increase overall resiliency of your solution.
  18. Know your network environment, talk to the network administrators and ensure advanced features like Link State Tracking are used when possible to increase resiliency.
  19. Be really careful with reservations; if there’s no need to have them on a per virtual machine basis, don’t configure them. If reservations are needed, resort to resource pool based reservations.
  20. Avoid using advanced settings to decrease the slot size as it could lead to more downtime and adds an extra layer of complexity. If there is a large discrepancy in size and reservations we recommend using the percentage based admission control policy.
  21. When using admission control, balance your clusters and be conservative with reservations as it leads to decreased consolidation ratios.
  22. Although HA will utilize DRS to try to accommodate the resource requirements of the virtual machine, a guarantee cannot be given. Do the math; verify that any single host has enough resources to power on your largest virtual machine. Also, take restart priority into account for this virtual machine.
  23. Admission control guarantees enough capacity is available for virtual machine failover. As such we recommend enabling it.
  24. Do the math, and take customer requirements into account. We recommend using a “percentage” based admission control policy, as it is the most flexible.
  25. In order to avoid wasting resources, we recommend carefully selecting your N+X resiliency architecture. Calculate the required percentage based on this architecture.
  26. VM and Application monitoring can substantially increase availability. It is part of the HA stack and we strongly recommend using it!
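
Two of the points above are easy to get wrong in your head, so here is a minimal Python sketch of them: the master election tie-break from point 6 and the "do the math" step for an N+X design from point 25. This is only an illustration, not how HA is implemented. The `Host` class, the function names and the `host-NN` style ids are made up for the example, and the percentage calculation assumes all hosts in the cluster are the same size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical stand-in for a host taking part in an election."""
    moid: str                  # Managed Object Id (illustrative format)
    connected_datastores: int  # number of datastores the host can reach


def elect_master(hosts):
    """Pick the host with the most connected datastores; break ties on
    the lexically highest Managed Object Id (plain string comparison)."""
    return max(hosts, key=lambda h: (h.connected_datastores, h.moid))


def failover_percentage(total_hosts, host_failures_to_tolerate):
    """Percentage of cluster resources to reserve for an N+X design.
    Assumption: every host contributes the same amount of resources."""
    if not 0 < host_failures_to_tolerate < total_hosts:
        raise ValueError("X must be at least 1 and smaller than the host count")
    return 100.0 * host_failures_to_tolerate / total_hosts


hosts = [Host("host-100", 6), Host("host-99", 6), Host("host-42", 4)]
print(elect_master(hosts).moid)   # host-99 ("9" sorts after "1")
print(failover_percentage(8, 1))  # 12.5
```

In this example, `host-99` wins over `host-100` because the ids are compared as strings, not as numbers. For the admission control part, an 8-host cluster that should survive one host failure needs 12.5% of its resources reserved; if your hosts are not equally sized, base the calculation on the largest host instead.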
I hope this was helpful and made you think about how you can improve your environment. If you want to learn more, go back to the GitBook that Duncan published, and let me know if I missed something. Drop a comment below if you have been through an HA event and learned any of this the hard way.