VMware


August 08, 2008

Top Tips for Deploying VI, part 2

 1. If you have an active/passive FC storage array (most mid-range arrays fall into this bucket), be careful about setup. Firstly, be sure to have redundant paths from FC switches to your arrays’ storage processors. Secondly, be sure to use “MRU” (the default) for the path-selection policy and not “fixed”.

The best way to explain the first issue is with a picture.  What’s wrong with the following configuration?

Vitips21

Although you might believe that you have full redundancy between the hosts and the switches, and specifically that you can survive one HBA failure on each host, the reality is that you don’t have enough redundancy.  Here’s one failure scenario that won’t be handled properly:

Vitips22

The reason is that, with active/passive storage arrays, a given LUN can only be presented on one storage processor at a given time. The LUN can shift from one storage processor to another, but such a shift takes many seconds (potentially up to 30 seconds). If both HBAs have failed (as in the above diagram), then the ESX hosts won’t be able to access the same LUN at the same time. Host 1 attempts to access the LUN on storage processor 1; host 2 attempts to access the same LUN on storage processor 2; and you end up with a ping-pong effect, or “path thrashing”, as the active/passive array shifts the LUN back and forth between the two storage processors. Performance of VMs on both hosts will be erratic and degraded.
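The ping-pong effect can be illustrated with a toy model. The sketch below is not how any real array is implemented; the class and function names are hypothetical, and it simply assumes that an I/O arriving on the non-owning storage processor (SP) forces the array to move the LUN.

```python
from dataclasses import dataclass


@dataclass
class ActivePassiveLun:
    """Toy model: a LUN owned by exactly one storage processor at a time."""
    owner: str       # SP currently presenting the LUN
    shifts: int = 0  # number of times ownership moved between SPs

    def access_via(self, sp: str) -> None:
        # Assumption: I/O on the non-owning SP makes the array move the LUN.
        if sp != self.owner:
            self.owner = sp
            self.shifts += 1


def count_shifts(host_to_sp: dict[str, str], rounds: int) -> int:
    """Interleave I/O from all hosts for `rounds` rounds; count ownership shifts."""
    lun = ActivePassiveLun(owner="SP1")
    for _ in range(rounds):
        for host in sorted(host_to_sp):
            lun.access_via(host_to_sp[host])
    return lun.shifts


# Failure scenario: each host's only surviving path leads to a different SP.
thrashing = count_shifts({"host1": "SP1", "host2": "SP2"}, rounds=5)

# With redundant switch-to-SP links, both hosts can still reach the same SP.
stable = count_shifts({"host1": "SP1", "host2": "SP1"}, rounds=5)

print(thrashing, stable)
```

In this model the LUN moves on almost every access when the two hosts are pinned to different SPs, and never moves when both hosts can reach the same SP. On a real array each of those moves can take many seconds.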

The solution is simple: create redundant connections from the FC switches to the array storage processors, as shown below.

Vitips23

There is a second noteworthy issue with active/passive arrays related to this same path thrashing effect: make sure that you use the “MRU” path selection policy (the default) rather than the “fixed” path selection policy. If you use “fixed”, you may make the mistake of forcing one host to use a particular storage processor and another host to use a different one, and thus end up in a similar LUN ping-pong or path thrashing situation.

For more details about path thrashing, see the SAN Configuration Guide.
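If you do inherit a cluster that uses the “fixed” policy, one way to spot the mistake described above is to compare the storage processor each host is pinned to for each LUN. The sketch below is illustrative only: the input mapping, host names, and LUN names are hypothetical, and you would have to collect the real values from your own hosts.

```python
from collections import defaultdict


def find_conflicting_luns(preferred_sp):
    """Return the LUNs whose hosts are pinned to different storage processors.

    preferred_sp maps (host, lun) -> name of the SP behind that host's
    fixed preferred path (hypothetical inventory data).
    """
    sps_per_lun = defaultdict(set)
    for (_host, lun), sp in preferred_sp.items():
        sps_per_lun[lun].add(sp)
    # More than one distinct SP for a LUN means hosts will pull it back and forth.
    return sorted(lun for lun, sps in sps_per_lun.items() if len(sps) > 1)


inventory = {
    ("host1", "LUN0"): "SP1",
    ("host2", "LUN0"): "SP2",  # disagrees with host1: thrashing risk
    ("host1", "LUN1"): "SP1",
    ("host2", "LUN1"): "SP1",
}

conflicts = find_conflicting_luns(inventory)
print(conflicts)
```

Any LUN reported here is a candidate for path thrashing on an active/passive array.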

2. When configuring your VI environment for VMotion, make sure that your physical network switches are configured properly; in particular, make sure that each port has the right network (e.g. VLAN) visibility. 

VMotion requires that the destination ESX host have similar network connectivity to the source ESX host (so that, for example, the VM can continue to access its assigned VLAN after the VMotion). VirtualCenter checks for correct virtual switch configuration on the source and destination ESX hosts; however, VirtualCenter does not check for correct configuration of the physical network switches. In a larger VI deployment where many network switch ports are involved, a single misconfigured physical switch port can be hard to detect. The symptom will be as follows: when a VM that relies on a particular VLAN ID is migrated by VMotion to the ESX host with the misconfigured switch port, the VM loses all network connectivity. Solution: when adding new ESX hosts to a network, take the time to double-check your network switch port configurations to make absolutely sure that all the VLANs are correctly configured.
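The double-check can be partly automated if you keep (or can export) a record of which VLANs each physical switch port carries. The following is a minimal sketch under that assumption; the host names, port names, VLAN IDs, and data layout are all hypothetical and do not come from any VMware or switch vendor tool.

```python
def find_port_gaps(required_vlans, port_vlans):
    """List every (host, switch_port, missing_vlans) where a port lacks a VLAN.

    required_vlans: set of VLAN IDs that VMs in the cluster rely on.
    port_vlans: {host: {switch_port: set of VLAN IDs allowed on that port}}.
    """
    gaps = []
    for host in sorted(port_vlans):
        for port in sorted(port_vlans[host]):
            missing = sorted(required_vlans - port_vlans[host][port])
            if missing:
                gaps.append((host, port, missing))
    return gaps


required = {10, 20}
ports = {
    "esx01": {"sw1/0/1": {10, 20}, "sw2/0/1": {10, 20}},
    "esx02": {"sw1/0/2": {10, 20}, "sw2/0/2": {10}},  # VLAN 20 was forgotten
}

gaps = find_port_gaps(required, ports)
for host, port, missing in gaps:
    print(f"{host} {port}: missing VLANs {missing}")
```

Every uplink port is checked individually because a VM's traffic may leave the host on any of them, so a single port without the VLAN is enough to cause the symptom above.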

3. When using VMware HA, take note of how memory reservations are specified and used to reserve cluster failover capacity.  Using more consistent reservations or disabling admission control are both appropriate workarounds if the calculations are overly conservative in your environment.

How VMware HA works: If a VMware ESX host fails, VMware HA will restart the VMs affected by that failure on alternate hosts in the cluster.  In order to do so, HA must reserve failover capacity within the cluster.  HA currently achieves this by implementing an “admission control” policy that prevents (or warns against) the powering on of VMs that would encroach upon the failover capacity being reserved.  In some cases, however, the admission control calculations may be too conservative.

Example scenario: Suppose you have 19 VMs, each with a 300 MB memory reservation. To power on all of these VMs, you need 5.7 GB of RAM (= 19 x 0.3) (total within the cluster, after allocating space for potential host failures, and not accounting for memory sharing in ESX). Since all reservations are equivalent, HA defines an average VM to require 300 MB of memory.

Now, let's suppose you power on a 20th VM with a 2 GB memory reservation. Instead of calculating memory requirements as 7.7 GB (= 19 x 0.3 + 1 x 2), HA takes a more conservative approach and redefines the average VM to be the biggest reservation observed. With the higher reservation specified, HA will cautiously assume that every VM needs 2 GB of memory, and will ask for 40 GB (= 20 x 2) of RAM to be set aside for total runtime and failover capacity within the cluster. These calculations are intended to be conservative to ensure that sufficient spare capacity is available, without fragmentation across hosts within a cluster.
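The arithmetic in this example can be reproduced in a few lines. This is only a sketch of the calculation as described in this post, not the actual HA admission control algorithm; the function names are made up, and it uses decimal units (1 GB = 1000 MB) to match the figures above.

```python
def summed_reservations_mb(reservations_mb):
    """Total memory if each VM is counted at its own reservation."""
    return sum(reservations_mb)


def conservative_estimate_mb(reservations_mb):
    """Total memory if every VM is assumed to need the largest reservation."""
    return max(reservations_mb) * len(reservations_mb)


# 19 VMs with a 300 MB reservation each.
reservations = [300] * 19
before_sum = summed_reservations_mb(reservations)
before_conservative = conservative_estimate_mb(reservations)

# Power on a 20th VM with a 2 GB reservation.
reservations.append(2000)
after_sum = summed_reservations_mb(reservations)
after_conservative = conservative_estimate_mb(reservations)

print(before_sum, before_conservative)  # 5700 5700
print(after_sum, after_conservative)    # 7700 40000
```

While all reservations are equal the two numbers agree. One outlier reservation is enough to push the conservative estimate from 5.7 GB to 40 GB, even though the sum of the reservations only grows to 7.7 GB.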

In many cases (such as clusters with widely varying sizes of hosts and VMs), however, these calculations can be more conservative than desirable, and can lead to “insufficient failover capacity” warnings when powering on more VMs.

Two potential approaches are recommended if you are observing these warnings, or want to avoid them within a heterogeneous cluster configuration:

Approach 1: Either lower the reservations on your most demanding VMs, or remove the reservations skewing the calculations and rely upon “shares” instead. See the resource management guide for differences between reservations and shares.

Approach 2:  Alternatively, configure HA to disable strict admission control.  Host failures will still be detected and acted upon, but VMware HA will not prevent the starting of new VMs due to insufficient failover capacity.

4. When sizing your LUNs, a medium-sized LUN (~500GB) seems best for most situations.   

Small LUNs (and VMFS volumes) can result in SAN management complexity (too many LUNs to manage). Very large LUNs can result in performance issues and in too coarse a granularity for troubleshooting, performance tuning, and failure/error isolation. The chart below summarizes some of the considerations. Details are provided on page 72 of the VI 3 SAN Design Guide.

| Consideration | Smaller LUN / VMFS volume (100GB) | Medium-sized LUN / VMFS volume (500GB) | Larger LUN / VMFS volume (3TB) |
|---|---|---|---|
| VMFS: Metadata overhead | Some overhead (0.5%) | Negligible overhead (<0.1%) | Negligible overhead (<0.1%) |
| Impact of a failure or error, difficulty of troubleshooting | Affects a few VMs | Affects 20-30 VMs | Affects many VMs |
| Ease of SAN mgmt | Hard (many LUNs to manage) | Medium | Easy (just 1 LUN to manage) |
| Ease of tuning performance (**) | High (tunable per the few VMs on a LUN) | Medium (tunable for 20-30 VMs at a time) | Low (one setting for many, many VMs) |
| Flexibility in specifying value-added services (***) | High (different LUNs can have different policies or settings) | Medium (tunable for 20-30 VMs at a time) | Low (many VMs share the same policies or settings) |

(*) File creation in VMFS grabs a SCSI lock on the LUN. Excessive concurrent file creation in VMFS can cause lock contention, which can hurt performance. This can be apparent if multiple users are concurrently creating VMs (and therefore VMFS files), or when a VCB-based backup process is concurrently backing up multiple VMs (and is therefore concurrently creating multiple VMFS REDO files).
(**) e.g. RAID-level, array caches, queue depths, path selection/path dedication
(***) e.g. Backup, other data protection features such as replication, mirroring, etc., capacity optimization features such as de-dupe, thin-provisioning, etc., security and encryption features
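To get a feel for the trade-off in the chart, the rough calculation below estimates how many VMs share each LUN and how many LUNs you would have to manage. It is a back-of-the-envelope sketch, not sizing guidance: the 20 GB average VM footprint and the 150-VM total are assumptions chosen only so that a 500GB LUN holds roughly the 20-30 VMs shown in the chart, and 3TB is treated as 3000 GB.

```python
import math


def vms_per_lun(lun_gb, avg_vm_gb):
    """How many VMs of the assumed average size fit on one LUN."""
    if avg_vm_gb <= 0 or lun_gb < avg_vm_gb:
        raise ValueError("LUN must be at least as large as one VM")
    return lun_gb // avg_vm_gb


def luns_needed(total_vms, lun_gb, avg_vm_gb):
    """How many LUNs of this size are needed to hold all the VMs."""
    return math.ceil(total_vms / vms_per_lun(lun_gb, avg_vm_gb))


AVG_VM_GB = 20    # assumed average VM footprint (hypothetical)
TOTAL_VMS = 150   # assumed environment size (hypothetical)

for lun_gb in (100, 500, 3000):
    per_lun = vms_per_lun(lun_gb, AVG_VM_GB)
    count = luns_needed(TOTAL_VMS, lun_gb, AVG_VM_GB)
    print(f"{lun_gb} GB LUNs: {per_lun} VMs per LUN, {count} LUNs to manage")
```

Under these assumptions the small LUNs leave you with 30 LUNs to manage, while the single large LUN puts all 150 VMs behind one set of policies and one failure domain.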

See also Top Tips for Deploying VI, part 1

--The VI Team

Comments

Is there any way to view what reservation has been set aside within the cluster for HA, or do you have to rely on using the formula to come up with your own numbers?

I was really surprised to read about the HA admission control behaviour; it adds a lot to my knowledge.
