vSAN

Reduce migration time to vSAN with svMotion / xvMotion and greater parallelization

Speed up Simple VM Migrations to vSAN

A few customers have asked me for recommendations on efficiently migrating their production VMs from traditional datastores to vSAN without disruption and within their maintenance window.  I enlisted the help of Christian Dickmann, Ravi Soundararajan, Zhelong Pan and Arun Ramanthan.  Arun shared this valuable nugget on how to run more storage vMotions (any vMotion-type migration that requires copying the VM’s storage) in parallel.  Two limits come into play, and both can be changed.

 

Defaults:

8 svMotions per datastore
2 svMotions per host

 

Reasoning behind the static limits:

Storage vMotions require copying all of the VM’s files from the source to the destination.  The default limits are deliberately low so that storage requests from running VMs stay fast while storage vMotions take place; the limits are not dynamic based on VMs’ storage I/O latencies… yet.  For the use case of a one-time migration, the developers have no problem with us raising the limits. We should keep an eye on vSAN and VM performance stats to make sure the cluster or source datastore isn’t being overwhelmed by too many concurrent svMotion jobs. Using NIOC shares is also a good idea if vSAN and vMotion traffic share pNICs.

 

How to tune the limits:

If migrating from a single traditional datastore or a small number of hosts to vSAN, consider tuning these advanced settings on both clusters.

Keep an eye on vSAN latencies, bandwidth, IOPS, congestion, and outstanding I/Os to make sure it’s running well. It’s a good idea to watch the disk group write buffer percentage full/free to gauge when to stop increasing the limits. Also keep an eye on the source datastore’s performance if you modify settings on the source hosts as well.

Kick off SvMotions and see how the I/O latency of the datastore is affected by the new read/write streams. If you don’t see a significant increase in latency, queueing, etc., then you can increase the limits further.

 

How to change the limits:

NOTE: Change these advanced settings only if necessary. If issues arise, call GSS and revert to the defaults.

You can use the following config options to set new limits.

1. config.vpxd.ResourceManager.costPerSVmotionESX6x

This is the ESX host cost per SvMotion. The default value is 8.

The max cost allowed per ESX host is 16, so 2 SvMotions per host (16 / 8 = 2).

For example, if you reduce the ESX host cost per SvMotion from the default of 8 to 4, you can run 4 SvMotions per ESX host (16 / 4 = 4).

 

2. config.vpxd.ResourceManager.CostPerEsx6xSVmotion

This is the datastore cost per SvMotion. The default is 16.

The max cost per datastore is 128, so 8 Storage vMotions per datastore (128 / 16 = 8).

For example, if you reduce the datastore cost per SvMotion from the default of 16 to 8, you can run 16 SvMotions in parallel on a given datastore (128 / 8 = 16).
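
To make the cost arithmetic concrete, here is a minimal sketch in plain Python. The constants are the values quoted above; the function names are purely illustrative and not part of any VMware API.

```python
# Minimal sketch of the svMotion cost model described above.
# Constants are the values quoted in this post; function names are illustrative.

MAX_COST_PER_HOST = 16        # fixed max cost per ESX host
MAX_COST_PER_DATASTORE = 128  # fixed max cost per datastore

def svmotions_per_host(cost_per_svmotion_host=8):
    """Concurrent SvMotions a single ESX host will accept."""
    return MAX_COST_PER_HOST // cost_per_svmotion_host

def svmotions_per_datastore(cost_per_svmotion_ds=16):
    """Concurrent SvMotions a single datastore will accept."""
    return MAX_COST_PER_DATASTORE // cost_per_svmotion_ds

print(svmotions_per_host())        # 2  (defaults: 16 / 8)
print(svmotions_per_host(4))       # 4  (16 / 4)
print(svmotions_per_datastore())   # 8  (defaults: 128 / 16)
print(svmotions_per_datastore(8))  # 16 (128 / 8)
```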

 

These are advanced config options in vCenter, and you can set them with the web client. Notice that smaller cost values result in more concurrency.

Whichever limit is reached first, per host or per datastore (totaled across all the source datastores or on the vSAN cluster), is the overall limit.

Experiment with these config options on the ESX hosts and datastores, while watching the performance stats, to decide on a reasonable limit that works for your environment.

If you want to set the limits to the max of 128 per datastore (vSAN cluster) and 16 per host, you would set both costs to 1: costPerSVmotionESX6x = 1 (16 / 1 = 16 per host) and CostPerEsx6xSVmotion = 1 (128 / 1 = 128 per datastore).
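
If you would rather script the change than click through the web client, something along these lines should work with pyVmomi against vCenter’s OptionManager. This is only a sketch under the assumption that the option names are exactly as listed above; adjust the connection details for your environment, test it first, and remember to revert the values when the migration is done.

```python
# Sketch: set the svMotion cost options on vCenter via pyVmomi.
# Assumes the option names exactly as quoted in this post; the host,
# user, and password are placeholders for your environment.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only; validate certs in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)

option_mgr = si.content.setting  # vCenter advanced settings (OptionManager)

new_values = [
    vim.option.OptionValue(key="config.vpxd.ResourceManager.costPerSVmotionESX6x",
                           value=1),   # 16 / 1 = 16 SvMotions per host
    vim.option.OptionValue(key="config.vpxd.ResourceManager.CostPerEsx6xSVmotion",
                           value=1),   # 128 / 1 = 128 SvMotions per datastore
]
option_mgr.UpdateOptions(changedValue=new_values)

# Revert to the defaults (8 and 16) once the migration is complete.
Disconnect(si)
```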

 

Example:

Let’s say we have 3 source VMFS datastores, each with 50 VMs. All 3 datastores have the default static limit of 8 parallel SvMotions.

We have 1 destination vSAN datastore. We increase the static limit to 20 SvMotions in parallel for this vSAN datastore.

 

Now, if the customer starts migrating 50 VMs from the 1st VMFS datastore to the vSAN datastore, 8 parallel SvMotions will be started and the remaining 42 VMs will be queued in vCenter: the concurrency is min(parallel SvMotions allowed on the source datastore, parallel SvMotions allowed on the destination datastore) = min(8, 20) = 8.

Immediately after, the customer starts migrating 50 VMs from the 2nd VMFS datastore to the same vSAN datastore. Since vSAN can support 20, it could take another 12, but the source side can only handle 8. So 8 more SvMotions will be started in parallel and the remaining 42 VMs will be queued.

If we follow this by migrating VMs from the 3rd VMFS datastore, only 4 more SvMotions will be kicked off, as vSAN has just 4 slots left even though the 3rd VMFS datastore is still at its default of 8.

This holds for this case as long as all the source datastores’ VMs are being migrated to one destination vSAN datastore. If it is a many-to-many migration, then it depends on where each migration happens.

 

Another important point in the above example is the ESX host static limit of 2 SvMotions. If all 50 VMs are on a single ESX host, only 2 SvMotions will run in parallel, because they are subject to the host limit. If the VMs are distributed across many (tens of) ESX hosts, then 8 in parallel is possible (assuming each host has VMs on only 1 datastore and the source datastore cost setting is still the default).
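
To tie the source, destination, and host limits together, here is a small, purely illustrative Python walkthrough of the example above. Host limits are left out for simplicity; the per-host limit of 2 would cap the numbers further if the VMs were concentrated on a few hosts.

```python
# Illustrative walkthrough of the example above: three source VMFS
# datastores (limit 8 each) feeding one vSAN datastore whose limit
# was raised to 20. Host limits are ignored here for simplicity.

SRC_LIMIT = 8    # default per-datastore limit on each VMFS source
DST_LIMIT = 20   # raised limit on the destination vSAN datastore
VMS_PER_DS = 50  # VMs queued for migration from each source datastore

vsan_slots_free = DST_LIMIT
for ds in ("VMFS-1", "VMFS-2", "VMFS-3"):
    started = min(SRC_LIMIT, vsan_slots_free)
    vsan_slots_free -= started
    print(f"{ds}: {started} SvMotions start in parallel, "
          f"{VMS_PER_DS - started} VMs queued, {vsan_slots_free} vSAN slots left")

# Prints 8, 8, then 4 SvMotions started, matching the walkthrough above.
```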

 

Summary

This can potentially speed up migrations to vSAN by running more storage vMotion tasks in parallel.  Use extreme caution: many parallel storage vMotions generate a tremendous amount of storage and network I/O.

When the migration is complete, you should set these parameters back to their defaults so that VM latencies stay low during future SvMotions.