Automating SDDC Scale-Out

Earlier this year, I had the privilege to present VMware Cloud on AWS DRaaS at Cloud Field Day #5. The segment was a full end-to-end product demo of vSphere Site Recovery and featured automating SDDC scale-out on failover.


During the session, one of the Cloud Field Day panelists, Ned Bellavance, asked if the script we used in the demo was published. Publishing it took a little longer than intended, and in the interim, I rewrote the script in Python and Docker as a personal learning exercise. The result is available here.

How does it work?

The sample uses the local IP address to determine if it’s running in the cloud. This works if the VM hosting the script is assigned a different IP address on failover. When it detects that it is running inside the cloud, it calls out to the VMware Cloud on AWS API to scale the SDDC out to the designated size. This is only one of many techniques that can be used, but it is the easiest and most common in my experience. I’ve also known customers to query vCenter/SRM directly for more positive confirmation.
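The detection logic described above can be sketched roughly as follows. This is a minimal illustration, not the published script: the subnet value, the function names, and the `add_hosts` callback (which stands in for the actual VMware Cloud on AWS API call) are all assumptions made for the example.

```python
import ipaddress
import socket

# Hypothetical value: the subnet the VM lands on after failover to the SDDC.
CLOUD_SUBNET = ipaddress.ip_network("10.2.0.0/16")


def local_ip() -> str:
    """Return the source address the OS would use for outbound traffic."""
    # Connecting a UDP socket sends no packets; it only selects a source address.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("192.0.2.1", 80))  # TEST-NET-1 address, never contacted
        return s.getsockname()[0]


def running_in_cloud(ip: str, cloud_subnet=CLOUD_SUBNET) -> bool:
    """True when the given address falls inside the failover (cloud) subnet."""
    return ipaddress.ip_address(ip) in cloud_subnet


def scale_out_if_failed_over(ip, current_hosts, target_hosts, add_hosts):
    """Call add_hosts(n) only when running in the cloud and below target size.

    add_hosts is a hypothetical callback wrapping the real API request.
    Returns the number of hosts requested (0 when nothing was done).
    """
    if not running_in_cloud(ip):
        return 0
    missing = max(target_hosts - current_hosts, 0)
    if missing:
        add_hosts(missing)
    return missing
```

In a real run the first argument would come from `local_ip()`, and `add_hosts` would issue the authenticated request to the VMware Cloud on AWS API. Keeping the API call behind a callback makes the decision logic easy to exercise without touching a live SDDC.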


What about testing failover?

When executing a test, the VMs are connected to a designated bubble network. This network should have no path to the public internet, so during a test the script detects that it’s running in the cloud but is unable to reach the VMware Cloud API to trigger the scale-out. If the test network uses the same subnet and is routable, however, the sample provided could trigger a scale-out.
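One way to make that behavior explicit, rather than letting the API call simply fail inside the bubble network, is to check reachability first. The sketch below is an assumption-laden illustration: the host name, port, timeout, and function name are placeholders chosen for the example and are not taken from the published script.

```python
import socket

# Assumed API endpoint host and port; adjust to match your environment.
API_HOST = "vmc.vmware.com"
API_PORT = 443


def api_reachable(host=API_HOST, port=API_PORT, timeout=5.0,
                  connect=socket.create_connection) -> bool:
    """Return True if a TCP connection to the API endpoint can be opened.

    The connect parameter exists only so the check can be exercised
    without a network; by default it is socket.create_connection.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, timeouts, and unreachable networks,
        # which is what an isolated bubble network should produce.
        return False
    conn.close()
    return True
```

Note that this check only distinguishes an isolated test network from a connected one. If the test network is routable, the check passes and a scale-out could still be triggered, so an explicit opt-in flag may be the safer guard in that setup.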


Why Automate?

For the same reason one usually automates any task: it decreases risk and improves accuracy. In this case, that value is a bit debatable. On the one hand, by integrating the SDDC scaling activities into the failover itself, we greatly simplify the failover: invoke the failover, and vSphere Site Recovery will handle the rest. It’s also quite likely that there are additional components and/or services that need to be coordinated on failover, making this one of many similar API calls made during a real event.


On the other hand, failover itself is a non-trivial decision that usually requires human interaction to invoke. While some customers do automate failover itself, it’s not advisable. The reduction in failover response time is rarely worth the increased complexity. If a human is going to manually trigger failover, it’s trivially easy to click add-hosts somewhere along the way.


What about Elastic DRS?

Elastic DRS is intended to make small incremental changes to an SDDC, adding and removing hosts as needed to keep the cluster scale appropriate to the workload demand. Some customers assume they can leverage this capability on failover to automatically scale the SDDC, and to a certain extent, they can. The problem is that Elastic DRS sleeps for 30 minutes after two consecutive scale-out events. For this reason, when large SDDC scale changes are anticipated, it is preferable to scale the cluster to the intended size directly and then allow Elastic DRS to fine-tune based on load.
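To get a feel for why the sleep interval matters, here is a rough back-of-the-envelope model. It takes the 30-minute pause after two consecutive scale-out events from the description above, and it assumes (this is my simplification, not documented behavior) that each scale-out event adds a fixed number of hosts, one by default. Host provisioning time is ignored entirely, so the result is a lower bound on the added delay.

```python
COOLDOWN_MINUTES = 30        # pause length described in the text
EVENTS_BEFORE_COOLDOWN = 2   # consecutive scale-out events before the pause


def cooldown_minutes(hosts_needed: int, hosts_per_event: int = 1) -> int:
    """Estimate minutes spent waiting in cooldown before the target is reached.

    hosts_per_event is an assumption of this model, not a documented value.
    Time spent actually provisioning hosts is not included.
    """
    events = -(-hosts_needed // hosts_per_event)  # ceiling division
    if events <= 0:
        return 0
    # A pause that follows the final event does not delay reaching the target.
    pauses = (events - 1) // EVENTS_BEFORE_COOLDOWN
    return pauses * COOLDOWN_MINUTES


if __name__ == "__main__":
    for n in (2, 3, 5, 9):
        print(n, "hosts ->", cooldown_minutes(n), "minutes of cooldown")
```

Under these assumptions, the waiting time grows with every additional pair of hosts, which is the reason for scaling directly to the intended size during a failover.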


To view the latest status of features for VMware Cloud on AWS, visit https://cloud.vmware.com/vmc-aws/roadmap.