Discover how the Autoscaler Service within VMware Cloud on AWS continually monitors the health of your infrastructure and remediates problems when the need arises. It’s all designed to help you build a truly resilient, highly available SDDC.
One of the benefits of running your workloads in VMware Cloud on AWS is that VMware manages the platform, including all of the infrastructure and management components. VMware also performs regular updates across the SDDC fleet to deliver new features, bug fixes, and software upgrades.
Operationalizing common tasks for these components is crucial. The Autoscaler Service within the platform helps with this. Autoscaler consists of three primary functions:
- Auto Remediation: Replace problematic infrastructure based on virtual infrastructure events.
- Planned Maintenance: Replace Amazon EC2 instances and vSAN witness virtual machines (VMs) that are scheduled for retirement.
- Dynamic Scalability: Scale the SDDC up or down dynamically based on resource usage. You can read more about this in my Elastic DRS blog post.
The goal is to ensure your SDDC is truly elastic and self-healing without impacting the hosted workloads.
Continuous monitoring and validation
The three primary functions above are possible because we continuously monitor the health of various SDDC components and services. When an event occurs, it is forwarded to the Autoscaler, which reacts quickly to validate the event and execute a remediation plan based on the type of event.
Prior to executing the remediation plan, the service validates the condition. This guards against transient errors: for example, a minor network glitch may trigger a false positive. If an event states that a host is disconnected from vCenter when in reality it is connected and healthy, validation catches the discrepancy and the event is ignored instead of remediated.
If the event is validated and identified as a real failure, the remediation plan is executed.
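The validate-then-remediate flow described above can be sketched roughly as follows. This is an illustration only, not the Autoscaler's actual implementation: the `Event` type, the `probe` callback, and the number of re-check attempts are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are assumptions."""
    host: str
    kind: str  # e.g. "host_disconnected"


def validate(event: Event, probe: Callable[[str], bool], attempts: int = 3) -> bool:
    """Return True only if every re-check confirms the host is unhealthy.

    `probe(host)` is a stand-in health check that returns True when the
    host is actually healthy. The attempt count is an assumed value.
    """
    for _ in range(attempts):
        if probe(event.host):
            return False  # host looks healthy: treat the event as a false positive
    return True


def handle(event: Event, probe: Callable[[str], bool],
           remediate: Callable[[Event], None]) -> str:
    """Remediate validated events and ignore transient ones."""
    if validate(event, probe):
        remediate(event)
        return "remediated"
    return "ignored"
```

With a probe that reports the host as healthy, `handle` returns `"ignored"` and the remediation callback is never invoked, which mirrors the disconnected-host false positive described above.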
Remediation in the event of a failure
Let’s look at a host failure example. Whether on-premises or in the cloud, components within a host can and will fail.
Sometimes it’s a minor issue, where the host keeps running but in a degraded state. This could be the failure of a redundant component like a fan or power supply, or even a single memory module. Other times, the component failure could be catastrophic, such as a processor or system board. In this case, Autoscaler receives the event, validates it, and then springs into action.
A key advantage of VMware Cloud on AWS is that we always have access to a fleet of hardware. This allows us to provision and add a host immediately to the cluster to ensure there is enough compute and storage capacity to perform VM migrations or an HA reboot if necessary. If a non-transient event occurs, a host is provisioned and added to the cluster before remediation action continues.
It’s important to note that you would never be charged for the addition of a host during auto remediation processes. The only time you would be charged for a host is in the event of an Elastic DRS (EDRS) scale-up due to storage or compute constraints from customer workloads.
Sample remediation plans
Let’s look at some high-level examples of what remediation plans might include:
- IF host experienced PSOD, THEN collect EBS snapshot and reboot host
- IF host is still not healthy, THEN remove and re-sync vSAN data to new host
- IF vSAN is not healthy, THEN soft reboot host and trigger vSAN repair
- IF host has history of multiple failures, THEN remove and re-sync vSAN data to new host
Of course, these are high-level examples, and the workflows can range from very simple to complex in an effort to maintain SDDC availability.
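As a rough illustration, the IF/THEN examples above could be expressed as an ordered set of rules that map host state to a list of actions. Everything here is a sketch under assumptions: `HostState`, `build_plan`, and the `failure_threshold` value used to decide what counts as "multiple failures" are hypothetical and do not come from the service itself.

```python
from dataclasses import dataclass
from typing import List

REPLACE_HOST = "remove host and re-sync vSAN data to new host"


@dataclass
class HostState:
    """Hypothetical snapshot of a host's condition; field names are assumptions."""
    psod: bool = False             # host experienced a PSOD
    still_unhealthy: bool = False  # host remains unhealthy after earlier steps
    vsan_unhealthy: bool = False   # vSAN health check is failing
    failure_count: int = 0         # past failures recorded for this host


def build_plan(state: HostState, failure_threshold: int = 2) -> List[str]:
    """Translate host state into an ordered list of remediation actions."""
    actions: List[str] = []
    if state.psod:
        actions += ["collect EBS snapshot", "reboot host"]
    if state.vsan_unhealthy:
        actions += ["soft reboot host", "trigger vSAN repair"]
    # Either condition leads to the same outcome, so the action is added once.
    if state.still_unhealthy or state.failure_count >= failure_threshold:
        actions.append(REPLACE_HOST)
    return actions
```

For instance, `build_plan(HostState(psod=True))` yields the snapshot and reboot steps, while a host with a history of failures goes straight to replacement.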
As mentioned above, remediation steps only occur after an additional host has been successfully added to the cluster. Once remediation has been performed, and if the failure in the original host was resolved and health checks pass, the newly added host is placed in maintenance mode and removed from inventory.
However, if the failed host could not be recovered, it is removed and the newly added host remains in the cluster. Once a failed host is removed from the cluster, it is returned to the fleet for AWS to repair.
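The host lifecycle during remediation (add a host first, attempt recovery, then keep whichever host is healthy) can be sketched as a small function. This is a simplified model under assumptions: the cluster is represented as a plain list of host names, and `recover` is a hypothetical callback standing in for the remediation steps and health checks.

```python
from typing import Callable, List


def remediate_with_spare(cluster: List[str], failed: str, spare: str,
                         recover: Callable[[str], bool]) -> List[str]:
    """Return the cluster membership after one remediation cycle.

    `recover(host)` is an assumed callback that returns True when the
    failed host was repaired and passes health checks.
    """
    # The spare joins before any remediation, so capacity is never reduced.
    hosts = cluster + [spare]
    if recover(failed):
        # Original host is healthy again: the spare is placed in
        # maintenance mode and removed from inventory.
        hosts.remove(spare)
    else:
        # Original host could not be recovered: it leaves the cluster
        # and the spare stays as its replacement.
        hosts.remove(failed)
    return hosts
```

In both outcomes the cluster ends with the same number of hosts it started with, which is the property the process above is designed to preserve.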
Building a truly resilient SDDC
While the above example referenced a host hardware/component failure, Autoscaler also addresses software failures, such as PSODs and failures in vCenter, vSAN, FDM, and so on.
It’s all about giving you access to the services and workflows that enable your SDDC to be truly resilient and highly available.
Resources
- Learn more about VMware Cloud on AWS at the VMware Cloud on AWS website or by viewing VMware Cloud on AWS: Overview
- Follow us on Twitter @vmwarecloudaws and give us a shout with #VMWonAWS
- Watch informative demos, overview videos, webinars and hear from our customers: VMware Cloud on AWS on YouTube
- Try the VMware Cloud on AWS Lightning Lab for a first-hand immersive experience
- Read our latest VMware Cloud on AWS blogs
- Read the VMware Cloud on AWS Solution Brief and VMware Cloud on AWS TCO 1-pager
- Follow the VMware Cloud on AWS release notes on continuing updates
- Read Technical Guides on Operations, Applications, and Performance
- Explore Feature Walkthroughs of Deployment, Configuration, Networking, and more
- Listen to the latest episodes of the VMware Cloud on AWS Unplugged Podcast