VMware Cloud on AWS

Improved Storage Policy Capacity Management

Hidden within the 1.12 update is a significant change to the vSAN object layout that reduces the capacity consumed while a Storage Policy change is in flight. While this change is entirely transparent for most of our customers, we’d still like to discuss what’s happening and why it matters.

Background

vSAN has a series of interlocking systems that make policy reconfiguration as seamless as possible. However, there are still scenarios where changing a policy could cause Elastic DRS to add one or more Hosts because the datastore ran short of free capacity. This was especially severe when manipulating massive objects because, before 1.12 (vSAN 7 Update 1), vSAN required enough free capacity to store an object using both the source and destination policies at the same time.

Why?

Under the covers, an Object in vSAN consists of several components. The exact number and composition depend on the policy configuration, but no single component can be larger than 255GiB. Prior versions of vSAN would stripe across components in a RAID-0 tree, enabling objects to grow up to 62TiB. While powerful, this approach also meant that the object had to be managed as a single entity. Policy reconfigurations were always non-disruptive but could become capacity intensive.

For example, a 22TiB VMDK using 1 Failure – RAID-1 would consume 44TiB of datastore capacity, because RAID-1 keeps two full copies of the data. To transition that object to a 2 Failure – RAID-6 policy, which consumes 1.5 times the VMDK size, there must be at least 33TiB of consumable free capacity in the datastore while the original 44TiB remains in place. As you can imagine, this can cause, and has caused, one or more temporary hosts to be added by Elastic DRS due to the capacity trigger.
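
The arithmetic in this example can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the multipliers are the usual raw-capacity overheads for these two policies (2x for RAID-1 with one failure tolerated, 1.5x for RAID-6 with two), which is what the 44TiB and 33TiB figures above imply.

```python
# Raw-capacity multipliers per policy (assumption: standard vSAN overheads,
# consistent with the 44TiB and 33TiB figures in the text).
POLICY_MULTIPLIER = {
    "1 Failure - RAID-1": 2.0,   # two full mirrors
    "2 Failure - RAID-6": 1.5,   # 4 data + 2 parity
}


def raw_capacity_tib(vmdk_tib, policy):
    """Datastore capacity consumed by a VMDK under the given policy."""
    return vmdk_tib * POLICY_MULTIPLIER[policy]


def free_needed_before_1_12_tib(vmdk_tib, target_policy):
    """Before 1.12 the complete target layout had to fit in free capacity."""
    return raw_capacity_tib(vmdk_tib, target_policy)


current = raw_capacity_tib(22, "1 Failure - RAID-1")
needed = free_needed_before_1_12_tib(22, "2 Failure - RAID-6")
peak = current + needed  # both layouts exist until the change completes

print(f"consumed today: {current} TiB")
print(f"free capacity needed for the change: {needed} TiB")
print(f"peak footprint during the change: {peak} TiB")
```

In this simplified model the object briefly occupies 77TiB, even though it settles at 33TiB once the source layout is released.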

Introducing Large Objects

Starting with 1.12, vSAN employs an improved object structure. The component architecture is unchanged, but Objects are now created by concatenating multiple components together rather than striping across them.

This change allows vSAN to reconfigure an object one component at a time, walking through the object serially so that the transition between policies needs only about a single component’s worth of free capacity.
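
To make the difference concrete, here is a deliberately simplified Python model of the peak extra capacity a policy change needs. It is not vSAN's actual algorithm: the function name is hypothetical, every component is assumed to be 255GiB or smaller, and each component is assumed to be fully rewritten under the target policy before its source copy is released.

```python
def peak_transient_gib(component_sizes_gib, src_mult, dst_mult, serial=True):
    """Peak extra capacity (GiB) above the starting footprint during a change.

    src_mult / dst_mult are raw-capacity multipliers for the source and
    target policies (for example 2.0 for RAID-1, 1.5 for RAID-6).
    """
    if not serial:
        # Old behaviour: the whole object is rebuilt under the target policy
        # before any of the source layout is released.
        return sum(component_sizes_gib) * dst_mult

    extra = 0.0
    peak = 0.0
    for size in component_sizes_gib:
        extra += size * dst_mult   # write this component under the target policy
        peak = max(peak, extra)
        extra -= size * src_mult   # then release the source copy
    return peak


# A 22TiB VMDK (22528 GiB) modelled as 88 full 255GiB components plus an 88GiB tail.
sizes = [255] * 88 + [88]
all_at_once = peak_transient_gib(sizes, 2.0, 1.5, serial=False)
one_by_one = peak_transient_gib(sizes, 2.0, 1.5, serial=True)

print(f"whole object at once: {all_at_once} GiB")
print(f"one component at a time: {one_by_one} GiB")
```

Under these assumptions the whole-object approach needs 33792GiB (33TiB) of free capacity, while the serial walk peaks at 382.5GiB, the footprint of one 255GiB component under the target policy.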

As if this weren’t enough, the team also added a new large component variant to reduce the component count of truly massive workloads. The way it works is relatively intuitive. The first 8TiB (or 32 components) of any object is provisioned on demand using 255GiB components. After 8TiB, vSAN switches to 765GiB components. This vastly reduces the total number of components required to create objects of 10TiB or more.
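
As a rough illustration of the saving, the following Python sketch counts the components needed for a single copy of an object's data under both layouts. The names are hypothetical, the switch to large components is assumed to happen after exactly 32 small components (8160GiB, just under 8TiB), and replicas, parity, and witness components are ignored.

```python
SMALL_GIB = 255          # maximum size of a standard component
LARGE_GIB = 765          # size of the new large component
SMALL_COUNT_LIMIT = 32   # small components used before switching (assumption)


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def component_count(size_gib, large_components=True):
    """Components needed to hold size_gib of data (one copy, no parity)."""
    if not large_components:
        return ceil_div(size_gib, SMALL_GIB)
    small_capacity = SMALL_COUNT_LIMIT * SMALL_GIB
    small = ceil_div(min(size_gib, small_capacity), SMALL_GIB)
    large = ceil_div(max(size_gib - small_capacity, 0), LARGE_GIB)
    return small + large


for tib in (4, 8, 22, 62):
    gib = tib * 1024
    old = component_count(gib, large_components=False)
    new = component_count(gib)
    print(f"{tib:>2} TiB: {old} components before, {new} with large components")
```

With these assumptions, a 22TiB object drops from 89 components to 51, and a 62TiB object from 249 to 105.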

Impact on existing customers

For the most part, this will be transparent. Once we begin the upgrade, VMDKs smaller than 8TiB are converted in place. This will cause some increased activity while vSAN performs a one-time object conversion. Once upgraded, these objects will use the improved object layout and are future-proofed even if they grow past 8TiB.

Objects larger than 8TiB will need to undergo a similar but more intensive one-time conversion, as the data needs to move in order to use the new large-component layout. A vCenter alarm will notify VMware Operations so they can schedule the upgrade to minimize the impact.

The object upgrade process is similar to a traditional policy resynchronization and may require additional hosts. Any hosts added to help facilitate either phase of the upgrade will not be billable to the customer. They are part of the upgrade process and included in the service. Once completed, Elastic DRS will be re-engaged with the same thresholds as before, but vSAN will now be able to throttle policy rebuilds, reducing potential Host additions in the future.

Summary

The 1.12 upgrade is shaping up to be a powerhouse of simplified operations. With that said, not everything is free, and in this case, the upgrade itself may take slightly longer than we’d like. Additionally, while vSAN will throttle these activities as needed to protect your workload, you may notice a performance impact at times. We apologize if this transition causes any inconvenience. Once completed, we hope to reduce, if not completely eliminate, host additions due to transient capacity. This upgrade is the result of several years of accumulated work and reflects the service’s commitment to continuous improvement. More importantly, this upgrade is the first step towards further increasing the usable capacity within the service. We look forward to sharing the next steps in our relentless efforts to improve the service!

Availability

To view the latest status of features for VMware Cloud on AWS, visit https://cloud.vmware.com/vmc-aws/roadmap.

Resources: