Posted by Cormac Hogan
Technical Marketing Manager (Storage)
This is a discussion that comes up regularly, going way back to vSphere 4.0 when we first introduced some serious features around thin provisioning storage for Virtual Machines. The objective of this post is to look at the options available to you for over-committing/over-allocating storage, and for avoiding stranded storage scenarios. The whole point of thin provisioning, whether done on the array or at the hypervisor layer, is to allow a VM to run with just the storage it needs, rather than giving it storage that it might only use sometime in the future. After all, you're paying for this storage, so the last thing you want is to be paying for something you might never use.
Let's begin by looking at the various types of Virtual Machine disks (VMDKs) that are available to you.
Thin – These virtual disks do not reserve space on the VMFS filesystem, nor do they reserve space on the back-end storage. They only consume blocks when data is written to disk from within the VM/Guest OS. The amount of actual space consumed by the VMDK starts out small, but grows in size as the Guest OS commits more I/O to disk, up to a maximum size set at VMDK creation time. The Guest OS believes that it has the maximum disk size available to it as storage space from the start.
Thick (aka LazyZeroedThick) – These disks reserve space on the VMFS filesystem, but there is an interesting caveat. Although they are called thick disks, they behave similarly to thinly provisioned disks. Disk blocks are only used on the back-end (array) when they get written to from inside the VM/Guest OS. Again, the Guest OS inside this VM thinks it has the maximum size available from the start.
EagerZeroedThick – These virtual disks reserve space on the VMFS filesystem and zero out the disk blocks at creation time. This disk type may take a little longer to create as it zeroes out the blocks, but its performance should be optimal from deployment time (no overhead in zeroing out disk blocks on-demand, meaning no latency incurred from the zeroing operation). However, if the array supports the VAAI Zero primitive which offloads the zero operation to the array, then the additional time to create the zeroed out VMDK should be minimal.
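The three disk types above differ on just two properties: whether space is reserved on VMFS at creation, and whether blocks are zeroed up front. Here is a minimal sketch of that summary. The class and function names are my own for illustration and are not a VMware API, and the space figures ignore metadata and block granularity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskType:
    name: str
    reserves_vmfs_space: bool  # full size reserved on the VMFS datastore at creation
    zeroed_at_creation: bool   # every block zeroed up front instead of on first write


# Properties as described in the text above.
THIN = DiskType("thin", reserves_vmfs_space=False, zeroed_at_creation=False)
LAZY_ZEROED_THICK = DiskType("lazyzeroedthick", reserves_vmfs_space=True, zeroed_at_creation=False)
EAGER_ZEROED_THICK = DiskType("eagerzeroedthick", reserves_vmfs_space=True, zeroed_at_creation=True)


def vmfs_space_gb(disk, provisioned_gb, written_gb):
    """Approximate space taken on the VMFS datastore (simplified)."""
    return provisioned_gb if disk.reserves_vmfs_space else written_gb


def backend_space_touched_gb(disk, provisioned_gb, written_gb):
    """Approximate space written on the back-end array (simplified).

    Assumes the array stores every zero that is written to it, so a disk
    zeroed at creation touches its full provisioned size.
    """
    return provisioned_gb if disk.zeroed_at_creation else written_gb


if __name__ == "__main__":
    for disk in (THIN, LAZY_ZEROED_THICK, EAGER_ZEROED_THICK):
        print(disk.name,
              vmfs_space_gb(disk, 40, 10),
              backend_space_touched_gb(disk, 40, 10))
```

For a hypothetical 40 GB disk with 10 GB written, only the thin disk takes less than 40 GB on VMFS, and only the eagerzeroedthick disk touches all 40 GB on the back-end.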
Option 1 – Thin Provision at the Array Side
If your storage array supports it, devices/LUNs can be thinly provisioned at the back-end/array. The advantage is physical disk space savings. There is no need to calculate provisioned storage based on the total size of the VMDKs. Storage Pools of 'thin' disks (which can grow over time) can now be used to present datastores to ESXi hosts. VMs using thin or lazyzeroedthick VMDKs will now consume what they need rather than what they are allocated, which results in a capex saving (no need to purchase additional disk space up front). Most arrays which allow thin provisioning will generate events/alarms when the thin provisioned devices/pools start to get full. In most cases, it is simply a matter of dropping more storage into the pool to address this, but of course the assumption here is that you have a SAN admin who is monitoring for these events.
Advantages of Thin Provisioning at the back-end:
- Address situations where a Guest OS or applications require lots of disk space before they can be installed, but might end up using only a portion of that disk space.
- Address situations where your customers state they need a lot of disk space for their VM, but might end up using only a portion of that disk space.
- In larger environments which employ SAN admins, the monitoring of over-committed storage falls on the SAN admin, not the vSphere admin (in situations where the SAN admin is also the vSphere admin, this isn't such an advantage)
Option 2 – Thin Provision at the Hypervisor Side
There are a number of distinct advantages to using Thin Provisioned VMDKs. In no specific order:
- As above, address situations where a Guest OS or applications require lots of disk space before they can be installed, but might end up using only a portion of that disk space.
- Again as above, address situations where your customers state they need a lot of disk space for their VM, but might end up using only a portion of that disk space.
- Over-commit in a situation where you need to deploy more VMDKs than the currently available disk space at the back-end, perhaps because additional storage is on order, but not yet in place.
- Over-commit, but on storage that does not support Thin Provisioning on the back-end (e.g. local storage).
- No space reclamation/dead space accumulation issues. More on this shortly.
- Storage DRS space usage balancing features can be used when one datastore in a datastore cluster starts to run out of space, possibly as a result of thinly provisioned VMs growing in size.
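To put a number on over-commitment at the hypervisor side, you can compare the total provisioned size of the thin VMDKs on a datastore with the datastore's capacity. This is a rough sketch only. The function name and the sample sizes are hypothetical, and in practice you would take the figures from your own inventory.

```python
def overcommit_ratio(provisioned_gb, capacity_gb):
    """Return total provisioned VMDK size divided by datastore capacity.

    A result above 1.0 means the datastore is over-committed: the VMs
    could, in principle, ask for more space than physically exists.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return sum(provisioned_gb) / capacity_gb


if __name__ == "__main__":
    vmdk_sizes_gb = [100, 100, 200, 50]  # hypothetical thin VMDKs, maximum sizes
    ratio = overcommit_ratio(vmdk_sizes_gb, capacity_gb=300)
    print(f"over-commit ratio: {ratio:.2f}")  # prints "over-commit ratio: 1.50"
```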
Thin Provisioning Concerns
There are a few concerns with Thin Provisioning.
- Possibly the biggest issue that we have with Thin Provisioning is running out of space on a device that is Thin Provisioned at the back-end. Prior to vSphere 5.0, we didn't have any notifications about this in the vSphere layer, and when the thinly provisioned datastore filled up, all of the VMs on that datastore were affected. In vSphere 5.0 a number of enhancements were made through VAAI:
– VAAI will now automatically raise an alarm in vSphere if a Thin Provisioned datastore becomes 75% full
– VMs residing on a Thin Provisioned datastore that runs out of space now behave differently than before; only VMs which require additional disk space are paused. VMs which do not require additional disk space continue to run quite happily even though there is no space left on the datastore.
– If the 75% alarm triggers, Storage DRS will no longer consider this datastore as a destination.
- The second issue is dead space reclamation and the inability to reuse space on a Thin Provisioned datastore. Prior to vSphere 5.0, if a VM's files were deleted or if a VM was Storage vMotioned, we had no way of informing the array that we were no longer using this disk space. In 5.0, we introduced a new VAAI primitive called UNMAP which informs the array about blocks that are no longer used. Unfortunately there were some teething issues with the initial implementation, but we expect to have an update on this very shortly.
- If the VMDK is provisioned as thin, then each time the VMDK grows (new blocks added), the VMFS datastore has to be locked so that its metadata can be updated with the new size information. Historically this was done with SCSI reservations, and could cause some performance related issues if a lot of thinly provisioned VMs were growing at the same time. With the arrival of VAAI and the Atomic Test & Set primitive (ATS), which replaces SCSI reservations for metadata updates, this is less of a concern these days.
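To make the vSphere 5.0 out-of-space behaviour described above a little more concrete, here is a toy model of it. It is not how vSphere implements it: the function names are my own, and I am assuming the alarm fires once usage reaches 75% of capacity.

```python
ALARM_THRESHOLD = 0.75  # the 75% full mark mentioned above


def alarm_raised(used_gb, capacity_gb, threshold=ALARM_THRESHOLD):
    """True once the thin provisioned datastore reaches the threshold."""
    return used_gb / capacity_gb >= threshold


def eligible_sdrs_destinations(datastores):
    """Names of datastores that have not tripped the alarm.

    `datastores` maps a name to a (used_gb, capacity_gb) tuple.
    """
    return [name for name, (used, cap) in datastores.items()
            if not alarm_raised(used, cap)]


def paused_vms(needs_more_space):
    """On a full datastore, only VMs that need additional space are paused.

    `needs_more_space` maps a VM name to True/False.
    """
    return [vm for vm, needs in needs_more_space.items() if needs]


if __name__ == "__main__":
    stores = {"ds01": (80, 100), "ds02": (40, 100)}  # hypothetical datastores
    print(eligible_sdrs_destinations(stores))         # ['ds02']
    print(paused_vms({"web01": True, "db01": False}))  # ['web01']
```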
So which option should I go for?
Thick on Thin
First, eagerzeroedthick VMDKs do not lend themselves to thin provisioning at the back-end since all of the allocated space is zeroed out at creation time. This leaves us with the option of lazyzeroedthick, and this works just fine on thin provisioned devices. In fact, many storage array vendors recommend this approach, as the management of the over-committed devices falls to the SAN admin, and as mentioned earlier, most of these arrays have alarms/events to raise awareness around space consumption, and the storage pool providing the thin provisioned device is easily expanded. One caveat though is the dead space accumulation through the deletion of files and the use of Storage vMotion & Storage DRS, but hopefully we'll have an answer for this in the very near future.
Thin on Thick
One of the very nice things about this approach is that, through the use of Storage DRS, when one datastore in a datastore cluster starts to run out of space, possibly as a result of thinly provisioned VMs growing in size, SDRS can use Storage vMotion to move VMs around the remaining datastores in the datastore cluster and avoid a datastore filling up completely. The other advantage is that there are no dead-space accumulation/reclamation concerns as the storage on the back-end is thickly provisioned. One factor to keep in mind though is that thin provisioned VMDKs perform slightly worse than thick VMDKs, as the new blocks allocated to the VMDK must be zeroed out before the I/O in the Guest OS is committed to disk. The metadata updates may also involve SCSI Reservations instead of VAAI ATS if the array does not support VAAI. However, once these VMDKs have grown to their optimum size (little further growth), this overhead is no longer a concern.
Thin on Thin
This is the option I get the most queries about. Wouldn't this give you the best of both worlds? While there is nothing inherently wrong with doing thin-on-thin, there is an additional management overhead incurred with this approach. While VAAI has introduced a number of features to handle over-commitment as discussed earlier, thin provisioning will still have to be managed at the host (hypervisor) level as well as at the storage array level. Also keep in mind that this level of over-commitment could lead to out of space conditions occurring sooner rather than later. At the VMDK level, you once again have the additional latency of zeroing out blocks, and at the array level you have the space reclamation concern.
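One way to see why thin-on-thin can run out of space sooner is that the two layers of over-commitment multiply. A back-of-the-envelope sketch follows. The function name and the figures are hypothetical, purely for illustration.

```python
def effective_overcommit(vmdk_provisioned_gb, datastore_gb, pool_physical_gb):
    """Overall over-commit ratio when both layers are thin.

    hypervisor ratio = provisioned VMDK space / datastore (LUN) size
    array ratio      = datastore (LUN) size   / physical pool capacity
    The product is provisioned VMDK space per unit of physical capacity.
    """
    hypervisor_ratio = vmdk_provisioned_gb / datastore_gb
    array_ratio = datastore_gb / pool_physical_gb
    return hypervisor_ratio * array_ratio


if __name__ == "__main__":
    # 2000 GB of thin VMDKs on 1000 GB of thin LUNs backed by 500 GB of disk.
    print(effective_overcommit(2000, 1000, 500))  # prints 4.0
```

In this made-up case each layer is only 2:1 over-committed, yet the VMs have been promised four times the physical capacity.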
With all this in mind, you will have to trade off each of these options against each other to see which is the most suitable for your environment.
How much space does a thin disk consume?
On classic ESX, you can use the du command against a VMDK to determine this.
On ESXi, you can use the stat command against a VMDK to get the same info.
# stat zzz-flat.vmdk
Size: 4294967296 Blocks: 3502080 IO Block: 131072 regular file
Device: c2b73def5e83e851h/14030751262388250705d Inode: 163613572 Links: 1
Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2012-02-29 13:04:53.000000000
Modify: 2012-03-01 15:29:11.000000000
Change: 2012-02-22 00:12:40.000000000
This is a thinly provisioned 4 GB VMDK (Size is reported in bytes: 4294967296), but only ~1.8 GB (3502080 blocks * 512 bytes) is actually used.
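If you want to do the same arithmetic in a script, here is a small sketch. It assumes the usual stat convention that Size is in bytes and Blocks counts 512-byte units. The helper names are my own, and `thin_usage_of_file` relies on `st_blocks`, which is only available on Unix-like systems.

```python
import os

BLOCK_UNIT = 512  # assumed size in bytes of one unit in stat's "Blocks" field


def thin_usage(size_bytes, blocks, block_unit=BLOCK_UNIT):
    """Return (provisioned_bytes, used_bytes, used_fraction) from stat fields."""
    used_bytes = blocks * block_unit
    return size_bytes, used_bytes, used_bytes / size_bytes


def thin_usage_of_file(path):
    """Same calculation, reading the fields straight from the file."""
    st = os.stat(path)
    return thin_usage(st.st_size, st.st_blocks)


if __name__ == "__main__":
    # Values taken from the stat output above.
    provisioned, used, fraction = thin_usage(4294967296, 3502080)
    print(provisioned / 2**30)  # 4.0 (GiB provisioned)
    print(used)                 # 1793064960 (bytes used, roughly 1.8 GB)
    print(f"{fraction:.0%}")    # 42%
```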
A note on Fault Tolerant VM & MSCS nodes
Be aware that if you turn on FT on a VM, and that VM is using a thinly provisioned VMDK, that VMDK is inflated with zeroes. This is a caveat of FT which uses multi-writer mode on the VMDK, which is needed when the primary and secondary FT VMs are to access the same disk. The same is true for VMs which participate as nodes in Microsoft Clustering – See http://kb.vmware.com/kb/1033570
I may have missed some pros and cons of some of the options listed above. Feel free to leave comments on why you feel one approach is better than another. I'd be very interested to hear.
Get notification of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage