
VMFS Extents – Are they bad, or simply misunderstood?

If there is one thing that is sure to fire up a debate in VMware storage circles, it is whether or not a VMFS volume should make use of extents. I’ve watched many an email thread about this, and engaged in a few discussions myself. What I want to do in this post is show what some of the pros and cons are, explode some of the myths, and then let you make up your own mind as to whether you want to use them or not. I’ll give you my own opinion at the end of the post.

 

What is an Extent?

Probably best to describe what an extent is first of all. A VMFS volume resides on one or more extents. Each extent is backed by a partition on a physical device, such as a LUN. Normally there is only one extent per LUN (the whole LUN contains a single partition which is used as a VMFS extent). An extent is used when you create a VMFS volume, and more extents can be added if you want to expand the volume.

The maximum size of a single VMFS-3 extent was 2TB. This was due to a number of things, including our reliance on SCSI-2 addressing and the MBR (Master Boot Record) partition format. Since no single LUN/extent used for VMFS-3 could be larger than 2TB, adding extents was the only way to grow a VMFS-3 volume beyond that size; adding 32 extents/LUNs to the same VMFS-3 volume gave you a maximum size of 64TB. VMFS-5 volumes, introduced in vSphere 5.0, can be as large as 64TB on a single extent. This is because we implemented the GPT (GUID Partition Table) format and made a significant number of SCSI improvements.
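
If you want to see how this maps onto your own environment, the extent layout of each datastore is visible through the vSphere API. Below is a minimal, illustrative sketch using pyVmomi (the vCenter address and credentials are placeholders) that prints each VMFS datastore's version, capacity and the device/partition backing every one of its extents; it is just one way of pulling this information, not an official tool.

```python
# A minimal, illustrative pyVmomi sketch. The vCenter address and
# credentials are placeholders; adjust for your environment.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()            # lab use only
si = SmartConnect(host="vcenter.example.com",     # placeholder vCenter
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in view.view:
        if ds.summary.type != "VMFS":
            continue
        vmfs = ds.info.vmfs
        print("%s (VMFS-%d, %.1f TB, %d extent(s))" % (
            ds.name, vmfs.majorVersion,
            ds.summary.capacity / (1024.0 ** 4), len(vmfs.extent)))
        for ext in vmfs.extent:
            # Each extent is a partition on a physical device (LUN)
            print("    %s, partition %d" % (ext.diskName, ext.partition))
finally:
    Disconnect(si)
```

Any datastore that lists more than one device under it is a spanned, multi-extent volume.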

This all seems OK, right? So what’s the problem? Why have extents got such a bad name? Let’s begin by exploding a few of the myths around them.

 

Misconception #1 – Extents are like RAID stripes

This is one of the common misconceptions. I’ve seen some folks believe that the Virtual Machines deployed on a VMFS volume (with extents) are striped (or the file blocks/clusters allocated to the VMs are striped) across the different extents.

This is not correct. Extents are not like stripes; if anything, they are more akin to a concatenation than a stripe. VMFS does not rotate Virtual Machine placement, or even per-VM block or cluster allocations, across the different extents in the datastore.

I think this misconception arises because extents are being confused with how resource management works on a VMFS volume. VMFS resource management attempts to separate cluster resources on a per-host basis to avoid lock contention. You may observe VMs from different hosts being placed at completely different offsets on a VMFS datastore, and perhaps even on different extents. My good friend Satyam Vaghani gave a very good presentation on this at his VMworld 2009 session, TA3320. The hosts try to put distance between themselves on the datastore to reduce contention for resources, but each host still tries to keep the objects it manages close together.

The X-axis represents the volume layout. The files on the left are from host A; the files towards the center right are from host B. As you can see, host B’s files are offset from the start of the volume. These offsets typically follow a uniform distribution across the entire file system address space. So, in a multi-extent situation, hosts concurrently accessing a VMFS volume will naturally spread themselves out across all extents (since the address space is a logical concatenation of all extents). In effect, if you have multiple hosts accessing a VMFS volume, the load may be distributed across multiple extents. Note that with a single host, or a very small number of hosts on a very large number of extents, the load may not be evenly distributed.
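
To make the concatenation point concrete, here is a small, purely illustrative Python model (not VMFS code, and the extent sizes are made up) contrasting how a concatenated address space resolves a logical block with how a RAID-0 style stripe would resolve it:

```python
# Purely illustrative model, not VMFS source code. It contrasts how a
# concatenated address space (what VMFS extents give you) resolves a
# logical block with how a RAID-0 style stripe would resolve it.

EXTENT_BLOCKS = [1000, 1000, 1000]   # blocks contributed by each extent (made up)
STRIPE_WIDTH = 64                    # chunk size for the stripe model only

def concat_resolve(lba):
    """Concatenation: extent 0 holds blocks 0-999, extent 1 holds 1000-1999, etc."""
    for extent, size in enumerate(EXTENT_BLOCKS):
        if lba < size:
            return extent, lba
        lba -= size
    raise ValueError("LBA beyond end of volume")

def stripe_resolve(lba):
    """Striping: consecutive chunks rotate round-robin across the extents."""
    chunk, within = divmod(lba, STRIPE_WIDTH)
    extent = chunk % len(EXTENT_BLOCKS)
    offset = (chunk // len(EXTENT_BLOCKS)) * STRIPE_WIDTH + within
    return extent, offset

for lba in (0, 500, 1500, 2500):
    print(lba, "concat ->", concat_resolve(lba), "stripe ->", stripe_resolve(lba))
```

In the concatenated case, a block address below 1000 always lands on the first extent; nothing is rotated across devices. It is the per-host offsets described above, not striping, that spread load across extents.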

Not only that, but the resource manager also tries to allocate a contiguous range of blocks to disk files (thin, thick, eager zeroed thick), as can be seen in this slide, also taken from Satyam’s presentation.

See this example of VMs with different virtual disks deployed to a VMFS-3 volume:

Here the X-axis is the volume layout, and the Y-axis represents the number of blocks. Of course, as available space reduces, you could find a VM’s disk spanning two or more extents on a datastore. The same could be true for thin disks, which might need to allocate their next resource cluster from another extent.

Because of this contiguous allocation of space (which can be on the order of hundreds of megabytes or even gigabytes), VMFS does not suffer from the traditional fragmentation issues seen on other filesystems. However, if a file that is grown by host A at time t0 is later grown by host B at time t1, and the same per-host resource distribution scheme is in play, then it is likely that the block clusters for that file will be scattered across the logical address space. When you think about DRS moving VMs between hosts, and the use of thin disks, you can see that those disks will end up getting resources from various regions. It’s still not enough to raise concerns about fragmentation, however.
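
As a toy illustration of that behaviour, and only an illustration since the real VMFS resource manager is considerably more sophisticated, the sketch below gives each host its own region of a concatenated address space and allocates contiguous cluster runs from it. A file grown by two different hosts therefore picks up clusters from two different regions:

```python
# Toy model of per-host contiguous allocation on a concatenated address
# space. Illustrative only; the real VMFS resource manager is far more
# sophisticated than this.

VOLUME_CLUSTERS = 4096                      # total resource clusters (made up)
HOSTS = ["host-A", "host-B", "host-C", "host-D"]

# Spread the hosts' starting offsets uniformly across the address space,
# then let each host allocate contiguously from its own region.
region_size = VOLUME_CLUSTERS // len(HOSTS)
next_free = {h: i * region_size for i, h in enumerate(HOSTS)}

def allocate(host, clusters):
    """Hand out a contiguous run of clusters from the host's own region."""
    start = next_free[host]
    next_free[host] += clusters
    return list(range(start, start + clusters))

# A file grown by host-A at t0 and again by host-B at t1 ends up with
# contiguous runs in two different regions of the address space.
vmdk_clusters = allocate("host-A", 8) + allocate("host-B", 8)
print(vmdk_clusters)    # [0..7] followed by [1024..1031]
```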

 

Misconception #2 – Losing one extent will offline the whole volume

Not completely true. Back in the VMFS-2 days this was certainly the case, but significant enhancements have been made to VMFS extents over the years that allow a datastore to stay online even if one of its extents is offline. See this posting I made on such enhancements. We don’t yet have this surfaced as an alarm in vCenter, but it is definitely something we are looking at exposing at the vCenter layer in a future release.

However, if the head extent (the first member) has a failure, it can bring the whole datastore offline. An offline head extent is pretty much always going to cause failures, because many of the address resolution resources reside on the head extent. Additionally, if a non-head extent goes down, you won’t be able to access the VMs whose virtual disks have at least one block on that extent.
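
That impact can be summarised with a small conceptual model. The extent list and per-VM block placement below are invented example data, purely to show the logic described above:

```python
# Conceptual model of the failure behaviour described above. The extent
# list and per-VM block placement are invented example data; VMFS does
# not expose a table like this directly.

datastore_extents = ["naa.60060160a111", "naa.60060160a112", "naa.60060160a113"]
# Index 0 is the head extent, which holds the address resolution resources.

# Extents on which each VM's virtual disks have at least one block.
vm_block_placement = {
    "vm01": {0},        # entirely on the head extent
    "vm02": {1},        # entirely on the second extent
    "vm03": {1, 2},     # spans the second and third extents
}

def vms_affected_by_losing(extent_index):
    if extent_index == 0:
        # Losing the head extent takes the whole datastore down.
        return sorted(vm_block_placement)
    return sorted(vm for vm, extents in vm_block_placement.items()
                  if extent_index in extents)

print(vms_affected_by_losing(0))   # ['vm01', 'vm02', 'vm03']
print(vms_affected_by_losing(2))   # ['vm03']
```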

But is this really any more problematic than having an issue with the LUN which backs a single-extent VMFS volume? For the most part, no. It’s only when the head extent has an issue that there is more of an impact.

 

Misconception #3 – It’s easy to mistakenly overwrite extents in vCenter

I’ve heard this still being brought up as an issue. Basically, the scenario described is one where vCenter shows LUNs which are already in use as extents of a VMFS datastore as free, and lets you initialize them when you do an Add Storage task.

If memory serves, the issue described here could be as old as Virtual Center 1.x (this was in the days before we started calling it vCenter). I’m pretty sure that it was resolved in version 2.x, and it is definitely not an issue with the vCenter 4.x and 5.x releases. I think it occurred when you built an extent on one host, and then flipped to a view from another ESXi host which didn’t know that the LUN was now in use. These days, any change made to a datastore, such as adding a LUN as an extent, updates all the inventory objects so that the LUN is removed from the pool of available disks. Coupled with the fact that we now have a cluster-wide rescan option for storage, there should no longer be any concerns around this.
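
For completeness, the cluster-wide rescan can also be driven through the vSphere API. Here is a minimal pyVmomi sketch (the vCenter address, credentials and cluster name are placeholders) that rescans the HBAs and VMFS volumes on every host in a cluster, so that all hosts see a newly added extent consistently:

```python
# Minimal pyVmomi sketch. The vCenter address, credentials and cluster
# name are placeholders; adjust for your environment.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab use only
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster01")
    for host in cluster.host:
        storage = host.configManager.storageSystem
        storage.RescanAllHba()     # pick up newly presented devices
        storage.RescanVmfs()       # pick up new or updated VMFS volumes
        print("Rescanned storage on", host.name)
finally:
    Disconnect(si)
```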

Obviously, if you decide to work outside of vCenter and go directly to the ESXi hosts, you could still run into this issue. But you wouldn’t do that, would you? 😉

Misconception #4 – You get better performance from extents

This is an interesting one. It basically suggests that using extents will give you better performance, because you have an aggregate of the queue depth from all of the extents/LUNs in the datastore. There is some merit to this. You could indeed make use of the per-device queue depth to get an aggregate queue depth across all extents. But this is only relevant if a larger queue depth will improve performance, which may not always be the case. I also think that, to benefit from the aggregate queue depth, each of the extents/LUNs that make up the volume may have to be placed on a different path, or you may need to implement Round Robin, which not every storage array supports. So this doesn’t just work out of the box; some configuration is necessary.

My thoughts on this are that if you are using a single-extent datastore, and you think a larger queue depth will give you more performance, then you can simply edit the per-device queue depth and bump it up from the default value of 32 to, say, 64. Of course, you should do your research in advance to see if this will indeed improve your performance. And keep in mind that the maximum queue depth of your HBA and the number of paths to your device need to be taken into account before making any of these changes.
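
To put rough numbers on that trade-off, here is a back-of-the-envelope calculation in Python. The figures (the default per-device queue depth of 32, an assumed HBA-level ceiling, and the extent counts) are purely illustrative; your HBA, driver and array documentation dictate the real limits:

```python
# Back-of-the-envelope comparison of the two approaches discussed above.
# All of these numbers are illustrative; check your HBA, driver and
# array limits before changing anything.

DEFAULT_DEVICE_QDEPTH = 32
HBA_MAX_QDEPTH = 256        # assumed adapter-level ceiling for this example

def effective_qdepth(extents, per_device_qdepth):
    """Aggregate outstanding I/Os, capped by the assumed HBA ceiling."""
    return min(extents * per_device_qdepth, HBA_MAX_QDEPTH)

# Option 1: a 4-extent datastore at the default per-device queue depth,
# assuming the I/O actually spreads across all four extents/paths.
print(effective_qdepth(extents=4, per_device_qdepth=DEFAULT_DEVICE_QDEPTH))  # 128

# Option 2: a single-extent datastore with the per-device depth raised to 64.
print(effective_qdepth(extents=1, per_device_qdepth=64))                     # 64

# More outstanding I/Os only helps if the array can service them; otherwise
# they simply queue up and add latency.
```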

VMware’s Best Practice/Recommendation for Extents

I discussed this with our engineering and support folks, and in all honesty, considering the management overhead of extents, the new single-extent VMFS-5 volume size, the ability to grow datastores online with the volume grow facility, and the ability to tune the per-device queue depth, the recommendation would be to avoid the use of extents unless you absolutely have to use them. In fact, the only cases I can see where you might need extents are:

  1. You are still on VMFS-3 and need a datastore larger than 2TB.
  2. You have storage devices which cannot be grown at the back-end, but you need a datastore larger than 2TB.

There is nothing inherently wrong with extents, but the complexity involved in managing them has given them a bad name. Consider a 32-host cluster which shares a VMFS volume made up of 32 extents/LUNs. This volume has to be presented to each host, and it becomes quite a challenge to ensure that each host sees every LUN in exactly the same way. Now bring something like SRM (Site Recovery Manager) into the picture: if you want to fail over successfully to a remote site, all of these LUNs need to be replicated correctly, and in the event of a failover, they may need to be resignatured and mounted on all the hosts at the DR site. This becomes a formidable task, and it is primarily because of this complexity that I make the recommendation. VMFS-5 provides better management capabilities by allowing for these larger LUN sizes, which makes a significant amount of the storage administration overhead go away.

Get notification of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage