Posted by Cormac Hogan
Technical Marketing Architect (Storage)
In one of my very first blog posts last year, around the new improvements made to VMFS-5 in vSphere 5.0, one of the enhancements I called out was related to the VAAI primitive ATS (Atomic Test & Set). In the post, I stated that the 'Hardware Acceleration primitive, Atomic Test & Set (ATS), is now used throughout VMFS-5 for file locking.' This recently led to an obvious, but really good, question (thank you Cody): what are the operations that ATS now does in vSphere 5.0/VMFS-5 that it didn't do in vSphere 4.1/VMFS-3?
Well, before we delve into that, I thought it might be a good idea to provide a general overview of VMFS locking.
Heartbeats
VMFS is a distributed journaling filesystem. All distributed file systems need to synchronize operations between multiple hosts as well as indicate the liveness of a host. In VMFS, this is handled through the use of an 'on-disk heartbeat structure'. The heartbeat structure maintains the lock states as well as the pointer to the journal information.
In order to deal with possible host crashes, the distributed locks are lease-based. A host that holds a lock must renew a lease on that lock (by updating a 'pulse field' in the on-disk lock data structure) to indicate that it still holds the lock and has not crashed. Another host can break the lock if the lease has not been renewed by the current holder for a certain period of time.
When another ESXi host wants to access a file, it checks whether that pulse field (effectively a timestamp) has been updated. If it has not, the host can take over ownership of the file by removing the stale lock, placing its own lock on the file, and generating a new timestamp.
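To make the lease mechanism concrete, here is a minimal sketch of the renew-or-break logic in Python. All names, and the timeout value, are illustrative assumptions rather than VMFS internals:

import time

LEASE_TIMEOUT = 16  # seconds a lock may go un-renewed before it is treated as stale (illustrative value)

class OnDiskLock:
    # Simplified stand-in for a VMFS on-disk lock record.
    def __init__(self, owner, pulse):
        self.owner = owner  # host that currently holds the lock
        self.pulse = pulse  # last value written to the heartbeat 'pulse field'

def renew_lease(lock):
    # The lock holder periodically bumps the pulse field to prove it is alive.
    lock.pulse = time.time()

def try_break_stale_lock(lock, my_host_id):
    # Another host samples the pulse field twice, a lease-timeout apart.
    first = lock.pulse
    time.sleep(LEASE_TIMEOUT)
    if lock.pulse != first:
        return False  # holder renewed its lease; the lock stays put
    # The pulse never moved: remove the stale lock, take ownership,
    # and generate a new timestamp.
    lock.owner = my_host_id
    lock.pulse = time.time()
    return True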
On-disk Metadata Locks
VMFS has on-disk metadata locks which are used to synchronize metadata updates. Metadata updates are required when creating, deleting or changing file metadata, in particular file length. Earlier versions of VMFS (built on LUNs from non-VAAI arrays) use SCSI reservations to acquire the on-disk metadata locks. It is important to note that the SCSI reservation is not in place for the actual metadata update. It is used to get the on-disk lock only, and once the lock has been obtained, the SCSI reservation is released. Therefore, to address a common misconception, the LUN is not locked with a SCSI reservation for the duration of a metadata update; it is only reserved to get the on-disk lock.
Acquiring on-disk locks is a very short operation compared to the latency of metadata updates. However, you should not infer from this that metadata updates themselves are long; it is simply that metadata updates are typically longer than the time taken to acquire the lock.
Optimistic Locking
The VMFS version released with ESX 3.5 introduced an updated distributed locking mechanism called 'optimistic locking'. Basically, the actual acquisition of on-disk locks (involving SCSI reservations) is postponed until as late as possible in the life cycle of a VMFS metadata transaction. Optimistic locking allows the number and duration of SCSI reservations to be reduced. This in turn reduces the impact of SCSI reservations on Virtual Machine I/O and other VMFS metadata I/O originating from other ESX hosts that share the volume.
When locks are acquired in non-optimistic mode, one SCSI reservation is used for each lock that we want to acquire.
In optimistic mode, a single SCSI reservation covers the entire set of locks required by a particular journal transaction: one reservation per transaction rather than one per lock.
We don't commit the transaction to the on-disk journal unless we are able to upgrade all optimistic locks to physical locks (using a single SCSI reservation). If we are unable to do so, we roll back the transaction's in-memory changes (no on-disk changes will have been made) and simply retry the transaction. In that case you may see the message 'Optimistic Lock Acquired By Another Host', which means that a lock held optimistically (not yet acquired on-disk) during a transaction was found to have been acquired on-disk by a different host. But, as the name suggests, we are optimistic that this won't occur very often, and that in the vast majority of cases, optimistic locks will be upgraded to physical locks without any contention.
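As a rough sketch of that flow (hypothetical names, not actual VMFS code), the transaction logic looks something like this in Python:

def run_transaction(txn, locks_needed, lun):
    # Sketch of the optimistic-locking flow described above.
    while True:
        txn.prepare_in_memory()  # stage metadata changes; nothing written to disk yet
        lun.scsi_reserve()       # one reservation for the whole set of locks
        try:
            upgraded = all(lock.upgrade_to_physical() for lock in locks_needed)
        finally:
            lun.scsi_release()   # the reservation is held only to acquire the locks
        if upgraded:
            txn.commit_to_journal()  # now safe to write the on-disk journal
            return
        # 'Optimistic Lock Acquired By Another Host': roll back the in-memory
        # changes (no on-disk changes were made) and retry the transaction.
        txn.rollback_in_memory()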
VAAI ATS (Atomic Test & Set)
In VMFS, as we have seen above, many operations need to establish a critical section on the volume when updating a resource, in particular an on-disk lock or a heartbeat. The operations that require this critical section can be listed as follows:
1. Acquire an on-disk lock.
2. Upgrade an optimistic lock to an exclusive/physical lock.
3. Unlock a Read Only/Multi-Writer lock.
4. Acquire a heartbeat.
5. Clear a heartbeat.
6. Replay a heartbeat.
7. Reclaim a heartbeat.
8. Acquire an on-disk lock with a dead owner.
This critical section can be established either using a SCSI reservation or using ATS on a VAAI-enabled array. In vSphere 4.0, VMFS-3 used SCSI reservations for establishing this critical section as there was no VAAI support in that release. In vSphere 4.1, on a VAAI-enabled array, VMFS-3 used ATS only for operations (1) and (2) above, and ONLY when disk lock acquisitions were uncontended. VMFS-3 fell back to using SCSI reservations if there was a mid-air collision when acquiring an on-disk lock using ATS.
For VMFS-5 datastores formatted on a VAAI-enabled array (i.e. as ATS-only), all of the critical section functionality from (1) to (8) is done using ATS. We should no longer see any SCSI reservations on a VAAI-enabled VMFS-5 volume. Even if there is contention, ATS continues to be used.
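The behavioural difference under contention can be sketched as follows (illustrative Python; lun.read() and lun.ats() are assumed helper names standing in for the device read and the array-side atomic update, not real APIs):

def enter_critical_section(lun, lock_sector, my_lock_state, ats_only):
    # Establish the critical section for one of operations (1) to (8).
    while True:
        old = lun.read(lock_sector)            # current on-disk lock state
        if lun.ats(lock_sector, old, my_lock_state):
            return 'ats'                       # atomic update won the race
        # Mid-air collision: another host changed the sector first.
        if not ats_only:
            return 'scsi-reservation'          # VMFS-3 on vSphere 4.1 falls back
        # VMFS-5 ATS-only: simply re-read and retry the ATS; no reservation.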
On non-VAAI arrays, SCSI reservations continue to be used for establishing critical sections in VMFS-5.
I hope this gives some clarity to the original statement about VMFS-5 in vSphere 5.0 now being fully ATS aware, and also gives you some idea of the types of locking used in various versions of VMFS.
Get notification of these blogs postings and more VMware Storage information by following me on Twitter: @VMwareStorage
Rob
Good post, Cormac.
I have a question about datastores that are upgraded in place as opposed to newly formatted. Do datastores that are upgraded assume all of the benefits (1-8) above, or only a subset?
Thanks,
Rob
Chogan
Great question Rob.
VMFS-3 volumes that are upgraded in-place to VMFS-5 are also fully ATS aware, and can use ATS for all 8 operations described above.
There is a complete description of VMFS-5 upgrade considerations/features discussed in this whitepaper (shameless plug) – http://www.vmware.com/resources/techresources/10242 🙂
Cormac
Mostafa Khalil
I think the reference was to the “ATS only” flag on VMFS-5 datastores.
It is enabled on freshly created datastores but not on upgraded ones. Note that it is not simply enabled out of the box: when the host detects that the array supports ATS on the device, the ATS Only flag is written to the datastore. From that point on, ATS will always be used on that datastore.
To manually enable it, you may use the hidden option:
vmkfstools --configATSOnly 1 [device path]
e.g.
vmkfstools --configATSOnly 1 /vmfs/devices/disks/[naa-id]:[partition-number]
or
vmkfstools --configATSOnly 1 /dev/disks/[naa-id]:[partition-number]
However, if for whatever reason the storage array does not support ATS and you enable this flag manually, the datastore will not mount, which is why it is not enabled by default.
If you need to disable the flag repeat the vmkfstools command using the value “0” instead of “1”.
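As an aside, if I recall correctly you can verify whether the flag is set on a volume with:
vmkfstools -Ph -v1 /vmfs/volumes/[datastore-name]
and look for 'ATS-only' in the reported mode.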
Mikael
Hi Cormac, great post and thanks for this explanation. I am working with Cody on this wonderful question, and I wonder what would occur in terms of contention when you have a lot of Storage vMotion activity between two LUNs in the VMFS-3 (ESX 4.1) scenario. Let's suppose 4 to 8 Storage vMotions taking place at the same time on the same volume; what is the impact (fall back to SCSI-2 commands)? You state that only operations (1) and (2) can use ATS in this case; what about the other operations (3) to (8) in a VMFS-3 (ESX 4.1) scenario? Do they adversely affect operations (1) and (2)? Thanks
Mikael
Scott Langer
Awesome post, thanks. Can you tell me: if an array is “VAAI enabled”, will it automatically have ATS, or are some arrays “VAAI enabled” but without the full feature set?
I guess what I'm wondering is: must I ask my storage vendor, “does your array support ATS specifically?”
thanks.
Chogan
Hi Scott,
For newly created VMFS-5, ATS (if supported by the array) will be enabled by default.
For upgraded VMFS-5, please see the comment from Mostafa above.
To check if the storage array supports ATS, or indeed any of the VAAI primitives, you can use the following command:
esxcli storage core device vaai status get -d naa.60a98000572d54724a346a6170627a52
naa.60a98000572d54724a346a6170627a52
   VAAI Plugin Name: VMW_VAAIP_NETAPP
   ATS Status: supported
   Clone Status: supported
   Zero Status: supported
   Delete Status: supported
Chogan
Hi Mikael,
Thanks for commenting. You and Cody are doing some great work – thanks.
To your specific question, indeed contention could occur with ATS & SCSI reservations, but it is handled in the ATS primitive implementation.
One of the requirements in the design of ATS was to make it compatible with ESX hosts that use the legacy SCSI reservation based VMFS-3 lock manager.
The ATS primitive behaves like a regular read/write CDB on the wire and fails with a reservation conflict if another host has the LUN reserved using SCSI-2 or SCSI-3 reservations.
Nick
Great post and clears up a lot of the mysteries associated with SCSI reservations and LUN locking.
In the past (i.e. ESX 3.x and 4.0), we typically sized LUNs with block-based storage (FCP, iSCSI) using a very popular rule of thumb, and that was to limit the # of VMs on each LUN to about 20. So if each VM was 30 GB in size and generated moderate I/O, we might use ~600 GB LUNs for optimal performance. But with VAAI-capable arrays and vSphere 4.1/5.0, we can probably do better than that. My question is how many VMs per LUN now? Should we “cap” the # of VMs per LUN at 50 or so now with VAAI/ATS? Or is it 100? What do you think the new rule of thumb should be?
While I know it depends, I’d really like a range…similar to how we’ve said “20-30 VMs per LUN with mixed/moderate I/O” before. Thanks in advance and keep the articles coming.
-Nick
Chogan
Thanks for the nice comments Nick.
Certainly SCSI reservations were a limiting factor, and the complete ATS implementation in VMFS-5 should remove this as a consideration when it comes to sizing VM density per volume.
But there are too many variables for me to come up with a rule of thumb for the number of VMs per datastore. What I will say is that if you were getting 20 VMs per datastore with SCSI reservations, and you now have an ATS-capable datastore, then you should be able to increase the number of VMs.
But of course, the IOPS capability of the datastore, the latency and IOPS requirements of the applications in the datastore, and the sort of applications running in the VMs should also be considered.
Chad
Hi Cormac,
What is the granularity of ATS locks within VMFS-5, and how is space divided up between A) the various ESX servers which have mounted the same VMFS-5 datastore, and B) various provisioning operations occurring simultaneously on the same VMFS-5 datastore from within the same ESX server? How many contiguous blocks does each lock guarantee? Further, how many locks can be granted at once?
Chad
Chogan
Hi Chad,
ATS locks are a mechanism to atomically modify a disk sector which, when successful, allows an ESXi host to do a metadata update on a VMFS. This includes allocating space to a VMDK during provisioning, as certain characteristics would need to be updated in the metadata to reflect the new size of the file.
When it comes to space allocation, the last time I looked into this (VMFS v3.31), we allocate 200 file block resources with each lock. If we take a 1MB file block, a cluster contains 64 file blocks, so we get 200 * 64MB (12,800MB) each time we grow a file on a VMFS. This may have changed in 5.0, but I haven't heard about it if it did.
Some further information about the layout of VMDKs on VMFS can be found in this blog post – http://blogs.vmware.com/vsphere/2012/02/vmfs-extents-are-they-bad-or-simply-misunderstood.html.
I do not believe that we have a limit on the number of locks that can be granted to an ESXi host, or if we do, I suspect that it is high enough to prevent us reaching it during provisioning.
HTH
Cormac
Chad
What I'm curious about is this: is Atomic Test & Set the same as COMPARE AND WRITE, which is a target-side operation? The SCSI COMPARE AND WRITE (CAW) command provides a means to write data without imposing the overhead of a SCSI reservation (a LUN-level lock).
Chogan
Yes – it is the same thing. ATS uses COMPARE & WRITE.
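For anyone curious, a rough Python model of the COMPARE AND WRITE semantics might look like this (illustrative only, not SCSI target code; the latch stands in for the atomicity the array provides internally):

import threading

class ToyLun:
    # Toy target-side model of SCSI COMPARE AND WRITE.
    def __init__(self, nblocks):
        self.blocks = [bytes(512)] * nblocks
        self._latch = threading.Lock()  # models the target's internal atomicity

    def compare_and_write(self, lba, expected, new):
        # The compare and the write happen as one indivisible target-side
        # step, so no LUN-wide SCSI reservation is needed.
        with self._latch:
            if self.blocks[lba] != expected:
                return False  # miscompare: the initiator re-reads and retries
            self.blocks[lba] = new
            return True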
Jackei
ATS uses COMPARE & WRITE. Is the WRITE operation an atomic write?
Andy
I was just re-reading the VMFS-5_Upgrade_Considerations.pdf and a line on page 4 under Small File Support caught my eye. “VMFS-5 introduces support for very small files. For files less than or equal to 1KB, VMFS-5 uses the file descriptor location in the metadata for storage rather than file blocks.”
Now I understand that VMFS-5 tries to use ATS natively, but if it can't it will fall back to SCSI-2 reservations.
If we consider that a metadata update requires a lock, and if your array does not support ATS, then any small file update will require a SCSI-2 reservation, potentially impacting scalability and performance.
I don't think I have ever paid attention to the quantity of small files, but now with datastore heartbeats and the plethora of other small files, how much of a concern is this?
Chogan
Hi Andy,
The ‘small file’ mechanism that was introduced in VMFS-5 is a space saving technique. Rather than consuming disk blocks, the information is stored within the metadata.
So from a locking perspective, the act of creating, modifying or deleting a small file on VMFS-5 hasn't changed. If your array supports ATS, then yes, this locking procedure will be more efficient. As you state, however, if your array does not support VAAI, then SCSI reservations will still have to be used to lock the LUN while the host places an exclusive lock on the file.
But small file support on VMFS-5 should not introduce any additional overheads or latency.
Andy
Thanks Cormac
Jack
Cormac, I have a question related to concurrent ATS requests and read/write requests to the same blocks: does VMFS expect a concurrent ATS and another read/write request to be mutually exclusive, i.e. must they be executed by the array in strict order? I know two ATS requests on the same block range would be serialized, but I am curious what VMFS expects for ATS versus other reads/writes.
Clint Beilman
Hi Cormac,
We just got a Violin 6000 series array, which does not currently support ATS, and I'm trying to figure out how to size the LUNs. I'm currently on vSphere 5.0. Back before ATS, I kept fewer than 20 VMs on a LUN, but that was with VMFS-3. Should I follow the same guideline with VMFS-5?
Thanks,
Clint
Cormac
This paper will help with that decision – http://www.vmware.com/resources/techresources/996
Neelima Bandla
Hi Cormac,
It looks like our VMs go offline when there are latency issues acquiring locks (storage array); it looks like ESX has a 15-second watchdog timeout for determining whether a lock is lost or not. Is the 15-second timeout for the VAAI ATS command? Can you please confirm if that is the timeout for the ATS command? Also, can this be changed from the default?
thanks,
Neelima
Ramya Victor
What is the exact difference between the in-memory, on-disk and optimistic locks?
Nate Hudson
If you need to disable VMFS heartbeats being controlled by VAAI in ESXi 5.5 U2, you can do it the following way: http://nrhudson.blogspot.com/2015/04/disabling-vaai-vmfs-heartbeats-in.html
As of ESXi 5.5 U2, VAAI now controls the VMFS clustering heartbeat; if you need to revert to SCSI as the default heartbeat control, as it was previously in ESXi 5.5 U1, you can do it via esxcli advanced settings.
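If memory serves, the relevant advanced option is /VMFS3/UseATSForHBOnVMFS5, e.g.:
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
but treat that as my recollection and verify it against the linked post before relying on it.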
Jackei
Hi Cormac,
Is ATS only supported on storage arrays? Are PCIe SSD devices not supported?
Wee
Thanks for the superb information!
If we were to disable hardware accelerated locking (ATS) on our hosts running ESXi 5.5 U2, would these hosts still be able to access volumes set up as ATS-only?
Big thanks in advance.