
VMFS Locking Uncovered

Posted by Cormac Hogan
Technical Marketing Architect (Storage)

In one of my very first blog posts last year about the new improvements made to VMFS-5 in vSphere 5.0, one of the enhancements I called out was related to the VAAI primitive ATS (Atomic Test & Set). In the post, I stated that the 'Hardware Acceleration primitive, Atomic Test & Set (ATS), is now used throughout VMFS-5 for file locking.' This recently led to an obvious, but really good question (thank you Cody) – What are the operations that ATS now does in vSphere 5.0/VMFS-5 that it didn’t do in vSphere 4.1/VMFS-3?

Well, before we delve into that, I thought it might be a good idea to provide a general overview of VMFS locking.

Heartbeats

VMFS is a distributed journaling filesystem. All distributed file systems need to synchronize operations between multiple hosts, as well as indicate the liveness of each host. In VMFS, this is handled through the use of an ‘on-disk heartbeat structure’. The heartbeat structure maintains the lock states as well as the pointer to the journal information.

In order to deal with possible crashes of hosts, the distributed locks are implemented as lease-based.  A host that holds a lock must renew a lease on the lock (by changing a "pulse field" in the on-disk lock data structure) to indicate that it still holds the lock and has not crashed. Another host can break the lock if the lock has not been renewed by the current holder for a certain period of time.

When another ESXi host wants to access a file, it checks whether that pulse field (effectively a timestamp) has been updated. If it has not, the host can take over ownership of the file by removing the stale lock, placing its own lock on the file, and generating a new timestamp.
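To make the lease idea concrete, here is a minimal sketch in Python of how a lease-based lock of this kind behaves. All of the names (OnDiskLock, pulse, lease_timeout) are illustrative; this is not the actual VMFS on-disk format, and the real timeout value differs.

import time
from dataclasses import dataclass

@dataclass
class OnDiskLock:
    owner: str      # host that currently holds the lock
    pulse: int = 0  # the "pulse field" the holder changes to renew its lease

def renew_lease(lock: OnDiskLock) -> None:
    # The lock holder periodically changes the pulse field to show it is still alive.
    lock.pulse += 1

def try_break_stale_lock(lock: OnDiskLock, my_host: str, lease_timeout: float) -> bool:
    # Another host samples the pulse, waits out the lease period, and re-reads it.
    observed = lock.pulse
    time.sleep(lease_timeout)
    if lock.pulse == observed:
        # The pulse was never renewed, so the holder has presumably crashed:
        # remove the stale lock, take ownership, and start a fresh lease.
        lock.owner = my_host
        lock.pulse += 1
        return True
    return False  # the holder is alive; back off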

On-disk Metadata Locks

VMFS has on-disk metadata locks which are used to synchronize metadata updates. Metadata updates are required when creating, deleting or changing file metadata, in particular file length. Earlier versions of VMFS (built on LUNs from non-VAAI arrays) use SCSI reservations to acquire the on-disk metadata locks. It is important to note that the SCSI reservations are not in place to do the actual metadata update. They are used to get the on-disk lock only, and once the lock has been obtained, the SCSI reservation is released. Therefore, to address a common misconception, the LUN is not locked with a SCSI reservation for the duration of metadata updates; it is only reserved for as long as it takes to get the on-disk lock.

Acquiring on-disk locks is a very short operation compared to the latency of metadata updates. However, you should not infer from this that metadata updates themselves are long; it is simply that metadata updates typically take longer than the time taken to acquire the lock.
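A rough Python sketch of the pre-VAAI sequence (the Lun and OnDiskLock classes and the helper names here are purely illustrative, not the real VMkernel code paths) shows that the reservation brackets only the lock acquisition, not the metadata update itself:

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Lun:
    reserved_by: Optional[str] = None

    def scsi_reserve(self, host: str) -> None:
        # The whole LUN is briefly reserved by this host.
        assert self.reserved_by is None, "LUN already reserved"
        self.reserved_by = host

    def scsi_release(self) -> None:
        self.reserved_by = None

@dataclass
class OnDiskLock:
    owner: Optional[str] = None

def update_file_metadata(lun: Lun, lock: OnDiskLock, host: str,
                         apply_update: Callable[[], None]) -> None:
    lun.scsi_reserve(host)
    try:
        lock.owner = host      # read/modify/write of the on-disk lock record
    finally:
        lun.scsi_release()     # reservation dropped as soon as the lock is held
    # The (comparatively longer) metadata update runs with no reservation in place,
    # so other hosts can keep issuing I/O to the LUN in the meantime.
    apply_update()
    lock.owner = None          # release the on-disk lock when done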

Optimistic Locking

The VMFS version released with ESX 3.5 introduced an updated distributed locking mechanism called 'optimistic locking'. Basically, the actual acquisition of on-disk locks (involving SCSI reservations) is postponed as late as possible in the life cycle of a VMFS metadata transaction. Optimistic locking allows the number and duration of SCSI reservations to be reduced. This in turn reduces the impact of SCSI reservations on Virtual Machine I/O and other VMFS metadata I/O originating from other ESX hosts that share the volume.

When locks are acquired in non-optimistic mode, one SCSI reservation is used for each lock that we want to acquire.

In optimistic mode, we use one SCSI reservation for the entire set of locks required by a particular journal transaction. In other words, optimistic locking uses one SCSI reservation per transaction, as opposed to one SCSI reservation per lock.

We don't commit the transaction to the on-disk journal unless we are able to upgrade all optimistic locks to physical locks (using a single SCSI reservation). If we are unable to do this, we roll back the transaction's in-memory changes (no on-disk changes will have been made) and simply retry the transaction. In this situation you may see the message: Optimistic Lock Acquired By Another Host. This means that a lock which was held optimistically (not yet acquired on-disk) during the transaction was found to have been acquired on-disk by a different host. But, as the name suggests, we are optimistic that this won't occur very often, and that in the vast majority of cases optimistic locks will be upgraded to physical locks without any contention.
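The flow can be sketched in Python roughly as follows. The exception name and the callables passed in (upgrade_all, commit_journal, rollback_in_memory) are hypothetical placeholders for the real VMkernel steps:

from typing import Callable, List

class OptimisticLockContention(Exception):
    # Models the "Optimistic Lock Acquired By Another Host" condition.
    pass

def run_transaction(locks_needed: List[str],
                    upgrade_all: Callable[[List[str]], None],
                    commit_journal: Callable[[], None],
                    rollback_in_memory: Callable[[], None]) -> None:
    # The locks are tracked optimistically (in memory only) while the transaction
    # builds up its metadata changes; no SCSI reservation has been issued yet.
    while True:
        try:
            # One SCSI reservation covers the upgrade of the whole set of locks
            # for this transaction, instead of one reservation per lock.
            upgrade_all(locks_needed)   # optimistic -> physical (on-disk) locks
            commit_journal()            # only now is the on-disk journal written
            return
        except OptimisticLockContention:
            # A lock we held optimistically was acquired on-disk by another host:
            # undo the in-memory changes (nothing was written on-disk) and retry.
            rollback_in_memory()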

VAAI ATS (Atomic Test & Set)

In VMFS, as we have seen above, many operations need to establish a critical section on the volume when updating a resource, in particular an on-disk lock or a heartbeat. The operations that require this critical section can be listed as follows:

1. Acquire on-disk locks
2. Upgrade an optimistic lock to an exclusive/physical lock
3. Unlock a Read Only/Multi-Writer lock
4. Acquire a heartbeat
5. Clear a heartbeat
6. Replay a heartbeat
7. Reclaim a heartbeat
8. Acquire an on-disk lock with a dead owner

This critical section can either be established using SCSI reservation or using ATS on a VAAI-enabled array. In vSphere 4.0, VMFS-3 used SCSI reservations for establishing this critical section as there was no VAAI support in that release. In vSphere 4.1, on a VAAI-enabled array, VMFS-3 used ATS only for operations (1) and (2) above, and ONLY when disk lock acquisitions were un-contended. VMFS-3 fell back to using SCSI reservations if there was a mid-air collision when acquiring an on-disk lock using ATS.

For VMFS-5 datastores formatted on a VAAI-enabled array (i.e. as ATS-only), all of the critical section functionality from (1) to (8) is done using ATS. We should no longer see any SCSI reservations on a VAAI-enabled VMFS-5 datastore. Even if there is contention, ATS continues to be used.

On non-VAAI arrays, SCSI reservations continue to be used for establishing critical sections in VMFS-5.
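At its heart, ATS is a compare-and-write operation against the disk block holding the lock: the array commits the write only if the block still contains the value the host last read. The following Python sketch contrasts the two behaviours described above; all names are illustrative, and the real implementation lives in the VMkernel and issues the SCSI COMPARE AND WRITE command rather than calling these hypothetical helpers:

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class OnDiskLock:
    owner: Optional[str] = None

def ats_compare_and_write(lock: OnDiskLock, expected: Optional[str], new: Optional[str]) -> bool:
    # Models an atomic compare-and-write: the update happens only if the on-disk
    # contents still match what we expect; otherwise it is a "miscompare".
    if lock.owner == expected:
        lock.owner = new
        return True
    return False

def acquire_lock_vmfs3_on_41(lock: OnDiskLock, my_host: str,
                             scsi_reservation_path: Callable[[OnDiskLock, str], None]) -> None:
    # vSphere 4.1 / VMFS-3 behaviour: ATS is tried for the un-contended case only;
    # a mid-air collision (miscompare) falls back to the SCSI reservation path.
    if not ats_compare_and_write(lock, expected=None, new=my_host):
        scsi_reservation_path(lock, my_host)

def acquire_lock_vmfs5(lock: OnDiskLock, my_host: str) -> None:
    # vSphere 5.0 / VMFS-5 (ATS-only) behaviour: even under contention we simply
    # retry the ATS operation; there is no fall-back to SCSI reservations.
    while not ats_compare_and_write(lock, expected=None, new=my_host):
        pass  # another host won the race; retry (back-off omitted for brevity)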

 

I hope this gives some clarity to the original statement about VMFS-5 in vSphere 5.0 now being fully ATS aware, and also gives you some idea of the types of locking used in various versions of VMFS.

Get notification of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage