A frequent conversation I have with customers is about how vSAN snapshots differ from standard VMFS snapshots. In this post, I want to identify the architectural differences and explain how they improve snapshot performance on vSAN compared to previous virtual machine snapshot implementations.
vmfsSparse, commonly referred to as the redo log format, is the original snapshot format used by VMware. It is the format used on VMFS, NFS (without VAAI-NAS) and vSAN 5.5.
When a snapshot is taken of a base disk using the redo log format, a child delta disk is created. The parent is then considered a point-in-time (PIT) copy. The running point of the virtual machine is now the delta. New writes by the virtual machine go to the delta, while reads of blocks that have not been rewritten since the snapshot are satisfied by the base disk or other snapshots in the chain.
One major concern with vmfsSparse/redo log snapshots is that they can negatively affect the performance of a virtual machine. The degradation depends on how long the snapshot or snapshot tree has been in place, the depth of the tree, and how much the virtual machine and its guest operating system have changed since the snapshot was taken. A consolidate operation (deleting or converting snapshots) can be very time-consuming, depending on the number of snapshot deltas and how many changes need to be merged back into the base VMDK.
You might also see a delay in the time it takes the virtual machine to power on. This is why VMware does not recommend running production virtual machines on redo log snapshots on a permanent basis, and why VMware KB article KB1025279 recommends no more than three vmfsSparse snapshots and retaining snapshots no longer than 72 hours.
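The two limits quoted from the KB article are easy to check in a script. The following is a minimal sketch only: the function name, the constants, and the idea of passing in a list of snapshot creation times are assumptions made for illustration, and it does not call any VMware API.

```python
from datetime import datetime, timedelta

# Limits taken from the guidance quoted above; names are hypothetical.
MAX_SNAPSHOTS = 3
MAX_AGE = timedelta(hours=72)


def check_snapshots(created_times, now):
    """Return a list of warning strings for one VM's snapshot chain.

    created_times: datetimes at which each snapshot was taken.
    now: the datetime to measure snapshot age against.
    """
    warnings = []
    if len(created_times) > MAX_SNAPSHOTS:
        warnings.append(
            f"{len(created_times)} snapshots exceed the limit of {MAX_SNAPSHOTS}"
        )
    for created in created_times:
        age = now - created
        if age > MAX_AGE:
            hours = age.total_seconds() / 3600
            warnings.append(f"snapshot taken {hours:.0f} hours ago is older than 72 hours")
    return warnings


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    snapshots = [now - timedelta(hours=h) for h in (2, 30, 50, 100)]
    for line in check_snapshots(snapshots, now):
        print(line)
```

In practice the creation times would come from whatever inventory tooling you already use; the sketch only shows the comparison logic.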
With vSAN, when a virtual machine snapshot is created, vsanSparse delta objects are created instead of vmfsSparse/redo log objects. The goal of the vsanSparse snapshot format is to improve snapshot performance by continuing to use the existing redo log mechanism while adding an “in-memory” metadata cache and a more efficient sparse filesystem layout.
How vsanSparse works
With vSAN, VMs are made up of objects. A delta disk (snapshot) object is made up of a set of grains, where each grain is a block of sectors containing virtual disk data. A VMDK object backs each delta. The deltas keep only changed grains, so they are space-efficient.
In the diagram below, the Base disk object is called Disk.vmdk and is at the bottom of the chain. There are three snapshot objects (Disk-001.vmdk, Disk-002.vmdk and Disk-003.vmdk) taken at various intervals and guest OS writes are also occurring at various intervals, leading to changes in snapshot deltas.
- Base object – writes to grain 1,2,3 & 6
- Delta object Disk-001 – writes to grain 1 & 4
- Delta object Disk-002 – writes to grain 2 & 4
- Delta object Disk-003 – writes to grain 1 & 6
A read by the VM will now return the following:
- Grain 1 – retrieved from Delta object Disk-003
- Grain 2 – retrieved from Delta object Disk-002
- Grain 3 – retrieved from Base object
- Grain 4 – retrieved from Delta object Disk-002
- Grain 5 – retrieved from Base object (zeros are returned because it was never written)
- Grain 6 – retrieved from Delta object Disk-003
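The lookup rule behind the list above is that the newest layer containing a grain wins. Here is a toy model of that rule using the same chain; the dictionary layout, the `resolve` function, and the placeholder data strings are illustrative assumptions, not vSAN internals.

```python
ZERO = 0  # value returned for a grain that was never written

# Oldest layer first, running point last. Each layer maps
# grain number -> data written while that layer was the running point.
chain = [
    ("Base",     {1: "base-1", 2: "base-2", 3: "base-3", 6: "base-6"}),
    ("Disk-001", {1: "001-1", 4: "001-4"}),
    ("Disk-002", {2: "002-2", 4: "002-4"}),
    ("Disk-003", {1: "003-1", 6: "003-6"}),
]


def resolve(chain, grain):
    """Return (layer_name, data) from the newest layer holding the grain."""
    for name, grains in reversed(chain):
        if grain in grains:
            return name, grains[grain]
    # No layer holds the grain, so the read falls through to the base
    # object, which returns zeros.
    return chain[0][0], ZERO


if __name__ == "__main__":
    for grain in range(1, 7):
        layer, data = resolve(chain, grain)
        print(f"Grain {grain}: {layer} -> {data}")
```

This sketch walks the chain from top to bottom on every read, which is the cost the metadata cache is designed to avoid.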
Consider the case when a snapshot has been taken of a virtual machine. When a guest OS sends a write to disk, the vsanSparse driver receives the write. Writes always go to the top-most object in the snapshot chain. When the write is acknowledged, the vsanSparse driver updates its “in-memory” metadata cache and confirms the write back to the guest OS. On subsequent reads, the vsanSparse driver can reference its metadata cache and on a cache hit, immediately locate the data block.
Reads are serviced from one or more of the vsanSparse deltas in the snapshot tree. The vsanSparse driver checks the “in-memory” metadata cache to determine which delta or deltas to read. This depends on what parts of the data were written in a particular snapshot level. Therefore, to satisfy a read I/O request, the snapshot logic does not need to traverse through every delta of the snapshot tree but can go directly to the necessary vsanSparse delta and retrieve the data requested. Reads are sent to all deltas that have the necessary data in parallel.
On a cache miss, however, the vsanSparse driver must still traverse each layer to fetch the latest data. As with cached reads, these requests are sent to all layers in parallel.
The vsanSparse in-memory cache initially has “unknown” ranges. In other words, the cache is cold. When there is a read request from an unknown range, a cache miss is generated. This range is then retrieved and cached for future requests. As you might imagine, a cache miss increases the I/O latency.
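To make the hit and miss behaviour concrete, here is a small simulation of a metadata cache that starts cold and fills as reads and writes arrive. The `SparseChain` class and its fields are hypothetical, and for simplicity a miss walks the layers one by one rather than querying them in parallel as the real driver does.

```python
class SparseChain:
    """Toy snapshot chain with an in-memory grain -> layer index cache."""

    def __init__(self, layers):
        # layers[0] is the base object, layers[-1] is the running point.
        self.layers = layers
        self.cache = {}  # a grain missing from this dict is an "unknown" range
        self.hits = 0
        self.misses = 0

    def write(self, grain, data):
        # Writes always go to the top-most layer, then the cache is updated.
        top = len(self.layers) - 1
        self.layers[top][grain] = data
        self.cache[grain] = top

    def read(self, grain):
        if grain in self.cache:
            self.hits += 1
            index = self.cache[grain]
        else:
            self.misses += 1
            index = self._find_layer(grain)
            self.cache[grain] = index  # remember the answer for next time
        if index is None:
            return 0  # never written anywhere in the chain
        return self.layers[index][grain]

    def _find_layer(self, grain):
        # Cache miss: search from the newest layer down to the base.
        for index in range(len(self.layers) - 1, -1, -1):
            if grain in self.layers[index]:
                return index
        return None


if __name__ == "__main__":
    vm_disk = SparseChain([{1: "base-1", 2: "base-2"}, {1: "delta-1"}])
    vm_disk.read(1)           # cold cache: miss
    vm_disk.read(1)           # same grain again: hit
    vm_disk.write(2, "new")   # write populates the cache
    vm_disk.read(2)           # hit, served from the top layer
    print(vm_disk.hits, vm_disk.misses)
```

Discarding `cache` and starting again with an empty dictionary models what happens after a host failure or VM power cycle: the data is still correct, but the first read of each range pays the miss penalty again.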
The cache is held in memory and is never committed to persistent storage. This means that after a host failure, or a VM power-off or reboot, the cache is emptied. When the VM comes back and generates I/O, the cache refills.
Unlike vmfsSparse snapshots, vsanSparse snapshots have no retention limit. However, we recommend regularly checking the read cache usage and the vSAN datastore capacity when using snapshots for long periods of time.
Snapshots are a great tool, but it's important to monitor them to ensure they do not grow too large. To avoid potential performance issues, consider setting up a vCenter Server alarm to alert when a VM is running on a snapshot.
The vsanSparse snapshot format provides vSAN administrators with enterprise-class snapshots and clones. It improves snapshot performance by keeping the existing redo log mechanism while adding an “in-memory” metadata cache and a more efficient sparse filesystem layout. For more details, be sure to read the Introduction to vsanSparse snapshots tech note.