Today we have another guest post from Tech Support Engineer Mike Bean speaking casually about snapshots, a commonly misunderstood piece of the VMware solution.
If you ever want to make your VMware support representative cringe, just tell him or her you’re calling about a snapshot problem. Snapshots are very high on the list of misunderstood features. To complicate things, snapshot problems often result in data loss, and let’s be honest, data loss is never funny.
ESX anatomy 101
To understand how snapshots operate, it’s important to understand the composition of your average virtual machine. To be sure, various virtualization architectures exist, but VMware’s is fairly straightforward. Every virtual machine consists of two parts, a *.vmx, and a *.vmdk. You’ll fairly frequently see other components, but in the end, if you do not have a *.vmx, and a *.vmdk, you don’t have a virtual machine. As we dive a little deeper, the *.vmdk consists of two parts:
1) <File>.vmdk – This, in the jargon, is called the descriptor. It is what it sounds like: the file that contains the characteristics of the disk. If it’s lost, it can be re-created.
2) <File>-flat.vmdk – This is the actual disk. This is the money file. It is the deal breaker. The buck very definitely stops here. If the data is damaged, do not pass go, do not collect $200, just restore from a backup.
So, ultimately, our metaphorical VM will look something like this:
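For readers who prefer code to pictures, here is a minimal Python sketch of that layout. It assumes a hypothetical VM called myvm and sorts file names into the roles described above. The function names are made up for illustration, and the -delta.vmdk suffix for snapshot "changes" files is an assumption about naming that this article does not spell out.

```python
def classify_vm_file(name: str) -> str:
    """Guess the role of a file in a VM directory from its name alone."""
    if name.endswith(".vmx"):
        return "configuration"
    if name.endswith("-flat.vmdk"):
        return "disk data"          # the "money file"
    if name.endswith("-delta.vmdk"):
        return "snapshot delta"     # assumed naming for "changes" files
    if name.endswith(".vmdk"):
        return "disk descriptor"    # small file, can be re-created
    return "other"


def has_minimum_parts(file_names) -> bool:
    """True if the directory holds a .vmx, a descriptor, and a flat disk."""
    roles = {classify_vm_file(n) for n in file_names}
    return {"configuration", "disk descriptor", "disk data"} <= roles


# Hypothetical directory listing for a VM named "myvm".
listing = ["myvm.vmx", "myvm.vmdk", "myvm-flat.vmdk", "vmware.log"]
roles = {name: classify_vm_file(name) for name in listing}
complete = has_minimum_parts(listing)
```

Take away the flat file from the listing and has_minimum_parts returns False, which is the "you don’t have a virtual machine" case.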
Next, let’s add some secret sauce, and start taking some snapshots. ESX creates another descriptor, and starts creating a “changes” or delta file. The “changes” file is a continuous record of the block-level changes to the disk. This is an important concept. VMware snapshots, unlike SAN-based snapshots, ARE NOT COPIES. Almost everywhere else you look, a snapshot is a copy or an image. The typical assumption is that if something goes wrong with your disk or backup, you can revert to the image. In ESX, that dog won’t hunt.
As you continue to work, your changes are recorded in the delta file. If the original disk is hypothetically damaged, you CANNOT revert to the snapshot, because the snapshot is not an autonomous disk; and removing the changes will not repair the damage. (We can’t always know what caused the damage in the first place).
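To make the "not a copy" point concrete, here is a toy Python model of a base disk with one delta on top of it. This is only a sketch of the idea of recording block-level changes. It is not VMware's on-disk format, and the class names BaseDisk and Delta are invented for the example.

```python
class BaseDisk:
    """Stands in for the flat file: it holds every block of the disk."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def read(self, index):
        return self.blocks[index]


class Delta:
    """Stands in for a "changes" file: it holds only blocks written
    after the snapshot was taken and defers to its parent otherwise."""

    def __init__(self, parent):
        self.parent = parent
        self.changed = {}

    def write(self, index, data):
        self.changed[index] = data

    def read(self, index):
        if index in self.changed:
            return self.changed[index]
        return self.parent.read(index)


base = BaseDisk(["A", "B", "C"])
delta = Delta(base)              # "take a snapshot"
delta.write(1, "B2")             # keep working; only block 1 changes
current = [delta.read(i) for i in range(3)]

base.blocks[0] = "<corrupt>"     # now damage the original disk
# "Reverting" discards the delta and reads the base directly.
reverted = [base.read(i) for i in range(3)]
```

The delta never held a copy of block 0, so throwing the delta away leaves you with the damaged base and nothing else.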
Let’s add some additional snapshots to the mix. Take an additional snapshot, and what you’re really doing is tracking the block-level changes between the first snapshot and the VM’s current state. It doesn’t take a VMware Technical Support Engineer to see how this can get out of control very quickly. We call these structures “snapshot chains”.
Take a look at our snapshot chain. Let’s, for argument’s sake, poke a hole in it, and damage one of the delta files. UX/LX administrators out there will probably remember their old textbooks that discuss the difference between absolute and relative paths. The “changes” files behave like relative paths: each one only makes sense relative to the file before it in the chain. Because one of the “mile markers” is now, for want of a better term, damaged, ALL of the changes data after the damage is now suspect.
Generally speaking, have a problem with snapshot 3, and you’re fine, just revert to snapshot 2. If you have a problem with snapshot 2, snapshot 3 is now entirely unreliable, because the changes it records, no longer apply. Have a problem with snapshot 1, and snapshots 2 AND 3 are now suspect!
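The rule of thumb above fits in a few lines of Python. This is a simplified sketch that treats the chain as a plain list ordered from oldest to newest. The function name and the chain labels are hypothetical.

```python
def split_chain(chain, damaged):
    """Split a chain (oldest first) at the damaged link.

    Returns (still_trusted, suspect). The damaged link and every link
    recorded after it are suspect, because each link only describes
    changes relative to the one before it.
    """
    position = chain.index(damaged)
    return chain[:position], chain[position:]


chain = ["base", "snapshot 1", "snapshot 2", "snapshot 3"]

trusted_3, suspect_3 = split_chain(chain, "snapshot 3")
trusted_2, suspect_2 = split_chain(chain, "snapshot 2")
trusted_1, suspect_1 = split_chain(chain, "snapshot 1")
```

Damage to snapshot 3 leaves everything up to snapshot 2 usable, while damage to snapshot 1 leaves only the base disk.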
I’m sure you can see how this could lead to some unhappy people having unpleasant conversations! To illustrate, an office co-worker of mine once got a call from a company trying to recover a corrupted/damaged base disk that had YEARS’ worth of snapshots. It didn’t end well.
By now it should be readily apparent why snapshots do not make good backups. More to the point, it’s just not good digital asset management. A good backup infrastructure has to be able to stand on its own two feet; a spare tire in the trunk won’t help you if the check-engine light in your car comes on. In that sense, I’d like to propose an alternative way of thinking about the subject.
Engineers/software nerds fairly commonly use a concept that, for want of a better term, we’ll call version control. The code exists in a main branch or trunk. Write a new feature, or code a new bug-fix, and check the new code into the “build”. If the new bug-fix doesn’t work out, back it out and use the build prior to the fix. Ultimately, however, if the new bug-fix DOES work out, that, in essence, BECOMES the new build.
Humbly, I suggest we emulate this kind of thinking. Use snapshots not to create backups for your VMs, but as a form of version control. Snapshots are intended for short-term use only. Got an OS patch coming for a critical VM? Take a snapshot and wait a couple of days, perhaps a week. Once you’re certain the patch is viable and won’t cause excessive disruption, remove the snapshot!
I spoke to a customer once who had set up something he called his “nag script”. It routinely checked for the presence of snapshots older than a given interval, and began emailing the VMs’ custodians on a regular basis to remind them to remove them. SMART. If I’d had an ESX infrastructure of my own, I would’ve asked if he’d be willing to share the code for his “nag script”. I can’t even begin to describe how many support calls could be avoided completely with simple, faithful adherence to this principle.
Don’t misunderstand me: when used as version control, snapshots can be a powerful tool. My primary goal in writing this article is not to discourage snapshot use, but to encourage responsible snapshot use, and try to impart some sense of WHY it’s important.
I’ve said it before in previous articles and I’ll say it again: ultimately, the only safe policy is one of shared information (informed consent). I’ve spoken with numerous customers over the years who’ve viewed their support calls as an opportunity to learn/ask questions, and I’ve always tried to encourage that attitude. Sometimes they want to understand what I’m doing, sometimes they want to record the webex session, sometimes they just want to take notes, and we do our best to respond in kind! Ultimately, we’re all on the same side!
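I never saw the customer’s “nag script”, so what follows is not his code. It is a minimal Python sketch of the same idea, under some loud assumptions: the snapshot inventory is a hand-built list rather than something pulled from ESX, the SnapshotRecord fields and the seven-day limit are invented for the example, and the reminders are built as plain text instead of actually being emailed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotRecord:
    """One snapshot in a hypothetical inventory export."""
    vm_name: str
    snapshot_name: str
    created: datetime
    owner_email: str


def stale_snapshots(records, now, max_age=timedelta(days=7)):
    """Return the snapshots that are older than max_age."""
    return [r for r in records if now - r.created > max_age]


def reminder_text(record, now):
    """Build the nag message for one stale snapshot."""
    age_days = (now - record.created).days
    return (
        f"To: {record.owner_email}\n"
        f"Snapshot '{record.snapshot_name}' on VM '{record.vm_name}' "
        f"is {age_days} days old. Please remove it if it is no longer needed."
    )


# Made-up inventory and a fixed "today" so the example is repeatable.
now = datetime(2024, 1, 31)
inventory = [
    SnapshotRecord("web01", "pre-patch", datetime(2024, 1, 29), "alice@example.com"),
    SnapshotRecord("db01", "before-upgrade", datetime(2023, 12, 1), "bob@example.com"),
]

stale = stale_snapshots(inventory, now)
messages = [reminder_text(r, now) for r in stale]
```

In a real environment the inventory would come from whatever reporting tool you already trust, and the messages would be handed to your mail system on a schedule. The two-day-old pre-patch snapshot is left alone, which matches the "wait a couple of days, perhaps a week" advice.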
Until next time!
Addendum: Special thanks to Lisa Bernhardt (GSS, Storage Team) for helping translate!
Practice what you preach!!
Put the nag script INTO this blog post!!
Alan Renouf posted a SnapReminder script here http://www.virtu-al.net/2009/06/22/powercli-snapreminder/
Very useful script to remind users to delete snapshots. You will then be named the “Snapshot Nag” after you implement the script.
VizionCore’s Ecoshell has a nice reporting tool that generates a list of snapshots. I then email this to my manager who in turn contacts the appropriate custodians of the VM’s to remind them to remove their snapshots. I manually generate the report on the 1st business day of the month!