Hello Virtual World!
I’m writing this week’s column a little early, so my fan (Hi Mom!) will have something to read next week, as I am taking a short vacation. It’s a personal tradition of mine that I never work on my birthday, and haven’t for many years. (I’m stubborn like that!) So the ESX systems of the world will have to keep running without my meddling for a time!
It’s one of the peculiar ironies of my life that I have a tendency to learn things too late to really make use of them. One could very easily argue that I don’t have a monopoly on that; it’s a natural by-product of aging and wisdom, and it happens to everyone. More specifically, in my last job I ran a SAN reliability lab. It was my job to run thirty-some-odd servers engaged in more or less random read/write operations against approximately 2,400 hard drives. When they broke, I was expected to bring the problem to the attention of the local engineering staff. Over the course of this job, I learned something. To all my friends who are engineers, I apologize, but I’m about to characterize your entire profession under a stereotype. Engineers don’t like “it’s broke.”
Engineers like data; they like hard facts. They like information so reliable you can practically scratch a window with it. They like recorded, quantifiable statistics they can browse, copy, tweak, manipulate, reproduce, and experiment with. More often than not, they like this kind of information because it makes a problem easier to fix, and even if they can’t fix it, it gives them something to build on. The really insidious part of all this is that the attitude is infectious. After I started this job at VMware, I realized I’d caught the virus.
We get requests virtually every day from folks who’ve observed some sort of anomalous behavior from their servers, which they must account for, and they expect our help in doing that. We do our best to provide that help when it’s asked for, and we frequently ask for diagnostic data/vm-support bundles from the affected host. In fact, I would go so far as to suggest that it’s not a bad idea to automatically include these logs with EVERY support request you send to VMware (VMware KB 1008524). However, there seem to be a lot of common misconceptions about what we can actually learn from those logs. Here are a few of them, in no particular order.
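If you want to make collecting those bundles a habit, the gathering step is easy to script from a workstation. Here’s a rough sketch of my own (not a VMware-supplied tool) that assumes SSH key access to the classic ESX service console and that the standard vm-support command is on the PATH there; the host names and paths are placeholders, and the output parsing is deliberately loose since the exact wording varies by release.

```python
#!/usr/bin/env python
"""Rough sketch: gather vm-support bundles from a list of classic ESX hosts.

Assumes SSH key access to the service console and that `vm-support`
is on the PATH there. Host names and destinations are examples only.
"""
import subprocess

HOSTS = ["esx01.example.com", "esx02.example.com"]  # placeholder names

for host in HOSTS:
    # Run vm-support on the host; it reports the path of the tarball it wrote.
    out = subprocess.check_output(
        ["ssh", "root@" + host, "vm-support"], text=True
    )
    # Grab the last .tgz path mentioned in the output (wording varies by release).
    bundles = [word for word in out.split() if word.endswith(".tgz")]
    if not bundles:
        print("%s: no bundle path found in vm-support output" % host)
        continue
    bundle = bundles[-1]
    # Copy the bundle somewhere safe before opening the support request.
    subprocess.check_call(["scp", "root@%s:%s" % (host, bundle), "./"])
    print("%s: collected %s" % (host, bundle))
```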
#1 My VMs are behaving strangely! Check the logs please!
This is a common question, and it’s worth exploring, but it’s important not to put all your eggs in one basket, because ESX does not do a lot of logging at the guest operating system level! By and large, we concentrate on setting up the sandbox, and what the kids do within it is their business. That’s not to say we can’t learn things about events on a VM by examining its corresponding vmware.log, but it’s important to have realistic expectations. If you find yourself in the position of having to call VMware with this question, be prepared for us to ask you some questions right back.
- How, precisely, is it behaving strangely?
- How long has it been behaving strangely?
- What’s changed? (There is ALWAYS SOMETHING, even if we don’t yet know what.)
- What are the characteristics of the VM? What operating system does it use?
- How often does it show this kind of behavior?
- How many VMs are showing this behavior? (What do they have in common?)
(Wow, if any of the engineering staff from my previous job are reading this column, they’re probably laughing themselves silly right now.)
#2 There are timestamp gaps in my logs! ESX must have crashed!
This is another common one. It’s true that ESX is a prodigious logger, and under normal operating circumstances there will never be a time when it doesn’t have something to say, but if there are gaps, you can’t necessarily assume the host was down. All a gap proves is that the system was in a state where ESX couldn’t write to its own system files. There are multiple ways that might happen; hardware failure is the most common. I spoke with someone just this morning who had substantial gaps in his logs; he’d discovered bad DIMMs on the host.
Gaps in the system log imply that something’s wrong with the hardware, but they don’t always PROVE it. The point is, when ESX CRASHES, you’ll know it. That’s how it was designed. You’ll get one of our famous “purple screens” (see VMware KB 1006802 or KB 1009525 for examples). A purple screen is ESX’s way of saying that it’s run out of options: there are no remaining courses of action that don’t pose a substantial risk of data corruption, and for obvious reasons, ESX doesn’t “roll” like that. When the system hangs, it’s reasonable to conclude that ESX can’t write to its system files. If it can’t write to its system files, then it’s reasonable to assume it can’t write a crash dump either. So if you find yourself in this state, BEFORE you reboot the host (which we do understand you must do), ask yourself some fairly critical questions. After all, if you don’t ask them, we’ll probably ask you instead.
- What’s the date, time, and fully qualified server name?
- Are there any scheduled jobs running in the environment (such as backup jobs or batch jobs)?
- Is the service console pingable?
- Is VMkernel pingable?
- Are the virtual machines pingable?
- Can I connect to the host with the vSphere Client through vCenter?
- Can I connect to the host with the vSphere Client directly (without vCenter)?
- Can I connect to the physical console, either directly or via iLO?
If ANY of these things are true, don’t reboot right away. Give us a call; we may be able to help you find a way out!
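If you’d rather not type those pings by hand at three in the morning, a quick script run from your workstation can work through the reachability part of the checklist and leave you a record to read back to support. The sketch below is my own illustration, not a VMware tool; the addresses are placeholders, and it assumes a Unix-style ping that accepts the -c and -W flags.

```python
#!/usr/bin/env python
"""Quick reachability checklist for a hung ESX host (illustrative only).

Addresses are placeholders; swap in your service console, VMkernel,
and guest IPs. Assumes a Unix-style `ping` that accepts -c and -W.
"""
import datetime
import subprocess

CHECKS = {
    "service console": "192.0.2.10",   # example address
    "vmkernel":        "192.0.2.11",   # example address
    "guest vm":        "192.0.2.50",   # example address
}

def pingable(address):
    """Return True if a single ping gets an answer within two seconds."""
    result = subprocess.call(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result == 0

print("Checklist run at %s" % datetime.datetime.now().isoformat())
for name, address in CHECKS.items():
    status = "responds" if pingable(address) else "NO RESPONSE"
    print("  %-16s %-15s %s" % (name, address, status))
```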
#3 My ESXi host broke! Check the logs!
We (VMware’s technical support engineers) almost always flinch when we get these requests. I covered this in a previous blog, but it’s worth reiterating.
ESXi lives in RAMdisk!
We can certainly understand the need for urgency, and if you need to recover your systems, you may have to reboot. But you also need to understand that because ESXi does live in RAMdisk, it’s VERY COMMON for the logs to contain NOTHING prior to the reboot. Please understand, be aware of the risk, and don’t be surprised if your technician tells you there’s nothing to retrieve. ESXi HAD to give up something to get to its footprint size. Our engineers aren’t magicians; they worked hard to cram as many functions as they did into such a small memory footprint. The bottom line is that some hard choices had to be made, and some of the logging and diagnostic capability didn’t make the cut. That’s why it’s that much more important to use the capabilities ESXi does have, namely syslog and the vMA (vSphere Management Assistant). Trust me, if you use ESXi, put them both in your tool belt. It’ll only take ONE outage incident for an external syslog server to pay for itself. It might not happen today, or tomorrow, but someday you’ll be glad you had them.
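To show why an external syslog target earns its keep, here’s a bare-bones sketch of a UDP listener you could run on another box to confirm that log traffic from your ESXi host is actually arriving and getting written to disk somewhere that survives a reboot. It’s strictly a test harness of my own, not a replacement for a real syslog daemon, and it assumes you’ve already pointed the host at this machine on the standard syslog port (514).

```python
#!/usr/bin/env python
"""Bare-bones UDP syslog listener for testing (not a production syslogd).

Run it on the box your ESXi host forwards syslog to, then watch the
messages arrive. Binding to port 514 usually requires root privileges.
"""
import socket

LISTEN_ADDR = ("0.0.0.0", 514)  # standard syslog-over-UDP port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(LISTEN_ADDR)
print("listening for syslog on %s:%d" % LISTEN_ADDR)

with open("esxi-remote.log", "a") as logfile:
    while True:
        data, (src_ip, _) = sock.recvfrom(4096)
        line = data.decode("utf-8", errors="replace").rstrip()
        # Persist every message; these survive a reboot of the ESXi host.
        logfile.write("%s %s\n" % (src_ip, line))
        logfile.flush()
        print("%s %s" % (src_ip, line))
```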
That’s all for today. Barring acts of God, the economy, VMware’s management, and Prince (who apparently believes the internet is “over”: http://tinyurl.com/232jyv2), I’ll see you all in a few weeks.
Live well!