Today we have an interesting take from one of our Trainers, Linus Bourque. Linus once worked on the front lines in Support, but now trains customers in troubleshooting and our products (more bio info at the bottom of this post).
One of my favorite IRC quotes comes from bash.org:
<erno> hm. I’ve lost a machine.. literally _lost_. it responds to ping, it works completely, I just can’t figure out where in my apartment it is.
Although I can admit I’ve never run into this particular issue while troubleshooting, it does highlight something everyone in the I.T. community needs: the ability to troubleshoot, which is one of the most powerful skills anyone can have. While we can learn pretty much any product, knowing what to do when things go wrong is another skill set entirely.
For some, troubleshooting appears to be a magically divined talent bestowed upon a lucky few. For others, it feels like a massive project. In fact, it is neither. While I cannot instill in you the joy of hunting for solutions to the puzzles I received when I was part of VMware’s Global Support Services (GSS) as an Escalation Engineer (at the Burlington, ON offices), I can certainly speak a little bit to the concept of troubleshooting methodology.
When I’ve taught this previously, it was referred to as ACE: Analysis, Cause, and Effect. These three steps make up the core of troubleshooting and are highly repeatable. Let’s start with the first, Analysis.
Analysis is truly the most critical portion of the troubleshooting trifecta, since it forms the basis of every decision afterwards. The challenge is to determine what the problem actually is. If you’ve ever seen the YouTube video “The Website is Down”, you’ll notice that part of the trouble in that parody is that the tech never truly figures out what the sales guy means by “the website is down”. The result is a comedy of errors; funny in a video, but not something you want to experience yourself. So while asking “what is the problem?” is a good start, it won’t necessarily reveal the true problem. Understanding a problem can even be akin to determining the true nature of an ogre: it’s like an onion, with many layers to peel back before you find the core.
Part of that means knowing what the systems look like when they are running well. For ESX/ESXi, this means examining certain logs and behaviors on a regular basis. Logs like /var/log/vmkernel and /var/log/vmkwarning are good places to start. Additionally, get familiar with both esxtop and top (especially in the case of ESX) so you learn what is normal versus what isn’t. Most errors will appear in one of those logs, so look at them regularly (you can even examine them after pulling a support bundle from the server). The same can be said for vCenter (its logs can be found at <drive>:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\logs\), although in vCenter’s case the errors and the “good stuff” all land in the same log, so searching helps. If you have a variety of VMware enterprise products (say, ESX, ESXi, vCenter, View, etc.), you can make log review easier with a log consolidation tool like Splunk. Many of my View clients swear by it.
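To make that log habit concrete, here is a minimal sketch of the kind of first pass I mean. The sample log lines are invented for illustration (real vmkernel/vmkwarning entries will differ in format and content); on a live host or a support bundle you would point the same grep commands at /var/log/vmkwarning or /var/log/vmkernel.

```shell
#!/bin/sh
# Sketch: triaging ESX/ESXi logs for warnings.
# The sample entries below are made up for illustration only.

LOG=$(mktemp)
cat > "$LOG" <<'EOF'
vmkernel: cpu0: WARNING: ScsiDevice: path vmhba1:0:0 is down
vmkernel: cpu1: Net: link up on vmnic0
vmkernel: cpu0: WARNING: Heartbeat: no response from hostd
EOF

# On a real host, substitute /var/log/vmkwarning (or the copy
# inside a support bundle) for "$LOG".
grep -c 'WARNING' "$LOG"           # how noisy is the log? prints 2
grep 'WARNING' "$LOG" | tail -n 5  # the most recent warnings

rm -f "$LOG"
```

Running this daily (or diffing today’s counts against yesterday’s) is one cheap way to build the “what does normal look like” baseline before anything breaks.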
It is from here that you will be able to determine what the problem is, note the symptoms, and figure out whether any recent changes could have caused the issue. I think most of us are familiar with the “Did you change anything recently on the system?” kinds of questions we ask relatives when they have problems with their computers. The same is true in large environments. Ideally there is a change-control process in place, but in some environments that can be a challenge (especially when you are wearing all the hats that I.T. requires).
Once you know what the problem is, you determine the cause. This holds as long as no data damage has been done. If data has been compromised or lost, then this step may need to wait until the environment has been recovered first. And if recovery must be done, the original “broken” server should be set aside so that root-cause analysis can happen after the environment is restored to working order.
Some causes may be easy to pick out (e.g., an unplugged server, hard drive failures, smoke coming from a system) while others are a little more challenging (e.g., adding a newly installed ESXi system to vCenter fails). For this, you need to be like Sherlock Holmes, who famously said:
“Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.”
In this case, we start by eliminating the obvious (which can be hard to see precisely because it is right in front of us) and then go after the more difficult, not-so-obvious causes. For example, with that newly scripted ESXi install, we figured the script wasn’t finishing, so we tried a reboot, which did then allow the agent to install (although that wasn’t the true cause).
Never be afraid to include the obvious. The unplugged server was an actual ticket of mine and challenged both a customer and me for nearly two weeks, until he stayed one night and saw the cleaning staff unplug the server. Sometimes it is the bleeding obvious. There are some management tools that can help with determining the cause, but it will likely be further examination of logs that truly ferrets it out. Additionally, tools for specific “food groups” (in PSO Education many instructors use the term “four food groups” to reference CPU, memory, disk and network as the main parts of a computer, virtual or physical) may need to be employed. For example, if I’m investigating a networking issue I’ll usually get Wireshark or a similar packet-sniffing tool so that I can see packets as they traverse physical or virtual networks (remember to enable promiscuous mode on virtual switches).
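Before reaching for a food-group-specific tool, a rough first pass is simply sorting the log lines you’ve gathered into those four buckets so you know which tool to grab. Here is a minimal sketch; the keyword patterns are my own assumptions for illustration, not any official taxonomy, so tune them to the messages you actually see in your logs.

```shell
#!/bin/sh
# Sketch: a quick first pass at sorting log lines into the
# "four food groups" (CPU, memory, disk, network).
# The keyword patterns are illustrative assumptions only.

classify() {
  case "$1" in
    *[Ss]csi*|*[Dd]isk*|*vmhba*) echo disk ;;
    *[Nn]et*|*vmnic*|*link*)     echo network ;;
    *[Mm]em*|*swap*)             echo memory ;;
    *cpu*|*CPU*|*sched*)         echo cpu ;;
    *)                           echo other ;;
  esac
}

classify "WARNING: ScsiDevice: path vmhba1:0:0 is down"  # prints: disk
classify "Net: link down on vmnic2"                      # prints: network
```

If most of a day’s warnings land in one bucket, that tells you whether to reach for esxtop’s disk view, a packet sniffer, or something else entirely.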
Once we know what the problem and cause are, we can go about a fix. It is usually here that someone yells, “EUREKA! I know what’s happened and how to get it to work again!” Granted, software bugs would be our domain, but sometimes it can be a simple configuration issue or another item that you, as an administrator, can adjust to solve the problem. Once you’ve applied the fix, you need to ensure that it truly is fixed. Some fixes are immediate, while others may require waiting until a specific amount of time has passed before you know (usually because it takes X days for the issue to appear).
While this whole process seems obvious and straightforward, I still often get asked how to troubleshoot ESX/ESXi and vCenter. Sometimes being able to do some hands-on troubleshooting on a safe system can go a long way toward learning how to do so on a production system. Additionally, knowing which tools are right for diagnosing an issue is a skill in itself, and knowing how to use those tools isn’t always something we already know. This is where one of our newer courses, VMware vSphere: Troubleshooting, can come in handy.
This 4-day course (taught online and in class) is primarily hands-on. It forces you to use the ACE model to troubleshoot broken systems. We take you through some of the tools used, have you fix a simple system, and continue from there. The course covers the more common issues we see for storage, networking, vCenter, vMotion and other areas. (CPU and memory tend to be performance issues and are kept to the Performance course.) There is even a final “graduation” lab where you take what you learned – command line tools, log reading, network tools, storage tools, etc. – and put it to the final test. While it may not give you the answer to every problem, it should give you a good enough start to be proactive in your troubleshooting.
And as for poor Erno and his lost machine: if he had just virtualized it, it’d be easier to find. 😉
For further troubleshooting course info, see: http://mylearn.vmware.com/mgrreg/courses.cfm?ui=www_edu&a=one&id_subject=17829
Linus Bourque brief bio:
Linus is a Senior Technical Trainer with PSO Education, Americas Tech Team. He is also the lead instructor for VMware View and View Design. He lives in Los Angeles and doesn’t miss any of the snow although he does confuse students often with his funny Canadian accent.