Today we have a post from guest blogger Mike Bean, who is a support engineer at VMware in the Broomfield, Colorado office. Mike plans to provide us with a series he calls “Most Wanted”.
Good morning, VMware aficionados. As a VMware TSE (technical support engineer), I consider it my responsibility to help my customers manage their networks. It’s rather ironic: if I’m doing my job properly, fewer people need to call us!
I’d like to discuss our “Most Wanted” wall (I’m not speaking literally; we don’t actually have a wall with pictures of various glitches and posted reward money!). I intend to share some of our most common issues, in the hope that these articles will help you avoid some common mistakes. Most customers who send me service requests are not advanced ESX users. They typically have advanced backgrounds in Windows, UNIX, or Linux, and would usually like to learn more about ESX, but haven’t had the time or the opportunity. Accordingly, I will assume a level of core competency and troubleshooting skill, and little to no ESX background. We know well the semi-haunted look of the harried sysadmin who unwillingly had ESX dropped into his or her lap with little preparation. On to today’s topic:
ESX anatomy 101:
One fairly common mistake is to assume the IP address of your host is, in fact, the IP address of ESX. Check any of the common VCP study guides, and they’ll all emphasize the point: ESX and the “service console” are not the same thing.
ESX consists of two discrete entities: the “service console” and, for want of a better term, the VMkernel (ESX). The IP address of your host is, in fact, the service console’s. Think of the service console as a Linux virtual machine with a specific purpose. From the end user’s or administrator’s point of view, it exists to help you manage your host (a maintenance hatch). ESXi, on the other hand, does not have a service console, but we’ll save that rabbit hole for another time. The important takeaway here is that many ESX problems actually stem from issues on the service console.
“Help, my hosts/VMs are disconnected!”
Borrowing a line from the venerable Douglas Adams, don’t panic. Disconnected means exactly that. Disconnected does not mean OFF. ESX communicates with Virtual Center through what are loosely described as “management agents”. When you hear TSEs talk about “management agents”, we’re really talking about three things: vpxa, hostd, and vpxd.
Vpxa lives on the ESX host (on the service console) and communicates with vpxd. It’s mostly a listener service, and is very rarely an issue. Hostd also lives on the ESX host (on the service console). This is the big one: the vast majority of “disconnects” indicate a problem with hostd. Lastly, vpxd lives on your Virtual Center server. These three services form a communications chain, and failure of one or more of them tends to produce “disconnects”.
The good news is that, because the service console and VMkernel are separate, and VMkernel does the real work within ESX, problems and changes can and do occur on the service console without affecting your virtual machines. This is not to say the management agents will never affect VMkernel; what I am suggesting is trust, but verify. Ping your hosts, ping your VMs. Remote connect to them both. Try plugging your host address directly into your vSphere client. Success means the problem is probably on your Virtual Center server (vpxd). Failure means the problem is likely on the host (hostd, vpxa).
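The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not a VMware tool: the function names are made up for this example, and probing TCP port 443 is an assumption standing in for "can the client reach the host directly?". A successful TCP connection is a weaker test than an actual client login.

```python
import socket


def tcp_reachable(address, port=443, timeout=3.0):
    """Return True if a TCP connection to address:port succeeds.

    Port 443 is an assumption for this sketch; a real client login
    is the more reliable check.
    """
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def likely_culprit(direct_host_connection_ok):
    """Apply the rule of thumb: which end of the chain is suspect?"""
    if direct_host_connection_ok:
        # The host answers on its own, so look at Virtual Center.
        return "Virtual Center server (vpxd)"
    # The host does not answer directly, so look at the host agents.
    return "ESX host (hostd, vpxa)"
```

For example, `likely_culprit(tcp_reachable("192.0.2.10"))` would point at the host if nothing answers at that (hypothetical) service console address.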
To troubleshoot in greater detail, it is important to understand a few things. If the GUI is “disconnected”, and we can’t issue commands to ESX through the client, we need to do it another way. Enter the Service Console.
Security admins the world over cringe at that article. As well they should: root access is not for the timid. So it’s important to consult your organization’s security policies before permanently leaving root enabled on a service console. Those policies can and frequently do require the use of separate accounts, which can then switch user, or “su”, to root. Root is necessary, however, to restart the management agents.
Restarting the management agents isn’t a cure-all, but if they’ve failed or stopped, it’s a good step in the right direction.
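As a rough sketch of what "restarting the management agents" looks like in practice, the snippet below builds the restart commands and, by default, only prints them. The init-script names `mgmt-vmware` (hostd) and `vmware-vpxa` (vpxa) are the ones commonly documented for the classic ESX service console; treat them as assumptions and verify them against your ESX version. The Python wrapper itself is purely illustrative, and on the host you would normally just type the commands as root.

```python
import subprocess

# Assumed init-script names on a classic ESX service console:
# mgmt-vmware restarts hostd, vmware-vpxa restarts vpxa.
AGENT_SERVICES = ("mgmt-vmware", "vmware-vpxa")


def restart_commands(services=AGENT_SERVICES):
    """Build one 'service <name> restart' command per agent."""
    return [["service", name, "restart"] for name in services]


def restart_agents(dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for command in restart_commands():
        if dry_run:
            print("would run: " + " ".join(command))
        else:
            subprocess.check_call(command)
```

Calling `restart_agents()` with the default `dry_run=True` only shows what would be executed, which is a sensible first step on a production host.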
I like to illustrate things with actual case examples. Here is a real-world example of employing this knowledge:
Company A had a planned outage, and discovered to their chagrin that the ESX cluster wouldn’t connect to Virtual Center! We could enter the IP address of individual hosts into the vSphere client, and connect to each host just fine. That meant that hostd was running, but the hosts could not connect with vpxd on Virtual Center. Long story short, both their primary DNS servers were virtualized, and some investigation revealed they were down. The management agents are fairly DNS sensitive. Manually starting their DNS servers and restarting the management agents on the host allowed them to reconnect to Virtual Center and power up the rest of the hosts. Case solved!
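Since the management agents are DNS sensitive, a quick name-resolution check is a cheap early test. Here is a minimal sketch using Python's standard `socket` module; the function name and the hostnames in the usage note are hypothetical, and the `resolve` parameter exists only so the lookup can be swapped out.

```python
import socket


def unresolved_names(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that fail to resolve through DNS."""
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

Running something like `unresolved_names(["esx01.example.com", "vc.example.com"])` from the machine that is having trouble returns the names that could not be resolved. An empty list means DNS is answering for every name you asked about.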