Whenever an issue arises on an ESX host, people often rush to the Service Console and start typing various commands to figure out what is wrong. Of course, some of those commands are not available with ESXi, and you might not even have access to the ESXi console. Many will resort to the vCLI/vMA or even PowerCLI, and that works perfectly fine. The vCLI/vMA in particular is geared towards those who have experience with ESX command-line troubleshooting: you will have all "esxcfg-*" commands at your disposal, and of course resxtop, which will cover 95% of those cases where command-line details are required.
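For those who haven't used it, resxtop is the remote equivalent of esxtop and runs from the vCLI or the vMA. A minimal sketch (the hostname and credentials below are placeholders for illustration):

```shell
# Interactive mode against a remote ESXi host; you will be prompted
# for the password of the account given with --user.
resxtop --server esxi01.example.com --user root

# Batch mode: -b writes all counters in CSV form, -n limits the run
# to 10 samples, so the data can be analysed offline afterwards.
resxtop --server esxi01.example.com --user root -b -n 10 > perfstats.csv
```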

I want to stress that ESXi was not built for console access, although we do provide access to the console and it works fine. The idea behind ESXi is a lean hypervisor that is managed from the outside rather than the inside, and VMware provides multiple tools to do so, first and foremost vCenter Server and the vSphere Client. Many problems can be solved simply by using the vSphere Client connected to a host directly or through vCenter Server. The first KB article in the list below is a good example of how vCenter can be used to troubleshoot an inaccessible virtual machine. Something people tend to forget is that the vSphere Client can also be used to read log files; there is no need to open up a console session for that, as shown below and explained in the second article in the list:

  1. Open a browser and enter the URL http://<vCenter hostname>, where <vCenter hostname> is the IP or fully qualified domain name for the vCenter Server.
  2. Provide administrative credentials when prompted.
  3. Click the Browse datastores in the vCenter inventory link.
  4. Navigate the webpages until you reach the appropriate datacenter, datastore, and folder.
  5. Click the link to the appropriate log file, and open it with your preferred editor.
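If you prefer the command line, the same datastore browsing can be done remotely with vifs from the vCLI/vMA. The datastore name, path, and hostname below are placeholders for illustration:

```shell
# List the contents of a VM's folder on a datastore.
vifs --server esxi01.example.com --username root \
     --dir "[datastore1] myvm"

# Download a log file from the datastore to the local machine,
# where it can be opened with your preferred editor.
vifs --server esxi01.example.com --username root \
     --get "[datastore1] myvm/vmware.log" /tmp/vmware.log
```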

On the topic of log files, for those who have never worked with ESXi, the locations are slightly different from what you are used to:

  • The VMkernel, vmkwarning, and hostd logs are located at /var/log/messages
  • The Host Management service (hostd = Host daemon) log is located at /var/log/vmware/hostd.log
  • The vCenter Agent log is located at /var/log/vmware/vpx/vpxa.log
  • The System boot log is located at /var/log/sysboot.log
  • The Automatic Availability Manager (AAM) logs are located at /var/log/vmware/aam/vmware_<hostname>-xxx.log
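From Tech Support Mode the standard shell tools are enough to work through these files. A couple of sketches (the grep patterns are just one way to narrow things down; adjust them to the message prefixes you actually see in your logs):

```shell
# Follow the combined log live while reproducing the issue.
tail -f /var/log/messages

# Show recent VMkernel warnings from the combined log.
grep -i vmkwarning /var/log/messages | tail -20

# Check what the vCenter Agent was doing around the time of the problem.
less /var/log/vmware/vpx/vpxa.log
```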

Note that /var/log/messages is a combination of all of these logs except for the HA log; you will need to open that up separately when troubleshooting HA-related issues. Also note that the HA log files are unfortunately not part of the syslog mechanism either. Knowing the log files and the type of information you can get from them is key when troubleshooting. I encourage everyone to get familiar with them when you have the time to do so; under pressure you don't want to find yourself fiddling around in the wrong location or log file while four managers and your director watch over your shoulder, asking whether you have fixed it yet.

After you have dived into the log files, make sure you check the Knowledge Base. Our Knowledge Base has an excellent set of articles which can be used to troubleshoot very specific issues, or at least point you in the right direction. I have listed some of the most common issues and most-used KB articles below, including a link to each article, for your convenience:

  1. Restart the management agents on an ESXi host (1003490)
  2. Determining why a single virtual machine is inaccessible (1018834)
  3. Determining why a virtual machine was powered off or restarted (1019064)
  4. Determining why multiple virtual machines are inaccessible (1019000)
  5. Troubleshooting virtual machine network connection issues (1003893)
  6. Interpreting virtual machine monitor and executable failures (1019471)
  7. Determining why a virtual machine does not respond to user interaction at the console (1017926)
  8. Using Tech Support Mode in ESXi 4.1 (1017910)
  9. Determining why a VMware ESXi host is inaccessible (1019082)
  10. Determining why a VMware ESXi host was powered off or restarted (1019238)
  11. Determining why a VMware ESXi host does not respond to user interaction (1017135)
  12. Enabling serial-line logging for an ESXi host (1003900)
  13. Using performance collection tools to gather data for fault analysis (1006797)
  14. Using hardware NMI facilities to troubleshoot unresponsive hosts (1014767)
  15. Interpreting a VMware ESX host purple diagnostic screen (1004250)
  16. Troubleshooting VMware High Availability (HA) (1001596)

In some cases, however, it might be required or desirable to log in to Tech Support Mode (yes, this is fully supported) and work directly from the ESXi shell. As many of you know, the ESXi shell also contains all esxcfg-* commands, the invaluable esxcli command, and of course some shell commands that are required for troubleshooting. Some of those commands are obvious; others are less so. I have listed several below to make things easier.

One that many complained about in the past, but which actually is available, is vmkping. Vmkping can be used for basic network troubleshooting, but also, for instance, to validate whether jumbo frames can be used by simply specifying the size of the packet:

vmkping -s 9000 <ipaddress>
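Keep in mind that a 9000-byte ICMP payload plus the IP and ICMP headers exceeds a 9000-byte MTU, so the packet may simply be fragmented and the ping will succeed even without jumbo frames working end to end. A stricter test disables fragmentation and subtracts the 28 bytes of header overhead (the target address is a placeholder for a vmkernel interface on the jumbo-frame network):

```shell
# -d sets the don't-fragment bit; -s 8972 = 9000 MTU - 20 (IP) - 8 (ICMP).
# If this succeeds, jumbo frames work along the entire path.
vmkping -d -s 8972 10.0.0.1
```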

One thing many bumped into in the ESXi 4.0 time frame was the lack of a mount command. The mount command was actually available, but as part of busybox:

/usr/bin/busybox mount

In 4.1, though, the "mount" command has been linked to busybox, enabling you to just use "mount". The same applies to, for instance, fdisk. Fdisk will enable you to validate the partition setup. It has helped me many times in the past to validate that partitions were still marked as "VMFS" after someone accidentally presented VMFS volumes to Windows machines, which immediately resignatured the disks. Again, under 4.0 fdisk is not available as a binary but is available through busybox; in 4.1 it is available as a link. (Most of these links are located in /usr/sbin.)

/usr/bin/busybox fdisk -l

Another thing that I have done in the past regularly when I needed to evacuate a host is place the host in maintenance mode. With ESXi you can do this as follows:

vim-cmd hostsvc/maintenance_mode_enter

And of course you can also exit maintenance mode:

vim-cmd hostsvc/maintenance_mode_exit
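To verify the result, the host summary that vim-cmd returns includes the maintenance-mode flag; the grep below is just one way to pick it out of the output:

```shell
# Prints a line such as "inMaintenanceMode = true" when the host
# is in maintenance mode.
vim-cmd hostsvc/hostsummary | grep -i maintenancemode
```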

What about listing all VMs and stopping a specific one?

vim-cmd vmsvc/getallvms

vim-cmd vmsvc/power.off <vm id>
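Before powering a VM off hard, it is worth checking its current state and, where the VMware Tools are running in the guest, trying a graceful shutdown first:

```shell
# Look up the numeric VM id in the first column of getallvms output,
# then query the power state ("Powered on" / "Powered off").
vim-cmd vmsvc/power.getstate <vm id>

# Graceful guest shutdown; requires VMware Tools in the guest.
# Fall back to vmsvc/power.off only when this does not work.
vim-cmd vmsvc/power.shutdown <vm id>
```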

These are just examples to show the power of vim-cmd. Many try to avoid using it, but really it is not overly complex, and it gets the job done fairly simply. It can be difficult sometimes to figure out the syntax, but then again, if you can't figure it out, someone else probably has; Google it.

Something I was asked about this week, which can also come in handy when troubleshooting memory issues, is the following command, which will give you the memory utilization of the hypervisor components:

vdf -ph

These commands are just a couple of examples of what is possible within the ESXi shell. Although we generally recommend avoiding the ESXi shell (via remote or local Tech Support Mode) and prefer that you use the alternatives we offer, it will work fine. In general, troubleshooting hasn't changed much thanks to the full support of the Tech Support Mode feature, the remote command-line utilities (vCLI or the vMA), and of course vCenter and the vSphere Client.

About the Author

Duncan is a Chief Technologist working in the Storage and Availability Business Unit at VMware, serving as a partner and trusted adviser to EMEA customers. He is the co-author of several books, including the vSphere Clustering Deepdive series and Essential Virtual SAN. He is the owner and main author of the virtualization blog yellow-bricks.com.