Here is a guest post from one of our Tech Support Engineers and Knowledge Contributors:
My name is Daniel, and I have been working on improving the supportability of the ESXi 5.0 release since late 2009.
I am part of a fantastic team at VMware global support services that is focused on ensuring future VMware products are supportable at release time. Back when I joined this team, ESXi 5.0 was still in the early stages of development and it looked nothing like the product you see today. Since then, we've worked with the core engineering and product management teams to ensure that ESXi 5.0 was the most supportable ESXi release to date. It has been a long time coming, but I finally get a chance to write about some of the changes that we were able to implement that make ESXi 5.0 a great product to support.
Since our organization is focused completely on troubleshooting, my main focus has been to target changes that would improve the way we collect and review logs and core dumps from our products. Many of these changes and features do not have the same “wow” factor as a major feature like Storage DRS (link http://www.vmware.com/products/datacenter-virtualization/vsphere/vsphere-storage-drs/features.html), but they will go a long way in troubleshooting the products. Although I hope you do not encounter any vSphere 5.0 issues while in production, these changes will improve your support experience if you do encounter a problem.
Separate log files – One of the challenges that we identified when supporting ESXi was the way in which ESXi logged events. Many of the events generated by ESXi 4.x were placed in one file (/var/log/messages). Since many different components were sharing the same file, the log files rotated at an accelerated rate. Although most of these issues could be mitigated by implementing a Syslog server with longer retention policies, we wanted to ensure that the ESXi product retained a sufficient amount of logs out of the box. Retaining as much information as possible increases the chances that we will capture more logs, identify more patterns, and solve more issues. In addition to separating all of the logs, we have also introduced some logs that were previously available only in ESX classic. We re-introduced the vmksummary.log and vmkwarning.log log files, which help to provide an overview of system-wide outages. Overall, these changes will ensure we have more of the right information in the logs to help solve customer issues. For a list and description of the log files included in ESXi 5.0, please refer to KB 2004201 (link http://kb.vmware.com/kb/2004201).
Logging Target and Rotation Control – vSphere 5.0 has greatly improved the esxcli tool, and it now includes a namespace dedicated to log management. The namespace is “esxcli system syslog” and it has commands that display and change the logging size, rotation, and destination. This provides more granular control over log retention policies, and the commands can be executed locally or remotely. These logging options can be captured and replicated across multiple hosts using Host Profiles, to ensure that all hosts behave consistently in terms of logging. For more information, please refer to the Configuring ESXi Syslog Services section of the vSphere 5.0 Command line Interface documentation (link http://pubs.vmware.com/vsphere-50/topic/com.vmware.vcli.getstart.doc_50/cli_about.html).
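As a rough illustration of what this namespace looks like in practice, here is a small Python sketch that only assembles the command lines and prints them, without running anything. The option names (--loghost, --default-rotate, --default-size) and the reload subcommand are written as I recall them from the ESXi 5.0 documentation, so treat them as assumptions and check them against the reference linked above. The log host name and the numeric values are made up for the example.

```python
import shlex


def syslog_config_command(loghost=None, rotate=None, size=None):
    """Build an 'esxcli system syslog config set' argument list.

    The option names are assumptions recalled from the ESXi 5.0
    documentation; verify them before using the output on a host.
    """
    args = ["esxcli", "system", "syslog", "config", "set"]
    if loghost is not None:
        args.append(f"--loghost={loghost}")      # remote logging destination
    if rotate is not None:
        args.append(f"--default-rotate={rotate}")  # number of rotated files to keep
    if size is not None:
        args.append(f"--default-size={size}")    # size that triggers a rotation
    return args


# Assumed subcommand that makes the syslog service pick up the new settings.
RELOAD_COMMAND = ["esxcli", "system", "syslog", "reload"]

if __name__ == "__main__":
    # Hypothetical destination and retention values, for illustration only.
    commands = [
        syslog_config_command(
            loghost="tcp://loghost.example.com:514", rotate=20, size=2048
        ),
        RELOAD_COMMAND,
    ]
    for command in commands:
        print(shlex.join(command))
```

Keeping the commands as argument lists rather than one long string makes it easier to review them, or to hand them to whatever remote execution method you already use.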
Date and Time Stamps – At a high level, the date and time stamps in logs do not impact how the product behaves or performs. However, for a team that looks at logs every day, inconsistent stamps can be a bit of a headache. As far back as I can remember, ESX and ESXi have been made up of many different components that use different date and time conventions for logging. Over the years I’ve become accustomed to the differences, but they started to become a challenge, particularly when trying to automate some of the analysis (for example, making up for the fact that the vmkernel logs did not have a year). In order to close these gaps, we pushed to standardize the date and time format for all of the major logs.
Here is a sample of the vmkernel logs with the new date and time stamp:
2011-08-17T18:27:10.945Z cpu3:437397)VMMVMKCall: 194: Received INIT from world 437397
Here is a sample of the hostd logs with the new date and time stamp:
2011-08-24T20:43:55.495Z [74881B90 info 'VmwareCLI'] Dispatch mark
Make note that the date and time stamps are consistent across both log files. Both logs implement the ISO 8601 standard. For more information, see the ISO date standards webpage (http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/date_and_time_format.htm).
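To show why a consistent stamp helps with automated analysis, here is a minimal Python sketch that reads the leading timestamp from both sample lines above with a single pattern. The function name and the regular expression are my own illustration, not part of any VMware tool, and the sketch assumes every stamp ends in "Z" (UTC) as in the samples.

```python
import re
from datetime import datetime, timezone

# Leading UTC stamp such as 2011-08-17T18:27:10.945Z; the fraction is optional.
TIMESTAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d{1,6}))?Z\s+(.*)$"
)


def parse_line(line):
    """Return (timestamp, rest_of_line), or None if the line has no leading stamp."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    seconds, fraction, rest = match.groups()
    stamp = datetime.strptime(seconds, "%Y-%m-%dT%H:%M:%S")
    stamp = stamp.replace(tzinfo=timezone.utc)
    if fraction:
        # Pad the fraction to microseconds, e.g. "945" becomes 945000.
        stamp = stamp.replace(microsecond=int(fraction.ljust(6, "0")))
    return stamp, rest


vmkernel = ("2011-08-17T18:27:10.945Z cpu3:437397)VMMVMKCall: 194: "
            "Received INIT from world 437397")
hostd = "2011-08-24T20:43:55.495Z [74881B90 info 'VmwareCLI'] Dispatch mark"

if __name__ == "__main__":
    for sample in (vmkernel, hostd):
        stamp, rest = parse_line(sample)
        print(stamp.isoformat(), "|", rest)
```

Because the parsed values are timezone-aware, they can be sorted across log files or converted to a local timezone with astimezone() when comparing notes with someone in a different region.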
Landmarks in logs – One of the challenges of troubleshooting is identifying exactly when an issue occurs. This becomes a bit more challenging when a customer and support engineer are in different timezones, both of which may differ from the UTC time used by the ESXi host. Because it is so important to distinguish events that occur during normal operation from those that occur only during the issue or error, it is now possible to add markers in the logs. The esxcli tool includes a command called “esxcli system syslog mark” that can add a single log line with user-defined text to multiple ESXi host logs at once. You can generate an event that says “start operation”, then immediately perform the operation that causes the issue. Once the issue occurs, you can generate another event that says, for example, “stop operation”. This will allow the engineer to know exactly when the operation started and ended. Here is a small sample of the feature in use:
2011-08-24T20:43:50Z mark: begin timeout test (SR # 123456789)
<other events>
2011-08-24T20:43:55Z mark: end timeout test (SR # 123456789)
Once the logs are provided to the technical support engineer, they will be able to narrow their focus to the exact times indicated by the events generated by the “mark” command. Also, you only have to run the command once for each marker, and the mark will appear across all of the ESXi log files. For more information, see the esxcli system Command section of the vSphere Command-Line Interface reference (link http://pubs.vmware.com/vsphere-50/topic/com.vmware.vcli.ref.doc_50/esxcli_system.html).
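To illustrate how the markers narrow down a log, here is a short Python sketch that keeps only the lines between a begin mark and an end mark. It assumes marker lines contain the text "mark:" as in the sample above. The helper name and the surrounding log lines are invented for the example, and this is not how our internal tools necessarily work.

```python
def lines_between_marks(lines, begin_text, end_text):
    """Return the lines from the begin mark through the end mark, inclusive."""
    selected = []
    inside = False
    for line in lines:
        is_mark = "mark:" in line
        if not inside:
            # Skip everything until the begin marker shows up.
            if is_mark and begin_text in line:
                inside = True
                selected.append(line)
            continue
        selected.append(line)
        if is_mark and end_text in line:
            break  # stop at the first matching end marker
    return selected


# The two marker lines come from the sample above; the other lines are hypothetical.
log = [
    "2011-08-24T20:43:48Z event before the test",
    "2011-08-24T20:43:50Z mark: begin timeout test (SR # 123456789)",
    "2011-08-24T20:43:52Z event during the test",
    "2011-08-24T20:43:55Z mark: end timeout test (SR # 123456789)",
    "2011-08-24T20:43:58Z event after the test",
]

if __name__ == "__main__":
    for line in lines_between_marks(log, "begin timeout test", "end timeout test"):
        print(line)
```

If the end marker is missing, the sketch returns everything from the begin marker to the end of the log, which is usually the safer failure mode when you are hunting for an issue.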
Remote core dump and log destinations – vSphere 5.0 includes the vSphere Syslog Collector as well as the vSphere ESXi Dump Collector, which provide a central place to store all of your logs and core dumps. These were a necessity for implementing stateless and diskless ESXi hosts using Autodeploy, but the changes that came with them have really simplified and standardized the steps required to set up syslog and core dump locations.
For more information on setting up an ESXi host for remote core dumps, see the Configure ESXi Dump Collector with ESXCLI section of the vSphere installation and setup guide (link http://pubs.vmware.com/vsphere-50/topic/com.vmware.vsphere.install.doc_50/GUID-85D78165-E590-42CF-80AC-E78CBA307232.html). For more information on setting up remote syslog, please refer to the Configuring ESXi Syslog Services section of the vSphere Command line Interface documentation (link http://pubs.vmware.com/vsphere-50/topic/com.vmware.vcli.examples.doc_50/cli_performance.12.5.html).
Command-line utilities and user interface enhancements for collecting support bundles – The vm-support script has been completely rewritten from the ground up. It is now possible to initiate a performance-gathering support bundle from the GUI. Also, there is an entirely new manifest system that allows a user to pick and choose a subset of manifests to include in the support bundle. Many of the options are not documented, but the usage information is available by running the vm-support --help command. For more information, see KB 653 and KB 1010705.
In the midst of a major product release with big features such as Autodeploy and Storage DRS, these improvements may not receive as much love. You may not see these features documented in the What’s New pages, the Datasheets, or other Marketing materials that usually go hand in hand with product launches, but rest assured that they are in the product. Although we do our best to ship a flawless product, issues and support are an inevitable part of any software release. Every feature that we've added is aimed at ensuring that we continue to provide the world-class customer service and technical support that VMware is known for.
– Daniel