Here’s our third log deep-dive from Nathan Small (Twitter handle: vSphereStorage). If you missed the first one, here it is. Funny how all of these problems point to a KB article that solves the issue!
History of issue
A host went unresponsive in vCenter last night and SSH connections to the host failed. The host was rebooted to bring it back online, and a root cause analysis has now been requested.
To determine how long the server was actually unresponsive, we need to check /var/log/vmksummary. When we do this, we see the server was not logging for 3 days, which could point to a system hang:
Mar 25 04:01:17 esx04 logger: (1332666076) hb: vmk loaded, 6169963.27, 6169603.352, 43, 398348, 398348, 1260, vpxa-125984, vmware-h-101788, sfcbd-19084
Mar 25 05:01:16 esx04 logger: (1332669676) hb: vmk loaded, 6173563.07, 6173202.948, 43, 398348, 398348, 1260, vpxa-126004, vmware-h-101788, sfcbd-19044
Mar 28 10:15:20 esx04 vmkhalt: (1332947720) Starting system…
Mar 28 10:15:52 esx04 logger: (1332947752) loaded VMkernel
NOTE: If the server had been gracefully shut down or rebooted, you would see a message stating this on the line before “Starting system…”
What is interesting to note is that while the host was observed to be unresponsive in vCenter only the previous night, the system had not been logging for over 3 days.
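The size of the logging gap can be computed from the epoch values in parentheses on each vmksummary line. The following is a minimal Python sketch, not a VMware tool: the function name logging_gaps and the 2-hour threshold are assumptions chosen for illustration, and the sample lines are the ones shown above.

```python
import re
from datetime import timedelta

# Matches the epoch seconds in parentheses, e.g. "logger: (1332669676) hb: ..."
EPOCH_RE = re.compile(r"\((\d{9,})\)")

def logging_gaps(lines, threshold=timedelta(hours=2)):
    """Return (previous_epoch, next_epoch, gap) for every gap above threshold.

    The 2-hour default is an assumption; the heartbeats above are hourly.
    """
    gaps = []
    previous = None
    for line in lines:
        match = EPOCH_RE.search(line)
        if match is None:
            continue  # skip lines that carry no epoch stamp
        current = int(match.group(1))
        if previous is not None:
            gap = timedelta(seconds=current - previous)
            if gap > threshold:
                gaps.append((previous, current, gap))
        previous = current
    return gaps

sample = [
    "Mar 25 04:01:17 esx04 logger: (1332666076) hb: vmk loaded, 6169963.27, 6169603.352, 43, 398348, 398348, 1260, vpxa-125984, vmware-h-101788, sfcbd-19084",
    "Mar 25 05:01:16 esx04 logger: (1332669676) hb: vmk loaded, 6173563.07, 6173202.948, 43, 398348, 398348, 1260, vpxa-126004, vmware-h-101788, sfcbd-19044",
    "Mar 28 10:15:20 esx04 vmkhalt: (1332947720) Starting system…",
    "Mar 28 10:15:52 esx04 logger: (1332947752) loaded VMkernel",
]

for start, end, gap in logging_gaps(sample):
    print(f"No log entries between {start} and {end}: {gap}")
```

For the four lines above this reports a single gap of 3 days, 5:14:04 between the last heartbeat and “Starting system…”.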
At this point we should gather more information about the physical server:
Server Info gathered from ‘dmidecode’:
Manufacturer: Dell Inc.
Product Name: PowerEdge R900
Vendor: Dell Inc.
Release Date: 04/15/2010
Next, it is important to determine the ESX version:
# vmware -v
VMware ESX 4.0.0 build-398348
This ESX version is prior to 4.1, so the IRQ remapping issue does not apply. This suggests the host stopped logging because of a hardware problem, although exactly how or why is not yet known. We know the host stopped logging for 3 days, so we should start with the device that communicates with the local disks, where /var/log is stored.
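The version check can be sketched in a few lines of Python. This is only an illustration: parse_esx_version is a hypothetical helper, the regular expression assumes the banner format shown in the ‘vmware -v’ output above, and the 4.1 cutoff simply restates the reasoning in this article.

```python
import re

def parse_esx_version(banner):
    """Extract ((major, minor, patch), build) from a `vmware -v` banner.

    Hypothetical helper; assumes the format "VMware ESX 4.0.0 build-398348".
    """
    match = re.search(r"ESXi?\s+(\d+)\.(\d+)\.(\d+)\s+build-(\d+)", banner)
    if match is None:
        raise ValueError(f"unrecognized version banner: {banner!r}")
    major, minor, patch, build = (int(group) for group in match.groups())
    return (major, minor, patch), build

version, build = parse_esx_version("VMware ESX 4.0.0 build-398348")

# Per the article, the IRQ remapping issue is only a candidate on 4.1 and later.
irq_remapping_candidate = version >= (4, 1, 0)
print(version, build, irq_remapping_candidate)  # (4, 0, 0) 398348 False
```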
Since this host is a Dell server, it uses the PERC (PowerEdge RAID Controller), which is a rebranded LSI controller. The controller contains a log that is maintained on the hardware itself and can be extracted with the ‘lsi_log’ command that is available on ESX hosts.
When we run this command to pull the logs off the controller we can see a rather alarming event:
Event Sequence Number : 9658
Timestamp : 3/25/2012 ; 10:19:9
Event code : 15
Locale : Boot/Shutdown event
Class : Dead (visible after next controller boot)
Description of the event : Fatal firmware error: Line 156 in ../../raid/1078int.c
Argument Type Value: 18
Argument Type : MR_EVT_ARGS_STR
String : Line 156 in ../../raid/1078int.c
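When a controller log holds thousands of events, it can help to parse each one into key/value pairs and flag the fatal ones. The sketch below assumes every field follows the “Key : Value” layout of the excerpt above. The helper names parse_event and is_fatal are hypothetical, and the real ‘lsi_log’ output may contain fields or layouts not shown here.

```python
from datetime import datetime

def parse_event(lines):
    """Turn 'Key : Value' lines into a dict, splitting on the first colon only."""
    event = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            event[key.strip()] = value.strip()
    return event

def is_fatal(event):
    """Hypothetical check: flag events that look like the one in this article."""
    return (event.get("Class", "").startswith("Dead")
            or "Fatal firmware error" in event.get("Description of the event", ""))

raw = """\
Event Sequence Number : 9658
Timestamp : 3/25/2012 ; 10:19:9
Event code : 15
Locale : Boot/Shutdown event
Class : Dead (visible after next controller boot)
Description of the event : Fatal firmware error: Line 156 in ../../raid/1078int.c
Argument Type Value: 18
Argument Type : MR_EVT_ARGS_STR
String : Line 156 in ../../raid/1078int.c
"""

event = parse_event(raw.splitlines())
# strptime accepts the unpadded month and seconds used in this timestamp.
when = datetime.strptime(event["Timestamp"], "%m/%d/%Y ; %H:%M:%S")
print(when, event["Event code"], is_fatal(event))
```

The time zone of the controller timestamp is not stated in the log, so treat any comparison against the vmksummary timestamps with care.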
Searching our Knowledge Base for the string “Fatal firmware error” returns only one article, but it is a perfect match: KB 2011987 (http://kb.vmware.com/kb/2011987). This customer will need to update the firmware of their PERC controller. To prevent outages on other hosts with the same hardware configuration, we advised that they do the same for any Dell PERC or IBM ServeRAID controllers, as these are rebranded LSI controllers.