Hardware Health Monitoring via CIM, part 4

In our last post, we started to dive deeper into writing CIM client code and looked at some of the IPMI-based data. In this post, we'll explore some of the log data that comes from IPMI in a little more detail and touch on a few new CIM concepts.

One feature of IPMI that is quite useful in diagnosing and ultimately recovering from hardware faults is a component called the System Event Log, or SEL. The SEL is typically implemented as a fixed-length buffer that accumulates entries until it fills up, and then stops recording until someone clears it. The theory goes that when you have a cascade of failures, the earlier failures are more pertinent, and later failures are likely just ripple effects of the earlier ones, so if the older entries were overwritten, you would lose the most important entries. The downside to this model is that if you never clear the SEL, cruft accumulates over time, and when you actually do have a major problem, the system will be unable to record log entries.

First, let's take a look at an instance of the class that represents the SEL itself (not the entries in the SEL), using the same basic code from our prior examples:


OMC_IpmiRecordLog

                  AddTimeStamp = 20100507123553.000000+000

                       Caption = IPMI SEL

        CurrentNumberOfRecords = 307

                   ElementName = IPMI SEL

                EnabledDefault = 2 (Enabled)

                  EnabledState = 2 (Enabled)

                EraseTimeStamp = 20080807211343.000000+000

                         Flags = 2

                   HealthState = 5 (OK)

                    InstanceID = IPMI:vmware-host SEL Log (Node 0)

                      LogState = 4 (Not Applicable)

            MaxNumberOfRecords = 512

            MemoryHealthStatus = 5

       MemoryOperationalStatus = 2

                          Name = IPMI SEL Log

             OperationalStatus = [2L]

                  OverFlowFlag = False

               OverwritePolicy = 7 (Never Overwrites)

                RequestedState = 12 (Not Applicable)

          TransitioningToState = 12

                       Version = 81
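The CurrentNumberOfRecords and MaxNumberOfRecords properties make it easy to estimate how full the log is. The following is a minimal sketch, not part of the original tool: the helper names and the 80% threshold are assumptions chosen for illustration, and a plain dict stands in for the instance (a pywbem instance can be indexed by property name in the same way).

```python
def sel_fill_ratio(sel):
    """Return the fraction of the SEL that is in use, from 0.0 to 1.0."""
    current = int(sel['CurrentNumberOfRecords'])
    maximum = int(sel['MaxNumberOfRecords'])
    if maximum <= 0:
        raise ValueError('MaxNumberOfRecords must be positive')
    return float(current) / maximum


def sel_needs_attention(sel, threshold=0.8):
    """Hypothetical policy: flag the log once it passes the threshold."""
    return sel_fill_ratio(sel) >= threshold


# Values copied from the OMC_IpmiRecordLog instance shown above.
example_sel = {'CurrentNumberOfRecords': 307, 'MaxNumberOfRecords': 512}
example_ratio = sel_fill_ratio(example_sel)
```

A check like this could run periodically so that the log is reviewed and cleared before it stops accepting new entries.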

You can see that this particular system has accumulated 307 entries (CurrentNumberOfRecords) out of a total capacity of 512 (MaxNumberOfRecords), so the log is already about 60% full. Here are a few examples of some of the entries in the SEL:


OMC_IpmiLogRecord

                       Caption = Assert + Power Unit Power off/down

             CreationClassName = OMC_IpmiLogRecord

                   Description = Assert + Power Unit Power off/down

                   ElementName = IPMI SEL

          LogCreationClassName = OMC_IpmiRecordLog

                       LogName = IPMI SEL

              MessageTimestamp = 00000000000024.000000:000

                    RecordData = *1.0.32*47 1*2*24 0 0 0*32 0*4*9*1*false*111*0*255*255*1*

                  RecordFormat = *string CIM_Sensor.DeviceID*uint8[2] IPMI_RecordID*uint8 IPMI_RecordType*uint8[4] IPMI_Timestamp*uint8[2] IPMI_GeneratorID*uint8 IPMI_EvMRev*uint8 IPMI_SensorType*uint8 IPMI_SensorNumber*boolean IPMI_AssertionEvent*uint8 IPMI_EventType*uint8 IPMI_EventData1*uint8 IPMI_EventData2*uint8 IPMI_EventData3*uint32 IANA*

                      RecordID = 303

OMC_IpmiLogRecord

                       Caption = Deassert + Power Unit Power off/down

             CreationClassName = OMC_IpmiLogRecord

                   Description = Deassert + Power Unit Power off/down

                   ElementName = IPMI SEL

          LogCreationClassName = OMC_IpmiRecordLog

                       LogName = IPMI SEL

              MessageTimestamp = 00000000000217.000000:000

                    RecordData = *1.0.32*48 1*2*137 0 0 0*32 0*4*9*1*true*111*0*255*255*1*

                  RecordFormat = *string CIM_Sensor.DeviceID*uint8[2] IPMI_RecordID*uint8 IPMI_RecordType*uint8[4] IPMI_Timestamp*uint8[2] IPMI_GeneratorID*uint8 IPMI_EvMRev*uint8 IPMI_SensorType*uint8 IPMI_SensorNumber*boolean IPMI_AssertionEvent*uint8 IPMI_EventType*uint8 IPMI_EventData1*uint8 IPMI_EventData2*uint8 IPMI_EventData3*uint32 IANA*

                      RecordID = 304
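The RecordFormat string names the fields packed into RecordData, with the first character of each string acting as the delimiter. Below is a rough sketch of pairing the two up. The function names are made up for this example, and the little-endian byte order is an assumption inferred from the sample data (the bytes "47 1" correspond to RecordID 303), not something taken from a specification.

```python
def _convert(type_name, text):
    """Convert one delimited field according to its declared type."""
    if type_name == 'string':
        return text
    if type_name == 'boolean':
        return text.lower() == 'true'
    if type_name.endswith(']'):
        # Array types such as uint8[4] are space-separated numbers.
        return [int(part) for part in text.split()]
    return int(text)


def decode_sel_record(record_format, record_data):
    """Return a dict mapping field names to decoded values."""
    delimiter = record_format[0]
    # Both strings start and end with the delimiter, so drop the
    # empty first and last pieces produced by split().
    specs = record_format.split(delimiter)[1:-1]
    values = record_data.split(delimiter)[1:-1]
    if len(specs) != len(values):
        raise ValueError('RecordFormat and RecordData do not line up')
    record = {}
    for spec, text in zip(specs, values):
        type_name, field_name = spec.split(' ', 1)
        record[field_name] = _convert(type_name, text)
    return record


def little_endian(byte_values):
    """Combine bytes into an integer, assuming least significant byte first."""
    return sum(b << (8 * i) for i, b in enumerate(byte_values))


FORMAT = ('*string CIM_Sensor.DeviceID*uint8[2] IPMI_RecordID'
          '*uint8 IPMI_RecordType*uint8[4] IPMI_Timestamp'
          '*uint8[2] IPMI_GeneratorID*uint8 IPMI_EvMRev'
          '*uint8 IPMI_SensorType*uint8 IPMI_SensorNumber'
          '*boolean IPMI_AssertionEvent*uint8 IPMI_EventType'
          '*uint8 IPMI_EventData1*uint8 IPMI_EventData2'
          '*uint8 IPMI_EventData3*uint32 IANA*')
DATA = '*1.0.32*47 1*2*24 0 0 0*32 0*4*9*1*false*111*0*255*255*1*'

record = decode_sel_record(FORMAT, DATA)
```

With the first sample entry, record['IPMI_RecordID'] is the byte pair [47, 1], which little_endian() turns into 303, matching the RecordID property of that entry.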

The Description field is a somewhat human-readable description of what the entry represents, while RecordData and RecordFormat are the "raw" data from the SEL entry. One other interesting field to look at is MessageTimestamp. CIM has two ways of representing time: one is "relative" and one is "absolute." If the separator near the end of the string is a ":" then it is a relative (interval) time. If the separator is "+" or "-" then the last 3 digits are a UTC offset in minutes. In the case of these two instances, they're both relative, which means they were logged after the BMC was supplied with power, but before the BIOS programmed the BMC with a date (so before POST).
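To make the two timestamp forms concrete, here is a small dependency-free sketch that tells them apart by the separator character. The function name and the shape of the return value are choices made for this example; pywbem also ships its own CIMDateTime type, which is not used here.

```python
from datetime import datetime, timedelta


def parse_cim_datetime(value):
    """Parse a 25-character CIM datetime string.

    Returns (kind, value, utc_offset_minutes), where kind is
    'interval' or 'absolute'. The offset is None for intervals.
    """
    if len(value) != 25 or value[14] != '.':
        raise ValueError('not a CIM datetime: %r' % value)
    separator = value[21]
    microseconds = int(value[15:21])
    if separator == ':':
        # Interval layout: ddddddddhhmmss.mmmmmm:000
        interval = timedelta(days=int(value[0:8]),
                             hours=int(value[8:10]),
                             minutes=int(value[10:12]),
                             seconds=int(value[12:14]),
                             microseconds=microseconds)
        return ('interval', interval, None)
    if separator in ('+', '-'):
        # Absolute layout: yyyymmddhhmmss.mmmmmmsutc
        moment = datetime.strptime(value[0:14], '%Y%m%d%H%M%S')
        moment = moment.replace(microsecond=microseconds)
        offset = int(value[22:25])
        if separator == '-':
            offset = -offset
        return ('absolute', moment, offset)
    raise ValueError('unknown separator %r' % separator)


relative = parse_cim_datetime('00000000000217.000000:000')
absolute = parse_cim_datetime('20100507123553.000000+000')
```

The second sample entry parses as an interval of 2 minutes 17 seconds (137 seconds), which lines up with the "137 0 0 0" bytes in the IPMI_Timestamp field of its RecordData.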

So how do we go about clearing the SEL? Up to this point, we've only looked at some of the "read-only" aspects of CIM. To do an operation like clearing the SEL, we have to learn how to invoke methods on objects.

But first we need to talk about the difference between an instance and a reference to that instance, also known as an Object Path. In CIM, when you do an operation like "EnumerateInstances," the results you get back are complete instances, with all of the object's properties filled out. An Object Path, by contrast, contains just the "key" properties and their values that identify the instance in question, along with the class name, namespace and hostname. If you're a programmer, you can think of this as a kind of "remote pointer." If you want to think about it in database terms, the instance is like the full row of the table, and the object path is just the key columns of the row. The idea is that when you're doing operations or other tasks where you just need to refer unambiguously to a specific thing, you don't want to pass around all the properties, just the bare essentials needed to be unique.

Here's some code that will perform the SEL clearing operation. It takes advantage of a pywbem trick to get the object path from an existing instance.


import pywbem

def doClearSEL(client, selInstance):
    # InvokeMethod operates on "object paths", not instances
    selPath = selInstance.path
    try:
        print('Clearing SEL for %s' % selPath['InstanceID'])
        # ClearLog returns a status code: 0 = completed, 1 = not supported
        (retval, outparams) = client.InvokeMethod('ClearLog', selPath)
        if retval == 0:
            print('Completed')
        elif retval == 1:
            print('Not supported')
        else:
            print('Error: %s' % retval)
    except pywbem.CIMError as arg:
        # CIMError carries (status code, description)
        print('Exception: %s' % arg.args[1])

Most of the code is just error handling and reporting. We're invoking the method called "ClearLog" on the SEL's object path.
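For readers who find the instance versus object path distinction abstract, the database analogy can be sketched in plain Python. This is only an illustration, not pywbem's API: the ObjectPath tuple and the helper are hypothetical, and the host name and the root/cimv2 namespace are assumed values.

```python
from collections import namedtuple

# A stand-in for an object path: where the object lives plus its key values.
ObjectPath = namedtuple('ObjectPath',
                        ['host', 'namespace', 'classname', 'keybindings'])


def object_path_for(instance, classname, key_names, host, namespace):
    """Keep only the key properties of an instance, like the key columns of a row."""
    keybindings = dict((name, instance[name]) for name in key_names)
    return ObjectPath(host, namespace, classname, keybindings)


# A few of the properties from the SEL instance shown earlier (the "full row").
sel_instance = {
    'InstanceID': 'IPMI:vmware-host SEL Log (Node 0)',
    'ElementName': 'IPMI SEL',
    'CurrentNumberOfRecords': 307,
    'MaxNumberOfRecords': 512,
}

# InstanceID is the key used by the clearing code above.
sel_path = object_path_for(sel_instance, 'OMC_IpmiRecordLog', ['InstanceID'],
                           host='vmware-host', namespace='root/cimv2')
```

The resulting path carries a single key binding, which is enough for the server to locate the log without the client sending every property back.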

The attached example puts this all together in one small tool that can dump the SEL information, or clear it if you pass the optional "-c" argument.