
Hardware Health Monitoring via CIM, part 3

In our last post, we looked at some SMBIOS-based asset information. In this post we'll start to explore what information is available from IPMI.

As with our previous posts, let's explore the relevant technology a little before we dive into code. IPMI, the Intelligent Platform Management Interface, is a specification that defines how you can monitor and manage the health and hardware assets of a system using what is called a Baseboard Management Controller, or BMC. The BMC is a secondary processor/controller that runs in the system and typically has its own power tap on the power supply, so it's live even when the main system is powered off (as long as the power supply is plugged in). Most modern server-class x86 systems come with a BMC that supports IPMI.

The BMC is hooked up to various sensors and components on the system and can monitor their state independently of the primary CPU and the running operating system. Some examples of the types of things BMCs may monitor are chassis intrusion sensors, temperature sensors, presence sensors for devices like hot-plug disk drives, and power supplies. Often the BMC is available on the network through a secondary IP address and can be used for remote management of the system, even if it is powered off (say, to power it on, a capability that vCenter's Distributed Power Management feature leverages).

We're not going to focus on remote IPMI in this post, but instead look at the types of information you can retrieve from the BMC via the CIM implementation within ESX. The IPMI specification gives system manufacturers a lot of flexibility in the type and number of sensors they implement. As a result, the data you see on one system may not match the data you see on another, particularly if those systems are from different manufacturers.

IPMI data maps to a number of different profiles. We'll take a look at two in this set of example code, namely DSP1009 (Sensors Profile) and DSP1015 (Power Supply Profile). In this example, we'll also touch on another aspect of CIM technology that can be a little tricky for beginners. The CIM schema often relies on enumerated types or, as CIM refers to them, Value Maps. When you look at the value of a property in an instance, it may be numeric, but that number maps to a well-defined set of values captured in the class definition in the MOF. Let's look at the HealthState property as a simple example of how this works. Here's an excerpt from the CIM_ManagedSystemElement MOF where this property is defined.

      [Description (
          "Indicates the current health of the element. This "
          "attribute expresses the health of this element but not "
          "necessarily that of its subcomponents. The possible "
          "values are 0 to 30, where 5 means the element is "
          "entirely healthy and 30 means the element is completely "
          "non-functional. The following continuum is defined: \n"
          "\"Non-recoverable Error\" (30) - The element has "
          "completely failed, and recovery is not possible. All "
          "functionality provided by this element has been lost. \n"
          "\"Critical Failure\" (25) - The element is "
          "non-functional and recovery might not be possible. \n"
          "\"Major Failure\" (20) - The element is failing. It is "
          "possible that some or all of the functionality of this "
          "component is degraded or not working. \n"
          "\"Minor Failure\" (15) - All functionality is available "
          "but some might be degraded. \n"
          "\"Degraded/Warning\" (10) - The element is in working "
          "order and all functionality is provided. However, the "
          "element is not working to the best of its abilities. For "
          "example, the element might not be operating at optimal "
          "performance or it might be reporting recoverable errors. \n"
          "\"OK\" (5) - The element is fully functional and is "
          "operating within normal operational parameters and "
          "without error. \n"
          "\"Unknown\" (0) - The implementation cannot report on "
          "HealthState at this time. \n"
          "DMTF has reserved the unused portion of the continuum "
          "for additional HealthStates in the future." ),
       ValueMap { "0", "5", "10", "15", "20", "25", "30", ".." },
       Values { "Unknown", "OK", "Degraded/Warning",
          "Minor failure", "Major failure", "Critical failure",
          "Non-recoverable error", "DMTF Reserved" }]
   uint16 HealthState;

The entries in the square brackets that precede the property are called "Qualifiers," which are typed metadata in CIM. This particular property has three qualifiers: "Description," "ValueMap," and "Values." Let's focus on the latter two. What this pair of qualifiers tells us is that we can interpret numeric values as the logical equivalent of the English strings captured in the Values array. The two arrays are correlated by position, so "0" maps to "Unknown," "5" maps to "OK," and so on. Let's look at an example sensor using the same algorithm from the code we looked at last time. (Hint: to try this out, just change the classname to CIM_Sensor in the previous posting's code.)

OMC_NumericSensor

                      BaseUnits = 2

                        Caption = System Board 1 Ambient Temp

              CreationClassName = OMC_NumericSensor

                 CurrentReading = 4100

                   CurrentState = Normal

                       DeviceID = 50.0.32.99

                    ElementName = System Board 1 Ambient Temp

                 EnabledDefault = 2

                   EnabledState = 2

              EnabledThresholds = [1L, 5L]

                    HealthState = 5

                     Hysteresis = 0

                 IpmiSensorType = 1

                       IsLinear = False

                    MaxReadable = 12700

                    MinReadable = -12800

              MonitoredDeviceId = 7.1

                           Name = Ambient Temp(50.0.32.99)

                 NominalReading = 4200

              OperationalStatus = [2L]

                PollingInterval = 15000000000

                 PossibleStates = [u'Lower Critical', u'Lower Fatal', u'Lower Non-Critical', u'Normal', u'Unknown', u'Upper Critical', u'Upper Fatal', u'Upper Non-Critical']

                      RateUnits = 0

                 RequestedState = 12

                     SensorType = 2

             SettableThresholds = [1L, 5L]

            SupportedThresholds = [1L, 5L]

        SystemCreationClassName = OMC_UnitaryComputerSystem

                     SystemName = 602cfd62-cdef-3922-bf8d-6c4a0cf42058

           TimeOfCurrentReading = 20100511100502.000000+000

           TransitioningToState = 12

                   UnitModifier = -2

            UpperThresholdFatal = 7000

      UpperThresholdNonCritical = 5500
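One detail in the listing above is easy to misread: the readings are scaled integers. In the CIM_NumericSensor class, UnitModifier is the power of ten applied to every value the sensor reports, so a CurrentReading of 4100 with a UnitModifier of -2 works out to 41.00 in the sensor's BaseUnits. Here is a minimal sketch of that arithmetic, with the numbers copied by hand from the listing; the helper name scaled_reading is made up for this illustration and is not part of the sample code.

```python
from decimal import Decimal

def scaled_reading(raw, unit_modifier):
    """Scale a raw sensor value by 10 ** unit_modifier (hypothetical helper)."""
    # Decimal.scaleb shifts the exponent, which avoids float rounding noise
    return Decimal(raw).scaleb(unit_modifier)

# Values copied by hand from the OMC_NumericSensor listing above
sensor = {
    "CurrentReading": 4100,
    "UpperThresholdNonCritical": 5500,
    "UpperThresholdFatal": 7000,
    "UnitModifier": -2,
}

for name in ("CurrentReading", "UpperThresholdNonCritical", "UpperThresholdFatal"):
    print("%s = %s" % (name, scaled_reading(sensor[name], sensor["UnitModifier"])))
```

Read this way, the ambient temperature is 41 degrees against a non-critical threshold of 55 and a fatal threshold of 70.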

You'll notice that HealthState is reported as "5," which, if we look back at the MOF, represents the value "OK." Now let's look at a little snippet of code that can help output these properties in a more human-readable form.

# Dictionary to cache class metadata
classData = {}

def friendlyValue(client, instance, propertyName):
   # Start out with a default empty string, in case we don't have a mapping
   mapping = ''
   if instance.classname not in classData:
      # Fetch the class metadata if we don't already have it in the cache
      classData[instance.classname] = client.GetClass(instance.classname, IncludeQualifiers=True)
   myClass = classData[instance.classname]
   # A property missing from the class metadata simply gets no mapping
   if propertyName not in myClass.properties:
      return mapping
   # Now scan through the qualifiers to look for ValueMap/Values sets
   qualifiers = myClass.properties[propertyName].qualifiers
   if 'ValueMap' in qualifiers and 'Values' in qualifiers:
      vals = qualifiers['Values'].value
      valmap = qualifiers['ValueMap'].value
      value = instance[propertyName]
      # Find the matching value and convert to the friendly string.
      # Walk the whole ValueMap; stopping at len(valmap)-1 would skip its last entry.
      for i in range(len(valmap)):
         if str(valmap[i]) == str(value):
            mapping = ' (' + vals[i] + ')'
            break
   return mapping
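The routine above matches on string equality, which covers the explicit entries in a ValueMap. The HealthState ValueMap shown earlier also ends with "..", which claims every value not listed explicitly, and other classes use bounded ranges such as "32768..65535" for vendor-reserved values. The sketch below is one way such entries might be handled. It runs without a CIM server because the two arrays are copied by hand from the MOF excerpt; the name lookup_value is hypothetical, and the code assumes integer-valued properties with decimal ValueMap entries.

```python
# Copied by hand from the HealthState qualifiers in the MOF excerpt
HEALTH_VALUEMAP = ["0", "5", "10", "15", "20", "25", "30", ".."]
HEALTH_VALUES = ["Unknown", "OK", "Degraded/Warning",
                 "Minor failure", "Major failure", "Critical failure",
                 "Non-recoverable error", "DMTF Reserved"]

def lookup_value(valuemap, values, value):
    """Return the Values label for an integer property value, or None."""
    number = int(value)
    catch_all = None
    for key, label in zip(valuemap, values):
        low, sep, high = key.partition("..")
        if not sep:
            # Plain entry such as "5": match on numeric equality
            if int(key) == number:
                return label
        elif not low and not high:
            # Bare "..": remember it, but keep looking for a specific match
            catch_all = label
        elif (not low or int(low) <= number) and (not high or number <= int(high)):
            # Bounded or half-open range such as "32768..65535" or "32768.."
            return label
    return catch_all

print(lookup_value(HEALTH_VALUEMAP, HEALTH_VALUES, 5))
print(lookup_value(HEALTH_VALUEMAP, HEALTH_VALUES, 7))
```

With these arrays, 5 resolves to "OK", while 7 has no explicit entry and so falls through to the ".." entry, "DMTF Reserved".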

Now if we call that new routine from our printInstance routine when we display property values, we can display friendly values for these properties as well.  Here's the same instance from above with the new logic in place.

OMC_NumericSensor

                     BaseUnits = 2 (Degrees C)

                       Caption = System Board 1 Ambient Temp

             CreationClassName = OMC_NumericSensor

                CurrentReading = 4200

                  CurrentState = Normal

                      DeviceID = 50.0.32.99

                   ElementName = System Board 1 Ambient Temp

                EnabledDefault = 2 (Enabled)

                  EnabledState = 2 (Enabled)

             EnabledThresholds = [1L, 5L]

                   HealthState = 5 (OK)

                    Hysteresis = 0

                IpmiSensorType = 1

                      IsLinear = False

                   MaxReadable = 12700

                   MinReadable = -12800

             MonitoredDeviceId = 7.1

                          Name = Ambient Temp(50.0.32.99)

                NominalReading = 4200

             OperationalStatus = [2L]

               PollingInterval = 15000000000

                PossibleStates = [u'Lower Critical', u'Lower Fatal', u'Lower Non-Critical', u'Normal', u'Unknown', u'Upper Critical', u'Upper Fatal', u'Upper Non-Critical']

                     RateUnits = 0 (None)

                RequestedState = 12 (Not Applicable)

                    SensorType = 2 (Temperature)

            SettableThresholds = [1L, 5L]

           SupportedThresholds = [1L, 5L]

       SystemCreationClassName = OMC_UnitaryComputerSystem

                    SystemName = 602cfd62-cdef-3922-bf8d-6c4a0cf42058

          TimeOfCurrentReading = 20100511105752.000000+000

          TransitioningToState = 12

                  UnitModifier = -2

           UpperThresholdFatal = 7000

     UpperThresholdNonCritical = 5500

…and here's an example output of a system with two power supplies where only one is plugged in.

OMC_PowerSupply

                  Availability = 3 (Running/Full Power)

                       Caption = Power Supply 1

             CreationClassName = OMC_PowerSupply

                   Description = Power Supply 1

                      DeviceID = 10.1

                   ElementName = Power Supply 1

                EnabledDefault = 2 (Enabled)

                  EnabledState = 2 (Enabled)

                   HealthState = 5 (OK)

                          Name = Power Supply 1

             OperationalStatus = [2L]

      Range1InputFrequencyHigh = 63

       Range1InputFrequencyLow = 47

        Range1InputVoltageHigh = 26400

         Range1InputVoltageLow = 9000

        Range2InputVoltageHigh = 0

         Range2InputVoltageLow = 0

                RequestedState = 12 (Not Applicable)

       SystemCreationClassName = OMC_UnitaryComputerSystem

                    SystemName = 44454c4c-5a00-1039-8058-c8c04f444431

              TotalOutputPower = 930000

          TransitioningToState = 12

OMC_PowerSupply

                  Availability = 8 (Off Line)

                       Caption = Power Supply 2

             CreationClassName = OMC_PowerSupply

                   Description = Power Supply 2

                      DeviceID = 10.2

                   ElementName = Power Supply 2

                EnabledDefault = 2 (Enabled)

                  EnabledState = 3 (Disabled)

                   HealthState = 30 (Non-recoverable error)

                          Name = Power Supply 2

             OperationalStatus = [6L, 16L]

      Range1InputFrequencyHigh = 63

       Range1InputFrequencyLow = 47

        Range1InputVoltageHigh = 26400

         Range1InputVoltageLow = 9000

        Range2InputVoltageHigh = 0

         Range2InputVoltageLow = 0

                RequestedState = 12 (Not Applicable)

       SystemCreationClassName = OMC_UnitaryComputerSystem

                    SystemName = 44454c4c-5a00-1039-8058-c8c04f444431

              TotalOutputPower = 930000

          TransitioningToState = 12
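Notice that OperationalStatus is still shown as a raw list in these listings. friendlyValue compares the string form of the whole property value against each ValueMap entry, so an array-valued property never matches. A minimal sketch of mapping each element separately follows. The three labels are a hand-copied subset of the OperationalStatus value map in the DMTF CIM schema (an assumption worth checking against the MOF on your system), and friendly_list is a hypothetical helper name; real code would read the labels from the class qualifiers as friendlyValue does.

```python
# Hand-copied subset of the CIM OperationalStatus value map (assumption:
# taken from the DMTF schema, not fetched from the server's class qualifiers)
OPERATIONAL_STATUS = {
    2: "OK",
    6: "Error",
    16: "Supporting Entity in Error",
}

def friendly_list(values, labels):
    """Annotate each element of an array-valued property (hypothetical helper)."""
    result = []
    for value in values:
        if value in labels:
            result.append("%s (%s)" % (value, labels[value]))
        else:
            # No label known: fall back to the raw number
            result.append(str(value))
    return result

# OperationalStatus of the two power supplies in the listings above
print(friendly_list([2], OPERATIONAL_STATUS))
print(friendly_list([6, 16], OPERATIONAL_STATUS))
```

For the unplugged supply this yields "6 (Error)" and "16 (Supporting Entity in Error)", which fits a power supply with no input power.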

Attached you'll find the complete sample which displays sensors and power supplies.