Hardware Health Monitoring via CIM, part 5

In our last post, we looked at the IPMI SEL and how you can use that to gather events recorded by the hardware.  In this installment, we'll take a look at a different sub-system – storage.  In particular, we'll look at internal RAID card support so you can see the health of your RAID volumes and detect when disk drives fail.

At present, there are a few different RAID cards that are instrumented via the CIM code available from VMware and our partners, and that list will continue to grow over time.  Starting in version 3.5, VMware developed a rudimentary CIM provider for LSI MegaRAID and IR cards that implements a small portion of the SNIA SMI-S Host Hardware RAID Controller (HHRC) Profile.  The intention of that implementation was simply to report the health of the RAID volumes on the system, along with very basic configuration information, so users can take preventative action when a drive or component fails, before actual data loss occurs.  Subsequent to 3.5, LSI developed their own implementation of the HHRC profile, which expands on those capabilities to include creating RAID volumes and performing other operations on the card.  HP has also implemented an HHRC provider for their Smart Array internal RAID controllers.  In this post, we'll focus on the basic read-only aspects, and we'll come back to some of the advanced capabilities of the LSI provider in a later post.

Before we dive into code and look at the results on a few different systems, we need to talk a little about another CIM concept called "namespace."  In CIM, the classes and instances you access exist within a namespace.  There are two "well known" namespaces, "root/interop" and "root/cimv2": the former is used to advertise profiles (we'll touch on that in more detail in a later post) along with other core infrastructure information, while the latter is the default namespace for the system's instrumentation.  In VMware's case, our primary implementations live in root/cimv2, and if you look back at the prior examples, they all hard-coded that namespace.  When another vendor adds new functionality to the system, they typically place their implementation into a different namespace so there isn't any collision with the core system instrumentation.  There's an easy way to discover all the namespaces on the system: just query for the CIM_Namespace class in the root/interop namespace.  Here's a short excerpt of code that does that:

import pywbem

def getNamespaces(server, username, password):
   # Namespaces are advertised as CIM_Namespace instances in root/interop
   client = pywbem.WBEMConnection('https://' + server,
                                  (username, password),
                                  'root/interop')
   # Only ask for the Name property so nothing else is sent over the wire
   instances = client.EnumerateInstances('CIM_Namespace', PropertyList=['Name'])
   return set(map(lambda x: x['Name'], instances))

One little trick we're using in the above code is an optional parameter to EnumerateInstances called "PropertyList", which allows you to filter which properties come back in the results to speed things up.  Since we know we only care about the Name property, we can skip all the others so they don't have to be sent over the network.  Now if you run this on an ESXi 4.0 system, you might see something like the following (output will vary depending on what add-on VIBs you've installed on your system):

>>> getNamespaces(server, username, password)
set([u'elxhbacmpi/cimv2', u'interop', u'root/hpq', u'root/cimv2', u'vmware/esxv2', u'lsi/lsimr12', u'root/config', u'root/interop', u'qlogic/cimv2'])
>>> 

The primary class we're going to look at, which carries the most important RAID volume information, is CIM_StorageVolume.  The VMware-developed LSI MR/IR implementation exists within the root/cimv2 namespace, while the LSI-developed provider exists in an LSI-specific, versioned namespace (lsi/lsimr12 on the example system above) and the HP implementation exists in root/hpq.  Let's look at example output, first on an LSI-based system using the VMware implementation:

VMware_HHRCStorageVolume
                     BlockSize = 512
                       Caption = RAID 1 Logical Volume 0 on controller 0, Drives(0e8,1e8)  – OPTIMAL
                      CardType = 2
              ConsumableBlocks = 140623872
             CreationClassName = VMware_HHRCStorageVolume
                DataRedundancy = 1
              DeltaReservation = 100
                      DeviceID = vmwStorageVolume0_0
                   ElementName = RAID 1 Logical Volume 0 on controller 0, Drives(0e8,1e8)  – OPTIMAL
                EnabledDefault = 2 (Enabled)
                  EnabledState = 2 (Enabled)
                  ExtentStatus = [2L]
                   HealthState = 5 (OK)
 IsBasedOnUnderlyingRedundancy = True
                   IsComposite = True
                          Name = 60 06 05 b0 00 8f d8 e0
                    NameFormat = 2 (VPD83NAA6)
                 NameNamespace = 2 (VPD83Type3)
        NoSinglePointOfFailure = True
                NumberOfBlocks = 140623872
             OperationalStatus = [2L]
                    Primordial = False
                RequestedState = 11 (Reset)
       SystemCreationClassName = OMC_UnitaryComputerSystem
                    SystemName = 602cfd62-cdef-3922-bf8d-6c4a0cf42058
          TransitioningToState = 12
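Before we move on to the HP system, here's a rough sketch of the kind of query that produces a dump like the one above.  The helper below is mine, not the attached script (in particular, it doesn't translate enumerated values into their text labels the way the output above does); it simply enumerates CIM_StorageVolume in whatever namespace you hand it and prints each non-empty property:

import pywbem

def dumpStorageVolumes(server, username, password, namespace):
   # Each provider returns its own subclass of CIM_StorageVolume
   # (VMware_HHRCStorageVolume, SMX_SAStorageVolume, and so on)
   client = pywbem.WBEMConnection('https://' + server,
                                  (username, password),
                                  namespace)
   for volume in client.EnumerateInstances('CIM_StorageVolume'):
      print(volume.classname)
      for name, value in sorted(volume.items()):
         if value is not None:
            print('%30s = %s' % (name, value))

For the VMware implementation you'd point it at root/cimv2, and for the vendor providers at lsi/lsimr12 or root/hpq from the namespace listing earlier.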

Now let's look at example output from the HP implementation on another system:

SMX_SAStorageVolume
                   Accelerator = 0 (Unknown)
                     BlockSize = 512
              ConsumableBlocks = 0
             CreationClassName = SMX_SAStorageVolume
                DataRedundancy = 2
                    Deleteable = True
              DeltaReservation = 0
                      DeviceID = SMX_SAStorageVolume-600508b1001031353120202020200003
                   ElementName = Logical Volume 0 (RAID 1)
                EnabledDefault = 2 (Enabled)
                  EnabledState = 5 (Not Applicable)
                  ExtentStatus = [2L]
                FaultTolerance = 2 (RAID 1)
                   HealthState = 5 (OK)
 IsBasedOnUnderlyingRedundancy = False
                   IsComposite = True
                          Name = 600508b1001031353120202020200003
                    NameFormat = 2 (VPD83NAA6)
        NoSinglePointOfFailure = True
                NumberOfBlocks = 143305920
                        OSName = Unknown
             OperationalStatus = [2L]
             PackageRedundancy = 1
                    Primordial = False
                RequestedState = 12 (Not Applicable)
            StatusDescriptions = [u'Logical Disk is operating properly']
                    StripeSize = 131072
       SystemCreationClassName = SMX_SAArraySystem
                    SystemName = QT79MU0151
          TransitioningToState = 12

To make it interesting, let's pull a drive from the RAID volumes on these two systems to simulate a drive failure, and see what the results look like:

VMware_HHRCStorageVolume
                     BlockSize = 512
                       Caption = RAID 1 Logical Volume 0 on controller 0, Drives(0e8,?)  – DEGRADED
                      CardType = 2
              ConsumableBlocks = 140623872
             CreationClassName = VMware_HHRCStorageVolume
                DataRedundancy = 2
              DeltaReservation = 100
                      DeviceID = vmwStorageVolume0_0
                   ElementName = RAID 1 Logical Volume 0 on controller 0, Drives(0e8,?)  – DEGRADED
                EnabledDefault = 2 (Enabled)
                  EnabledState = 2 (Enabled)
                  ExtentStatus = [3L]
                   HealthState = 10 (Degraded/Warning)
 IsBasedOnUnderlyingRedundancy = True
                   IsComposite = True
                          Name = 60 06 05 b0 00 8f d8 e0
                    NameFormat = 2 (VPD83NAA6)
                 NameNamespace = 2 (VPD83Type3)
        NoSinglePointOfFailure = True
                NumberOfBlocks = 140623872
             OperationalStatus = [3L]
                    Primordial = False
                RequestedState = 11 (Reset)
       SystemCreationClassName = OMC_UnitaryComputerSystem
                    SystemName = 602cfd62-cdef-3922-bf8d-6c4a0cf42058
          TransitioningToState = 12

SMX_SAStorageVolume
                   Accelerator = 1 (Enabled)
                     BlockSize = 512
              ConsumableBlocks = 0
             CreationClassName = SMX_SAStorageVolume
                DataRedundancy = 2
                    Deleteable = True
              DeltaReservation = 0
                      DeviceID = SMX_SAStorageVolume-600508b1001031353120202020200003
                   ElementName = Logical Volume 0 (RAID 1)
                EnabledDefault = 2 (Enabled)
                  EnabledState = 5 (Not Applicable)
                  ExtentStatus = [11L]
                FaultTolerance = 2 (RAID 1)
                   HealthState = 10 (Degraded/Warning)
 IsBasedOnUnderlyingRedundancy = False
                   IsComposite = True
                          Name = 600508b1001031353120202020200003
                    NameFormat = 2 (VPD83NAA6)
        NoSinglePointOfFailure = True
                NumberOfBlocks = 143305920
                        OSName = Unknown
             OperationalStatus = [3L, 32772L]
             PackageRedundancy = 1
                    Primordial = False
                RequestedState = 12 (Not Applicable)
            StatusDescriptions = [u'Logical Disk is degraded', u'Logical Disk Extended Status']
                    StripeSize = 131072
       SystemCreationClassName = SMX_SAArraySystem
                    SystemName = QT79MU0151
          TransitioningToState = 12

You'll notice that the HealthState of both volumes now shows 10 (Degraded/Warning); that makes HealthState the best single field to monitor if you want to script something on top of this.
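As a minimal sketch of what such a script might look like (reusing the getNamespaces function from earlier; the checkVolumeHealth name and its error handling are mine, not taken from the attached script), you could walk every namespace and flag any volume whose HealthState isn't 5 (OK):

import pywbem

def checkVolumeHealth(server, username, password):
   degraded = []
   for namespace in getNamespaces(server, username, password):
      client = pywbem.WBEMConnection('https://' + server,
                                     (username, password),
                                     namespace)
      try:
         volumes = client.EnumerateInstances('CIM_StorageVolume',
                                             PropertyList=['ElementName', 'HealthState'])
      except pywbem.CIMError:
         continue   # no storage volume instrumentation in this namespace
      for volume in volumes:
         # HealthState 5 is OK; 10 is Degraded/Warning, larger values are worse
         if volume['HealthState'] != 5:
            degraded.append((namespace, volume['ElementName'], volume['HealthState']))
   return degraded

Running this against the two failed-drive systems above would return the degraded volumes along with the namespaces they were found in, which a cron job or monitoring agent could alert on.  Keep in mind that the same volume can show up more than once if both the VMware provider and a vendor provider instrument the same controller, as with the LSI card here.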

The complete script used for this post is attached here.