In our last post, we looked at the IPMI SEL and how you can use it to gather events recorded by the hardware. In this installment, we'll take a look at a different subsystem: storage. In particular, we'll look at internal RAID card support, so you can see the health of your RAID volumes and detect when disk drives fail.
At present, a few different RAID cards are instrumented via the CIM code available from VMware and our partners, and that list will continue to grow over time. Starting in version 3.5, VMware developed a rudimentary CIM provider for LSI MegaRAID and IR cards that implements a small portion of the SNIA SMI-S Host Hardware RAID Controller (HHRC) Profile. The intent of this implementation was simply to report the health of the RAID volumes on the system, along with very basic configuration information, so users can take preventive action when a drive or component fails, before actual data loss occurs. Subsequent to 3.5, LSI developed their own implementation of the HHRC profile, which expands on those capabilities to include creating RAID volumes and performing other operations on the card. HP has also implemented an HHRC provider for their Smart Array internal RAID controllers. In this post, we'll focus on the basic read-only aspects, and we'll come back to some of the advanced capabilities of the LSI provider in a later post.
Before we dive into code and look at the results on a few different systems, we need to talk a little about another CIM concept: the namespace. In CIM, the classes and instances you access exist within a namespace. There are two "well known" namespaces, "root/interop" and "root/cimv2": the former is used to advertise profiles (we'll touch on that in more detail in a later post) along with other core infrastructure information, and the latter is the default implementation namespace for the system. In VMware's case, our primary implementations exist in root/cimv2, and if you look back at the prior examples, they all hard-coded that namespace. When another vendor adds new functionality to the system, they typically place that implementation in a different namespace so there isn't any collision with the core system instrumentation. There's an easy way to discover all the namespaces on the system: just query for the CIM_Namespace class in the root/interop namespace. Here's a short excerpt of code that does that:
import pywbem

def getNamespaces(server, username, password):
    # Connect to the interop namespace, where the available namespaces are advertised
    client = pywbem.WBEMConnection('https://' + server,
                                   (username, password),
                                   'root/interop')
    # Only the Name property is needed, so filter with PropertyList
    instances = client.EnumerateInstances('CIM_Namespace',
                                          PropertyList=['Name'])
    return set(inst['Name'] for inst in instances)
One little trick we're using in the above code is an optional parameter to EnumerateInstances called "PropertyList", which allows you to filter which properties come back in the results to speed things up. Since we know we only care about the Name property, we can skip all the others so they don't have to be sent over the network. Now if you run this on an ESXi 4.0 system, you might see something like the following (the output will vary depending on what add-on VIBs you've installed on your system):
>>> getNamespaces(server, username, password)
set([u'elxhbacmpi/cimv2', u'interop', u'root/hpq', u'root/cimv2', u'vmware/esxv2', u'lsi/lsimr12', u'root/config', u'root/interop', u'qlogic/cimv2'])
>>>
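The rest of this post pulls instances of a single class out of whichever namespace a particular vendor uses, so it's handy to have a small helper for that. Here's a minimal sketch (the name getInstances is ours, not part of the attached script) that simply mirrors the connection code above, with the namespace and class name passed in as parameters:

import pywbem

def getInstances(server, username, password, namespace, classname):
    # Connect to the CIM broker on the host, scoped to the requested namespace
    client = pywbem.WBEMConnection('https://' + server,
                                   (username, password),
                                   namespace)
    # Return every instance of the requested class in that namespace
    return client.EnumerateInstances(classname)

For example, getInstances(server, username, password, 'root/cimv2', 'CIM_StorageVolume') would return the VMware-provided RAID volume instances we'll look at next.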
The primary class that we're going to look at, which carries the most important RAID volume information, is CIM_StorageVolume. The VMware-developed LSI MR/IR implementation exists within the root/cimv2 namespace, while the LSI-developed provider exists in an LSI-specific, versioned namespace (lsi/lsimr12 on the example system above), and the HP implementation exists in root/hpq. Let's look at some example output, first on an LSI-based system using the VMware implementation:
VMware_HHRCStorageVolume
BlockSize = 512
Caption = RAID 1 Logical Volume 0 on controller 0, Drives(0e8,1e8) – OPTIMAL
CardType = 2
ConsumableBlocks = 140623872
CreationClassName = VMware_HHRCStorageVolume
DataRedundancy = 1
DeltaReservation = 100
DeviceID = vmwStorageVolume0_0
ElementName = RAID 1 Logical Volume 0 on controller 0, Drives(0e8,1e8) – OPTIMAL
EnabledDefault = 2 (Enabled)
EnabledState = 2 (Enabled)
ExtentStatus = [2L]
HealthState = 5 (OK)
IsBasedOnUnderlyingRedundancy = True
IsComposite = True
Name = 60 06 05 b0 00 8f d8 e0
NameFormat = 2 (VPD83NAA6)
NameNamespace = 2 (VPD83Type3)
NoSinglePointOfFailure = True
NumberOfBlocks = 140623872
OperationalStatus = [2L]
Primordial = False
RequestedState = 11 (Reset)
SystemCreationClassName = OMC_UnitaryComputerSystem
SystemName = 602cfd62-cdef-3922-bf8d-6c4a0cf42058
TransitioningToState = 12
Now let's look at example output from the HP implementation on another system:
SMX_SAStorageVolume
Accelerator = 0 (Unknown)
BlockSize = 512
ConsumableBlocks = 0
CreationClassName = SMX_SAStorageVolume
DataRedundancy = 2
Deleteable = True
DeltaReservation = 0
DeviceID = SMX_SAStorageVolume-600508b1001031353120202020200003
ElementName = Logical Volume 0 (RAID 1)
EnabledDefault = 2 (Enabled)
EnabledState = 5 (Not Applicable)
ExtentStatus = [2L]
FaultTolerance = 2 (RAID 1)
HealthState = 5 (OK)
IsBasedOnUnderlyingRedundancy = False
IsComposite = True
Name = 600508b1001031353120202020200003
NameFormat = 2 (VPD83NAA6)
NoSinglePointOfFailure = True
NumberOfBlocks = 143305920
OSName = Unknown
OperationalStatus = [2L]
PackageRedundancy = 1
Primordial = False
RequestedState = 12 (Not Applicable)
StatusDescriptions = [u'Logical Disk is operating properly']
StripeSize = 131072
SystemCreationClassName = SMX_SAArraySystem
SystemName = QT79MU0151
TransitioningToState = 12
To make it interesting, let's pull a drive from the RAID volume on each of these two systems to simulate a drive failure and see what the results look like:
VMware_HHRCStorageVolume
BlockSize = 512
Caption = RAID 1 Logical Volume 0 on controller 0, Drives(0e8,?) – DEGRADED
CardType = 2
ConsumableBlocks = 140623872
CreationClassName = VMware_HHRCStorageVolume
DataRedundancy = 2
DeltaReservation = 100
DeviceID = vmwStorageVolume0_0
ElementName = RAID 1 Logical Volume 0 on controller 0, Drives(0e8,?) – DEGRADED
EnabledDefault = 2 (Enabled)
EnabledState = 2 (Enabled)
ExtentStatus = [3L]
HealthState = 10 (Degraded/Warning)
IsBasedOnUnderlyingRedundancy = True
IsComposite = True
Name = 60 06 05 b0 00 8f d8 e0
NameFormat = 2 (VPD83NAA6)
NameNamespace = 2 (VPD83Type3)
NoSinglePointOfFailure = True
NumberOfBlocks = 140623872
OperationalStatus = [3L]
Primordial = False
RequestedState = 11 (Reset)
SystemCreationClassName = OMC_UnitaryComputerSystem
SystemName = 602cfd62-cdef-3922-bf8d-6c4a0cf42058
TransitioningToState = 12
SMX_SAStorageVolume
Accelerator = 1 (Enabled)
BlockSize = 512
ConsumableBlocks = 0
CreationClassName = SMX_SAStorageVolume
DataRedundancy = 2
Deleteable = True
DeltaReservation = 0
DeviceID = SMX_SAStorageVolume-600508b1001031353120202020200003
ElementName = Logical Volume 0 (RAID 1)
EnabledDefault = 2 (Enabled)
EnabledState = 5 (Not Applicable)
ExtentStatus = [11L]
FaultTolerance = 2 (RAID 1)
HealthState = 10 (Degraded/Warning)
IsBasedOnUnderlyingRedundancy = False
IsComposite = True
Name = 600508b1001031353120202020200003
NameFormat = 2 (VPD83NAA6)
NoSinglePointOfFailure = True
NumberOfBlocks = 143305920
OSName = Unknown
OperationalStatus = [3L, 32772L]
PackageRedundancy = 1
Primordial = False
RequestedState = 12 (Not Applicable)
StatusDescriptions = [u'Logical Disk is degraded', u'Logical Disk Extended Status']
StripeSize = 131072
SystemCreationClassName = SMX_SAArraySystem
SystemName = QT79MU0151
TransitioningToState = 12
You'll notice that the HealthState of both volumes now shows 10 (Degraded/Warning), which is the best field to monitor if you want to script something on top of this.
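As a rough illustration (this is our own sketch, not the attached script), a health check built on that idea might walk every namespace returned by getNamespaces, enumerate CIM_StorageVolume in each one, and flag any volume whose HealthState is no longer 5 (OK):

import pywbem

def findDegradedVolumes(server, username, password):
    HEALTH_OK = 5
    degraded = []
    # Check every namespace advertised by the system
    for ns in getNamespaces(server, username, password):
        client = pywbem.WBEMConnection('https://' + server,
                                       (username, password), ns)
        try:
            volumes = client.EnumerateInstances('CIM_StorageVolume',
                                                PropertyList=['ElementName',
                                                              'HealthState'])
        except pywbem.CIMError:
            # This namespace doesn't implement CIM_StorageVolume; skip it
            continue
        for vol in volumes:
            if vol['HealthState'] != HEALTH_OK:
                degraded.append((ns, vol['ElementName'], vol['HealthState']))
    return degraded

Namespaces without any storage instrumentation are simply skipped, and note that on a system where both the VMware and LSI providers are installed, the same physical volume may show up in more than one namespace.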
The complete script used for this post is attached here.