
Hardware Health Monitoring via CIM, part 7

 

In this installment, we'll look at one of the most complex, and yet perhaps most powerful, aspects of CIM: asynchronous events, or as they're officially named, CIM Indications.  In this post we'll talk a little about how the plumbing works to subscribe for and receive indications, and then look at a real-life event using the included example code.  With these examples, you should be able to quickly build up your own custom datacenter monitoring solutions for hardware health event detection.

The exact mechanism of subscribing for and receiving events in CIM depends on which wire protocol you're using.  In this example, since I'm using CIM XML, we'll focus on how that plumbing works.  If you use WS-Management, then there's a slightly different approach.  First, let's look at the theory behind how CIM XML-based indications work.

With indications, the consumer first has to subscribe for indications and tell the server what sorts of events they're interested in.  As part of that subscription, the consumer has to tell the server where to send the indications, which we call the listener.  In the CIM XML model, the listener is really nothing more than a primitive web server that can handle HTTP POST operations, and parse and respond to CIM XML payloads.  So in essence, the client becomes the server, and the actual server becomes the client when delivering events.  When an event occurs on the server that matches the subscription, the server will open an outgoing socket to the listener's address and port number, and send the indication encoded in CIM XML over an HTTP POST operation.  This of course means the model doesn't work if there's a firewall between the client and server, unless you jump through hoops setting up proxies or port forwarding.
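To make the "primitive web server" idea concrete, here's a minimal sketch using only the Python 3 standard library. It is not the listener attached to this post: the names parse_indication, IndicationHandler and EXPORT_RESPONSE are made up for illustration, the acknowledgement body is an assumed minimal form of the CIM XML export response, and only simple scalar PROPERTY elements are handled.

```python
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler

# Assumed minimal acknowledgement; verify the exact form your CIMOM expects.
EXPORT_RESPONSE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<CIM CIMVERSION="2.0" DTDVERSION="2.0">'
    '<MESSAGE ID="{msgid}" PROTOCOLVERSION="1.0">'
    '<SIMPLEEXPRSP><EXPMETHODRESPONSE NAME="ExportIndication"/></SIMPLEEXPRSP>'
    '</MESSAGE></CIM>'
)


def parse_indication(xml_payload):
    """Return (message_id, classname, {property: text}) from an export request.

    Only scalar PROPERTY elements are read; arrays and references are skipped.
    """
    root = ET.fromstring(xml_payload)
    message = root.find("MESSAGE")
    msgid = message.get("ID") if message is not None else None
    instance = root.find(".//INSTANCE")
    if instance is None:
        raise ValueError("no INSTANCE element in payload")
    props = {}
    for prop in instance.findall("PROPERTY"):
        value = prop.find("VALUE")
        props[prop.get("NAME")] = value.text if value is not None else None
    return msgid, instance.get("CLASSNAME"), props


class IndicationHandler(BaseHTTPRequestHandler):
    """Handle the HTTP POST that the CIM server sends for each indication."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        msgid, classname, props = parse_indication(self.rfile.read(length))
        print(classname, props)
        reply = EXPORT_RESPONSE.format(msgid=msgid).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", 'application/xml; charset="utf-8"')
        self.send_header("CIMExport", "MethodResponse")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)
```

To try it, you could pass IndicationHandler to http.server.HTTPServer bound to a port of your choice and call serve_forever(). A production listener would also want error handling for malformed payloads.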

At present, pywbem does not include a generic listener, but that's OK because Python has all the primitives we need to build one with ease.  In under 150 lines of Python we can whip up a fairly functional indication listener.  I'm not going to devote space in this post to describing exactly how the listener is implemented; you can read through the attached code in detail if you like.  The code is written to work either standalone, where it will just dump out the indications it receives to the terminal, or you can import it into another script and wire up callbacks that will be called when an indication is received. If you're comfortable writing your own Python scripts, then leveraging this listener would look something like this:

import threading

import indicationListener

portnumber = 1234  # Pick any unused port number

def myCallbackRoutine(listener, instance):
    # Do something interesting here with the indication instance
    pass

indicationListener.SimpleCIMXMLListener.registerCallback(myCallbackRoutine)

# Run the listener in a background thread so the script can carry on
t = threading.Thread(target=indicationListener.runListenerForever, args=[portnumber])
t.start()

 

Once we have a listener working on a known port, we have to subscribe.  Subscription in CIM XML requires creating instances of two endpoint classes (a handler and a filter) and an association (the subscription) that links them.  I've attached another Python module to this post that does all the heavy lifting for you, so all you have to do is specify some arguments and it will take care of the rest.  Again, like the listener, you can either run this as a stand-alone tool, or import it and run it within your own scripts. Continuing from the snippet of code above, if we wanted to import this as a module and use it within another script, the code might look something like this:

import pywbem

import indicationSubscriber

myhostname = 'xxx'  # replace with local host name (name of listener)
classname = 'VMware_CIMHeartbeat'

client = pywbem.WBEMConnection('https://myserver', ('root', 'password'), 'root/cimv2')

# portnumber is the listener port chosen in the previous snippet
indicationSubscriber.subscribe(client, 'myserver', classname, myhostname, portnumber)

 

And you should start seeing a steady stream of "heartbeats" from the ESX system.  If you call the subscribe routine with an additional "True" at the end, it will only unsubscribe; by default it unsubscribes first to clear out any old subscriptions, and then subscribes.  (Note: These heartbeat indications are really only meant to be a debugging/setup aid.  I wouldn't recommend using them as a real-life heartbeat, as the frequency is too high and will cause network load issues if you have a large number of hosts.)
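For a rough picture of what a subscription involves, the three instances can be sketched as plain dictionaries. This is an illustration only, not what the attached subscriber module necessarily does: describe_subscription and the "example" names are hypothetical, the class names come from the standard DMTF schema (an implementation may use a different handler class), and "WQL" as the query language is an assumption.

```python
def describe_subscription(classname, listener_host, listener_port,
                          namespace="root/cimv2", name="example"):
    """Return plain-dict descriptions of the handler, filter and subscription.

    Nothing is sent to a server; this only shows which pieces exist and how
    they refer to each other.
    """
    # Endpoint 1: where the server should deliver indications.
    handler = {
        "class": "CIM_ListenerDestinationCIMXML",
        "properties": {
            "Name": name + "-handler",
            "Destination": "http://%s:%d/" % (listener_host, listener_port),
        },
    }
    # Endpoint 2: which indications we are interested in.
    indication_filter = {
        "class": "CIM_IndicationFilter",
        "properties": {
            "Name": name + "-filter",
            "Query": "SELECT * FROM %s" % classname,
            "QueryLanguage": "WQL",  # assumption; depends on the CIMOM
            "SourceNamespace": namespace,
        },
    }
    # The association tying the two together. Real instances hold object
    # paths in Filter and Handler; names stand in for them here.
    subscription = {
        "class": "CIM_IndicationSubscription",
        "properties": {
            "Filter": indication_filter["properties"]["Name"],
            "Handler": handler["properties"]["Name"],
        },
    }
    return handler, indication_filter, subscription
```

Calling describe_subscription('VMware_CIMHeartbeat', 'listener', 2578) returns the three descriptions in the order they would need to be created, since the subscription refers to the other two.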

So now that we've got the plumbing out of the way, let's explore some "real" indications on the system.  One approach is to use formal documentation to figure out what classes to subscribe for.  To do this for the VMware implemented classes, take a look at the API Reference doc, click on "All Classes" in the upper left, and then search for classes with "Indication" in the classname in the lower left.  Alternatively, you can use a live system to figure this out on the fly.  Here's a simple block of code to display all the implementation classes that might be of interest given a starting classname (it hides all the superclasses that have some derived class):

def dumpLeafClasses(client, classname):
    # Fetch the whole subtree of classes below the starting class
    classes = client.EnumerateClasses(ClassName=classname, DeepInheritance=True)
    superclasses = set()
    allclasses = set()
    for theClass in classes:
        allclasses.add(theClass.classname)
        if theClass.superclass is not None:
            superclasses.add(theClass.superclass)
    # Leaf classes are the ones that no other class derives from
    return sorted(allclasses - superclasses)

 

If we call that with the class 'CIM_Indication' then we'll get all the indications that are available on the system, in the given namespace. Let's try this on a live system with a client connection pointed at the 'root/cimv2' namespace:

>>> dumpLeafClasses(client, 'CIM_Indication')

[u'OMC_IpmiAlertIndication', u'VMware_CIMHeartbeat', u'VMware_ConcreteJobCreation', u'VMware_ConcreteJobDeletion', u'VMware_ConcreteJobModification', u'VMware_HHRCAlertIndication', u'VMware_KernelIPChangedIndication']

>>> 

 

We could then do a GetClass on any of these to inspect their Description qualifiers to get some more details on what they're for.  The one that we'll look at in this post is the OMC_IpmiAlertIndication, which is sent when a change is detected in the IPMI subsystem.  In the sequence below, I'll use the CLI versions, but you can do the same via scripts as described above.
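The GetClass step mentioned above might be scripted along these lines. It's a sketch rather than part of the attached code: class_descriptions is a hypothetical helper, and it assumes the client is a pywbem WBEMConnection whose GetClass call returns a class object with a dictionary-like qualifiers attribute.

```python
def class_descriptions(client, classnames):
    """Map each class name to the text of its Description qualifier.

    Classes without a Description qualifier map to None.
    """
    descriptions = {}
    for name in classnames:
        # Qualifiers have to be requested explicitly
        cim_class = client.GetClass(name, IncludeQualifiers=True, LocalOnly=False)
        qualifier = cim_class.qualifiers.get("Description")
        descriptions[name] = qualifier.value if qualifier is not None else None
    return descriptions
```

Feeding it the list returned by dumpLeafClasses gives a quick summary of what each indication class is for.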

First we'll start the listener (if we don't specify a port number, it picks one at random):

% ./indicationListener.py 

Listener started on port 2578

 

Then we'll subscribe for the indications (using the port number it just gave us):

% ./indicationSubscriber.py -P 2578 -H listener -c OMC_IpmiAlertIndication -s esxbox -u root -p ''

Cleaning up old registrations

Could not delete subscription CIMError(5, u'Class not found')

Could not delete handler CIMError(6, u'The requested object could not be found')

Could not delete filter CIMError(6, u'The requested object could not be found')

Creating handler

Creating filter

Creating subscription

Done – events are enabled for listener:2578

%

 

You can safely ignore the "could not delete" warnings – that just tells us we didn't have a previous subscription from this host.

Then we'll cause something to happen on the system.  On many systems (Dell for example) there's usually a chassis intrusion sensor that will detect if you open the case.  I find that to be one of the simplest indications to get to fire if you want to test to make sure you have everything wired up correctly.  (If you have redundant power supplies, unplugging one of them from the wall is another easy test scenario.)  Within about 15 seconds (the interval at which the system scans IPMI for events), we'll see events appear in the listener window:

esxbox - - [09/Jul/2010 14:07:47] "POST /test HTTP/1.1" 200 -

OMC_IpmiAlertIndication

                     AlertType = 8

         AlertingElementFormat = 2

        AlertingManagedElement = root/cimv2:OMC_DiscreteSensor.DeviceID="115.0.32.0",CreationClassName="OMC_DiscreteSensor",SystemName="44454c4c-5100-1051-8039-b3c04f464e31",SystemCreationClassName="OMC_UnitaryComputerSystem"

         CorrelatedIndications = None

                   Description = Assert + Physical Security General Chassis intrusion

                       EventID = None

                     EventTime = 20100709160722.000000+000

          IndicationFilterName = None

          IndicationIdentifier = None

                IndicationTime = 20100709160735.766255+000

                       Message = None

              MessageArguments = None

                     MessageID = None

                OtherAlertType = None

    OtherAlertingElementFormat = None

                 OtherSeverity = None

                  OwningEntity = None

             PerceivedSeverity = 0

                 ProbableCause = None

      ProbableCauseDescription = None

                  ProviderName = RawIpmiProvider

            RecommendedActions = None

       SystemCreationClassName = OMC_UnitaryComputerSystem

                    SystemName = 44454c4c-5100-1051-8039-b3c04f464e31

                      Trending = None

 

If we close the chassis, then we'll get the corresponding event telling us the problem has been resolved:

esxhost - - [09/Jul/2010 14:08:12] "POST /test HTTP/1.1" 200 -

OMC_IpmiAlertIndication

                     AlertType = 8

         AlertingElementFormat = 2

        AlertingManagedElement = root/cimv2:OMC_DiscreteSensor.DeviceID="115.0.32.0",CreationClassName="OMC_DiscreteSensor",SystemName="44454c4c-5100-1051-8039-b3c04f464e31",SystemCreationClassName="OMC_UnitaryComputerSystem"

         CorrelatedIndications = None

                   Description = Deassert + Physical Security General Chassis intrusion

                       EventID = None

                     EventTime = 20100709160747.000000+000

          IndicationFilterName = None

          IndicationIdentifier = None

                IndicationTime = 20100709160801.404661+000

                       Message = None

              MessageArguments = None

                     MessageID = None

                OtherAlertType = None

    OtherAlertingElementFormat = None

                 OtherSeverity = None

                  OwningEntity = None

             PerceivedSeverity = 0

                 ProbableCause = None

      ProbableCauseDescription = None

                  ProviderName = RawIpmiProvider

            RecommendedActions = None

       SystemCreationClassName = OMC_UnitaryComputerSystem

                    SystemName = 44454c4c-5100-1051-8039-b3c04f464e31

                      Trending = None
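Since the two events above differ mainly in their Description ("Assert + ..." versus "Deassert + ..."), a callback could use that to keep track of which sensors are currently in an alert state. Here's a minimal sketch, assuming the Description always follows the pattern seen in these two samples and that the indication instance supports dictionary-style property access. split_description, track_alert and active_alerts are illustrative names, not part of the attached listener.

```python
def split_description(description):
    """Split 'Assert + <sensor text>' into (is_asserted, sensor_text)."""
    state, _, sensor = description.partition(" + ")
    if state not in ("Assert", "Deassert") or not sensor:
        raise ValueError("unexpected Description format: %r" % description)
    return state == "Assert", sensor


# (system name, sensor text) pairs that are currently asserted
active_alerts = set()


def track_alert(listener, instance):
    """Callback-style function that keeps active_alerts up to date."""
    asserted, sensor = split_description(instance["Description"])
    key = (instance["SystemName"], sensor)
    if asserted:
        active_alerts.add(key)
    else:
        active_alerts.discard(key)
```

A function with this signature could be registered with the listener in the same way as the callback shown earlier in this post.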

 

If you look closely at the output, you'll notice there are a few numeric fields (AlertType and PerceivedSeverity, for example) that have special meaning.  This listener was written so it doesn't require a connection (and therefore credentials) on the host that's the source of the indication.  If you want to pretty print the value maps, you can modify the code to follow the same approach as I described in the third blog post in this series, but that will require a client connection back to at least one ESX host that has the given classes implemented.
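If you'd rather not connect back to a host, a rougher alternative is a hard-coded lookup table. The sketch below assumes the value maps of the standard CIM_AlertIndication class, transcribed by hand, so treat the tables as assumptions and check them against the class qualifiers on a live system before relying on them. VALUE_MAPS and decode are illustrative names.

```python
# Hand-transcribed from the standard CIM_AlertIndication value maps;
# verify against the qualifiers on a live system.
VALUE_MAPS = {
    "AlertType": {
        1: "Other", 2: "Communications Alert", 3: "Quality of Service Alert",
        4: "Processing Error", 5: "Device Alert", 6: "Environmental Alert",
        7: "Model Change", 8: "Security Alert",
    },
    "PerceivedSeverity": {
        0: "Unknown", 1: "Other", 2: "Information", 3: "Degraded/Warning",
        4: "Minor", 5: "Major", 6: "Critical", 7: "Fatal/NonRecoverable",
    },
    "AlertingElementFormat": {0: "Unknown", 1: "Other", 2: "CIMObjectPath"},
}


def decode(props):
    """Return a copy of props with known numeric codes made readable."""
    decoded = dict(props)
    for name, table in VALUE_MAPS.items():
        raw = props.get(name)
        if raw is not None:
            decoded[name] = "%s (%s)" % (table.get(int(raw), "unmapped"), raw)
    return decoded
```

Under these assumed tables, the AlertType of 8 in the events above decodes to "Security Alert", which fits a chassis intrusion.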

If you've got an LSI-based RAID card on your system, try subscribing for VMware_HHRCAlertIndication and simulate some failures (assuming you have a RAID level that provides redundancy, you can pull a drive and see events fire).