In this installment, we'll look at one of the most complex, yet perhaps most powerful, aspects of CIM – asynchronous events, or as they're officially named, CIM Indications. In this post we'll talk a little about how the plumbing works to subscribe for and receive indications, and then look at a real-life event using the included example code. With these examples, you should be able to quickly build up your own custom datacenter monitoring solutions for hardware health event detection.
The exact mechanism of subscribing for and receiving events in CIM depends on which wire protocol you're using. In this example, since I'm using CIM XML, we'll focus on how that plumbing works. If you use WS-Management, the approach is slightly different. First, let's look at the theory behind how CIM XML based indications work.
With indications, the consumer first has to subscribe for indications and tell the server what sorts of events they're interested in. As part of that subscription, the consumer has to tell the server where to send the indications, which we call the listener. In the CIM XML model, the listener is really nothing more than a primitive web server that can handle HTTP POST operations, and parse and respond to CIM XML payloads. So in essence, the client becomes the server, and the actual server becomes the client when delivering events. When an event occurs on the server that matches the subscription, the server will open an outgoing socket to the listener's address and port number, and send the indication encoded in CIM XML over an HTTP POST operation. This of course means this model doesn't work if you have a firewall between the client and server, unless you jump through hoops setting up proxies or port forwarding.
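To make the "client becomes the server" idea concrete, here's a minimal sketch of what such a listener amounts to, using only Python's standard library. This is not the attached example code – the handler class and helper names here are my own, and a production listener would also need to return a proper CIM-XML ExportMethodResponse body rather than a bare 200 – but it shows the shape of the plumbing:

```python
# Minimal sketch of a CIM-XML indication listener: a tiny HTTP server that
# accepts POSTs and pulls the class names out of any INSTANCE elements in
# the CIM-XML payload. Names here are illustrative, not from the attached code.
from http.server import BaseHTTPRequestHandler, HTTPServer
from xml.dom.minidom import parseString

def extract_indication_classnames(xml_body):
    """Return the CLASSNAME of every INSTANCE element in a CIM-XML payload."""
    doc = parseString(xml_body)
    return [node.getAttribute('CLASSNAME')
            for node in doc.getElementsByTagName('INSTANCE')]

class IndicationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        for name in extract_indication_classnames(body):
            print('Received indication:', name)
        # A real listener must reply with a CIM-XML ExportMethodResponse;
        # a bare 200 is enough to show the flow.
        self.send_response(200)
        self.end_headers()

def run_listener(port):
    # Blocks forever serving indications; run in a thread if needed.
    HTTPServer(('', port), IndicationHandler).serve_forever()
```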
At present, pywbem does not include a generic listener, but that's OK because python does have all the primitives we need to build this up with ease. In under 150 lines of python we can whip up a fairly functional indication listener. I'm not going to devote space in this post describing exactly how the listener is implemented — you can read through the attached code in detail if you like. The code is written to work either standalone, where it will just dump out the indications it receives to the terminal, or you can import it into another script and wire up callbacks that will be called when an indication is received. If you're comfortable writing your own python scripts, then leveraging this listener would look something like this:
import threading
import indicationListener

portnumber = 1234  # pick any free port

def myCallbackRoutine(listener, instance):
    # do something interesting here with the indication instance
    print(instance.classname)

indicationListener.SimpleCIMXMLListener.registerCallback(myCallbackRoutine)
t = threading.Thread(target=indicationListener.runListenerForever, args=[portnumber])
t.start()
Once we have a listener working on a known port, then we have to subscribe. Subscription in CIM XML requires creating instances of two endpoint classes and an association. I've attached another python module to this post that does all the heavy lifting for you, so all you have to do is specify some arguments and it will take care of the rest. Again, like the listener, you can either run this as a stand-alone tool, or import it and run it within your own scripts. Continuing from the snippet of code above, if we wanted to import this as a module and use it within another script, the code might look something like this.
import pywbem
import indicationSubscriber

myhostname = 'xxx'  # replace with the local host name (name of the listener)
classname = 'VMware_CIMHeartbeat'
client = pywbem.WBEMConnection('https://myserver', ('root', 'password'), 'root/cimv2')
indicationSubscriber.subscribe(client, 'myserver', classname, myhostname, portnumber)
And you should start seeing a steady stream of "heartbeats" from the ESX system. If you call the subscribe routine with an additional "True" as the final argument, it will unsubscribe only – by default it first unsubscribes to clear out any old subscriptions, and then subscribes. (Note: These heartbeat indications are really only meant to be a debugging/setup aid. I wouldn't recommend using them as a real-life heartbeat, as the frequency is too high and will cause network load issues if you have a large number of hosts.)
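Under the covers, the subscribe helper creates a filter (what to watch for), a handler (where to deliver), and the association that ties them together. Here's a rough sketch of those three pieces, shown as plain dictionaries for clarity – with pywbem you'd wrap each in a `pywbem.CIMInstance` and call `CreateInstance` against the server. The class and property names follow the standard DMTF indication classes, but the helper function and naming scheme below are illustrative, not the attached module's actual code:

```python
# Illustrative sketch of the three objects behind a CIM XML subscription:
# a CIM_IndicationFilter, a CIM_ListenerDestinationCIMXML handler, and the
# CIM_IndicationSubscription association that links them.
def build_subscription_parts(classname, listener_host, listener_port):
    filter_inst = {
        'CreationClassName': 'CIM_IndicationFilter',
        'Name': 'filter_%s' % classname,
        'Query': 'SELECT * FROM %s' % classname,  # what events to match
        'QueryLanguage': 'WQL',
    }
    handler_inst = {
        'CreationClassName': 'CIM_ListenerDestinationCIMXML',
        'Name': 'handler_%s' % classname,
        # where the server will POST the CIM-XML indications
        'Destination': 'http://%s:%d' % (listener_host, listener_port),
    }
    # The association ties the filter to the handler; its Filter and Handler
    # properties are object references to the two instances above.
    subscription = {'Filter': filter_inst, 'Handler': handler_inst}
    return filter_inst, handler_inst, subscription
```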
So now that we've got the plumbing out of the way, let's explore some "real" indications on the system. One approach is to use formal documentation to figure out what classes to subscribe for. To do this for the VMware implemented classes, take a look at the API Reference doc, click on "All Classes" in the upper left, and then search for classes with "Indication" in the classname in the lower left. Alternatively, you can use a live system to figure this out on the fly. Here's a simple block of code to display all the implementation classes that might be of interest given a starting classname (it hides all the superclasses that have some derived class):
def dumpLeafClasses(client, classname):
    classes = client.EnumerateClasses(ClassName=classname, DeepInheritance=True)
    superclasses = set()
    allclasses = set()
    for theClass in classes:
        allclasses.add(theClass.classname)
        if theClass.superclass is not None:
            superclasses.add(theClass.superclass)
    return sorted(allclasses - superclasses)
If we call that with the class 'CIM_Indication' then we'll get all the indications that are available on the system, in the given namespace. Let's try this on a live system with a client connection pointed at the 'root/cimv2' namespace:
>>> dumpLeafClasses(client, 'CIM_Indication')
[u'OMC_IpmiAlertIndication', u'VMware_CIMHeartbeat', u'VMware_ConcreteJobCreation', u'VMware_ConcreteJobDeletion', u'VMware_ConcreteJobModification', u'VMware_HHRCAlertIndication', u'VMware_KernelIPChangedIndication']
>>>
We could then do a GetClass on any of these to inspect their Description qualifiers to get some more details on what they're for. The one that we'll look at in this post is the OMC_IpmiAlertIndication, which is sent when a change is detected in the IPMI subsystem. In the sequence below, I'll use the CLI versions, but you can do the same via scripts as described above.
First we'll start the listener (if we don't specify a port number, it picks one at random):
% ./indicationListener.py
Listener started on port 2578
Then we'll subscribe for the indications (using the port number it just gave us):
% ./indicationSubscriber.py -P 2578 -H listener -c OMC_IpmiAlertIndication -s esxbox -u root -p ''
Cleaning up old registrations
Could not delete subscription CIMError(5, u'Class not found')
Could not delete handler CIMError(6, u'The requested object could not be found')
Could not delete filter CIMError(6, u'The requested object could not be found')
Creating handler
Creating filter
Creating subscription
Done – events are enabled for listener:2578
%
You can safely ignore the "could not delete" warnings – those just tell us we didn't have a previous subscription from this host.
Then we'll cause something to happen on the system. On many systems (Dell for example) there's usually a chassis intrusion sensor that will detect if you open the case. I find that to be one of the simplest indications to get to fire if you want to test to make sure you have everything wired up correctly. (If you have redundant power supplies, unplugging one of them from the wall is another easy test scenario.) Within about 15 seconds or so (the interval at which the system scans IPMI for events), on the listener window we'll see events appear:
esxbox – – [09/Jul/2010 14:07:47] "POST /test HTTP/1.1" 200 –
OMC_IpmiAlertIndication
AlertType = 8
AlertingElementFormat = 2
AlertingManagedElement = root/cimv2:OMC_DiscreteSensor.DeviceID="115.0.32.0",CreationClassName="OMC_DiscreteSensor",SystemName="44454c4c-5100-1051-8039-b3c04f464e31",SystemCreationClassName="OMC_UnitaryComputerSystem"
CorrelatedIndications = None
Description = Assert + Physical Security General Chassis intrusion
EventID = None
EventTime = 20100709160722.000000+000
IndicationFilterName = None
IndicationIdentifier = None
IndicationTime = 20100709160735.766255+000
Message = None
MessageArguments = None
MessageID = None
OtherAlertType = None
OtherAlertingElementFormat = None
OtherSeverity = None
OwningEntity = None
PerceivedSeverity = 0
ProbableCause = None
ProbableCauseDescription = None
ProviderName = RawIpmiProvider
RecommendedActions = None
SystemCreationClassName = OMC_UnitaryComputerSystem
SystemName = 44454c4c-5100-1051-8039-b3c04f464e31
Trending = None
If we close the chassis, then we'll get the corresponding event telling us the problem has been resolved:
esxbox – – [09/Jul/2010 14:08:12] "POST /test HTTP/1.1" 200 –
OMC_IpmiAlertIndication
AlertType = 8
AlertingElementFormat = 2
AlertingManagedElement = root/cimv2:OMC_DiscreteSensor.DeviceID="115.0.32.0",CreationClassName="OMC_DiscreteSensor",SystemName="44454c4c-5100-1051-8039-b3c04f464e31",SystemCreationClassName="OMC_UnitaryComputerSystem"
CorrelatedIndications = None
Description = Deassert + Physical Security General Chassis intrusion
EventID = None
EventTime = 20100709160747.000000+000
IndicationFilterName = None
IndicationIdentifier = None
IndicationTime = 20100709160801.404661+000
Message = None
MessageArguments = None
MessageID = None
OtherAlertType = None
OtherAlertingElementFormat = None
OtherSeverity = None
OwningEntity = None
PerceivedSeverity = 0
ProbableCause = None
ProbableCauseDescription = None
ProviderName = RawIpmiProvider
RecommendedActions = None
SystemCreationClassName = OMC_UnitaryComputerSystem
SystemName = 44454c4c-5100-1051-8039-b3c04f464e31
Trending = None
If you look closely at the output, you'll notice a few fields with numeric types that have special meaning. This listener was written so it doesn't require a connection (and therefore credentials) to the host that's the source of the indication. If you want to pretty print the value maps, you can modify the code to follow the same approach as I described in the third blog post in this series, but that will require a client connection back to at least one ESX host that has the given classes implemented.
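The lookup itself is simple once you've fetched the ValueMap/Values qualifier pairs via GetClass from a live host. Here's a small sketch of that mapping step as a standalone function (the function name is mine, and the qualifier arrays shown in the test reflect the standard CIM_AlertIndication AlertType values, where 8 means "Security Alert" – which matches the chassis intrusion events above):

```python
# Map a numeric CIM property value through its ValueMap/Values qualifier
# arrays (both are arrays of strings as returned by GetClass). Falls back to
# the raw value if the server doesn't define a mapping for it.
def value_to_label(value, valuemap, values):
    try:
        return values[valuemap.index(str(value))]
    except ValueError:
        return str(value)  # no mapping found; show the raw value
```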
If you've got an LSI based RAID card on your system, try subscribing for VMware_HHRCAlertIndication and simulate some failures (assuming you have a RAID level that provides redundancy, you can pull a drive and see events fire.)