Automating Failover with SRM and PowerCLI

Today we’ll take a look at running an SRM recovery plan programmatically, through the API via PowerCLI.

In almost all scenarios, failing over in an automated fashion is a poor idea. There is a lot of risk associated with it, and a lot of potential liability for failing over on the basis of incorrect reasoning. Failing over automatically in *test mode*, however, makes an awful lot of sense!

We may want to execute weekly non-intrusive tests without needing a human to sit down and run them manually each time. We may want to feed reports about DR readiness on a weekly basis without human intervention. There are lots of reasons to automate DR testing, which until recently required API knowledge and programming skills. With PowerCLI 5.5 R2 we can do this a bit more simply, so let’s see how we do it!

Looking at the SRM GUI I can see that I have two plans to run.  In my scenario, I want to run a test failover of my “Server Workloads” recovery plan.

[Screenshot: the SRM GUI showing the two recovery plans]

First we need to know how the API sees the recovery plan we want to run, so we list the plans through the API:

[Screenshot: the PowerCLI command listing the recovery plans]
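A minimal sketch of that listing call. It assumes `$SrmApi` is built from the `ExtensionData` property of the SRM connection, and the vCenter name is a placeholder, not something from my lab:

```powershell
# Assumption: vcenter.example.com is a placeholder for your own vCenter Server.
Connect-VIServer -Server vcenter.example.com

# Connect to the SRM server registered with that vCenter and keep the API root.
$SrmConnection = Connect-SrmServer
$SrmApi = $SrmConnection.ExtensionData

# List every recovery plan the API knows about (shown by managed object reference).
$SrmApi.Recovery.ListPlans()
```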

I clearly have two recovery plans listed by the API. These are the same ones shown in the UI above, but listed by their managed object reference ID.

That doesn’t tell me which is which, so I can drill in a bit further by indexing into the list with [0] or [1] and calling the .GetInfo method to find the data I need, namely the name and description of each plan.

[Screenshot: output of the .GetInfo method for each plan]
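Something along these lines should show which plan is which. It assumes `$SrmApi` already points at the SRM API root, and trimming the output with `Select-Object` is my own choice for readability:

```powershell
# Assumes $SrmApi was populated from the SRM connection beforehand.
# GetInfo() returns the plan details; keep just the name and description.
$SrmApi.Recovery.ListPlans()[0].GetInfo() | Select-Object Name, Description
$SrmApi.Recovery.ListPlans()[1].GetInfo() | Select-Object Name, Description
```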

Plan “[0]” is the one I wish to work with, as it is my server workload recovery plan, not the full site plan.  That means from now on I’ll be working with only that recovery plan, so I’ll set a variable to its MoRef to make subsequent calls much easier:

 

[Screenshot: assigning the plan’s MoRef to a variable]
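As a sketch, the assignment looks like this. The commented alternative selects the plan by name instead of by position; it is my own suggestion for unattended scripts, on the assumption that list order may not be guaranteed:

```powershell
# Keep a handle on plan [0], the "Server Workloads" recovery plan.
$RPmoref = $SrmApi.Recovery.ListPlans()[0]

# Alternative (assumption: list order is not guaranteed): select by plan name.
# $RPmoref = $SrmApi.Recovery.ListPlans() |
#     Where-Object { $_.GetInfo().Name -eq 'Server Workloads' }
```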

Now this is where it gets interesting. We now need to tell the recovery plan *how* we want to execute it, that is to say, in what mode. A recovery plan can run in one of several modes, and each mode has its own numeric value that we need to assign. Going by the order in the guide (the edit note at the end of this post corrects the failover value):

  • 1 – Test
  • 2 – Cleanup
  • 3 – Failover
  • 4 – Reprotect
  • 5 – Revert

But the mode isn’t simply handed over as a bare number. Instead, we need to take a look at the *object* associated with the recovery plan mode, located at VMware.VimAutomation.Srm.Views.SrmRecoveryPlanRecoveryMode, and give that object a specific value. Then we pass the object to the method in order to run it correctly.

So we create a new variable for the recovery mode we want by first creating an object for it, then populating the value with the appropriate mode digit.

To create the object for our recovery plan mode:

[Screenshot: creating the recovery mode object]
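A sketch of creating the object and then inspecting it, assuming the PowerCLI SRM types are loaded in the session:

```powershell
# Create an empty recovery mode object from the type named above.
$RPmode = New-Object VMware.VimAutomation.Srm.Views.SrmRecoveryPlanRecoveryMode

# Inspect its members; value__ is the one that gets set in the next step.
$RPmode | Get-Member
```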

If I run a “get-member” against the RP mode, I can see that we have “value__” that we can set as a property.  What is the value?  Well again as above (and pp.28-29 of the SRM API developer’s guide), it’s a value of 1 through 5, referring to the various modes.

For these purposes, I want to set the value to “1”.  I want to automate for example a weekly run of a test where I don’t have to have an admin run it since it is non-intrusive and we can simply automate it.  For tests, this makes a lot of sense.

So I can now set the value manually:

[Screenshot: setting the mode value to 1]
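In sketch form, assuming `$RPmode` was created as described:

```powershell
# 1 corresponds to test mode.
$RPmode.value__ = 1

# Echo the object back to confirm what it now holds.
$RPmode
```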

Okay, the value is set to 1, equivalent to “test” mode. How do we execute the command? Each plan returned by “Recovery.ListPlans” under the API has a “Start” method, which expects a mode as its parameter.

[Screenshot: members of the listed plans, showing the Start method and its parameter]
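One way to look at that method signature yourself is with `Get-Member`. This is a sketch of the kind of call involved, not necessarily the exact command I ran:

```powershell
# Show the Start method on the plan objects, including the parameter type it expects.
$SrmApi.Recovery.ListPlans() | Get-Member -Name Start
```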

Remember the $RPmode object we created? This is how I knew where to find it: by examining the members of the plans returned by ListPlans. You can see the required object as the parameter that needs to be passed to the “Start” method. At any rate, in order to execute the method, I’d need to do something along the lines of:
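A sketch of the long form, assuming `$SrmApi` and `$RPmode` are set up as described so far:

```powershell
# Start plan [0] in whatever mode $RPmode currently holds (1, i.e. test).
$SrmApi.Recovery.ListPlans()[0].Start($RPmode)
```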

I can simplify it even further, as earlier I created a variable for $SrmApi.Recovery.ListPlans()[0] which was $RPmoref, so really, with all my variables created, I can simply execute:

[Screenshot: running the Start method with $RPmode]
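With both variables in place, the call reduces to the line below. The follow-up state check is my own addition, assuming the plan info exposes a `State` property:

```powershell
# Run the recovery plan in the mode held by $RPmode (test).
$RPmoref.Start($RPmode)

# Optional (assumption: GetInfo() exposes State): check what the plan is doing now.
$RPmoref.GetInfo().State
```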

And just like that, we have a test running.

[Screenshot: the test running in the SRM GUI]

Once the test is complete, of course we’ll want to clean up automatically as well, right?  Simple… Change the digit in “$RPmode.Value__” that we set earlier.  “2” should equal cleanup.

Then run the recovery plan as before via “$RPmoref.Start($RPmode)” and we should see it automatically clean up:

[Screenshot: the cleanup running in the SRM GUI]
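Put together, the cleanup step is a sketch like this, reusing the same two variables:

```powershell
# Switch the mode object from test (1) to cleanup (2).
$RPmode.value__ = 2

# Run the same plan again; this time it cleans up the completed test.
$RPmoref.Start($RPmode)
```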

So there you have an automated test failover and cleanup.  Obviously you can do a lot more than this – you could list the plans, then do a foreach on the result and run them independently yet sequentially, for example.

You could do some error handling and add in some result code testing, etc.

You could schedule a job to execute this on a weekly basis, all sorts of ideas.
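As a rough sketch of those ideas combined, the loop below tests and cleans up each plan in turn, with basic error handling. `Wait-SrmPlan` is a hypothetical helper written for this example, and it assumes the plan state reads `Running` while a test or cleanup is in progress; check the state names your SRM version reports before relying on it.

```powershell
# Assumes $SrmApi is already populated from the SRM connection.
$testMode = New-Object VMware.VimAutomation.Srm.Views.SrmRecoveryPlanRecoveryMode
$testMode.value__ = 1      # test
$cleanupMode = New-Object VMware.VimAutomation.Srm.Views.SrmRecoveryPlanRecoveryMode
$cleanupMode.value__ = 2   # cleanup

# Hypothetical helper: poll the plan until it stops reporting a running state.
function Wait-SrmPlan {
    param($Plan, [int]$PollSeconds = 30)
    # Assumption: the state is 'Running' while an operation is in progress.
    do {
        Start-Sleep -Seconds $PollSeconds
        $state = "$($Plan.GetInfo().State)"
    } while ($state -eq 'Running')
    return $state
}

foreach ($plan in $SrmApi.Recovery.ListPlans()) {
    $name = $plan.GetInfo().Name
    try {
        # Run the test and wait for it to finish before cleaning up.
        $plan.Start($testMode)
        $afterTest = Wait-SrmPlan -Plan $plan
        Write-Output "$name : state after test is $afterTest"

        $plan.Start($cleanupMode)
        $afterCleanup = Wait-SrmPlan -Plan $plan
        Write-Output "$name : state after cleanup is $afterCleanup"
    }
    catch {
        # Report the failure and move on to the next plan.
        Write-Warning "$name : $($_.Exception.Message)"
    }
}
```

Saved as a script, something like this could be run from a scheduled task for the weekly test; the output lines are a starting point for a DR readiness report.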

One other thing you could do, which I suggest you *not* do: automate real failovers. There are a few good reasons to do it, and many good reasons not to. Chief among the reasons not to is the huge potential for false positives, or for fat-fingering a script name and causing an entire site outage and failover. Obviously these are bad things.

The only scenario in which I can imagine using an automated failover script is one where a separate enterprise management toolkit performs full fault and root cause analysis and has reported a critical impact and severity, and perhaps even then *that* tool still requires an administrator to click a big red panic button to initiate a failover. In that scenario it might be worth integrating physical mainframe or Unix boxes with an SRM recovery plan under a ‘manager of managers’ to automate failover. If that is the case, and you are positive you want to automate a *real* DR failover? I’ll give you a hint:
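Going by the correction in the edit note that follows, the hint comes down to a different mode value. This is a sketch only, and it deliberately stops short of calling `Start`:

```powershell
# 0 is failover, per the May 30 edit note. This only sets the mode;
# nothing happens until the plan's Start method is called with it.
$RPmode.value__ = 0
```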

Edited May 30, 2014

Something I love about our community is that I can get almost instant feedback from our brilliant professional services group and product management. It turns out that the order in which the mode codes are listed in the docs does not necessarily line up with the numeric values the API actually uses. I had originally listed the mode value as “3” for failover, as per the sequence in the guide, whereas that is actually cleanup! Failover is “0” instead. Edited for correctness, and thank you very much to Raymond Milot for finding that mistake!

Beyond that, Ben is one of our product managers extraordinaire, incredibly well versed in the depths of our APIs, and he has pointed out in the comments that you do not *need* to assign a numeric value at all. You can use an English value instead, and it is interpreted just as well. As per his comment, rather than setting $RPmode.Value__ to a number, you can set $RPmode itself to the appropriate string, a la:
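A sketch of the string-based form. It assumes the enumeration has a member named `Test` and relies on PowerShell converting a string to the matching enumeration value:

```powershell
# Type-constrain $RPmode so the string is converted to the recovery mode type.
# Assumption: 'Test' is the member name for test mode.
[VMware.VimAutomation.Srm.Views.SrmRecoveryPlanRecoveryMode]$RPmode = 'Test'

# Run the plan exactly as before.
$RPmoref.Start($RPmode)
```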

You may choose to use either method – numbers are easier to not mess up and good for quick substitution, a text string is obviously considerably easier to understand!

Thanks very much also to Ben!