My thoughts during a DR event

As a TAM (Technical Account Manager), one of the most important tasks I get involved in is proactive planning for a DR event. This has become quite important in the past few years for many of my customers as they reach a tipping point where they have more VMs than physical servers. Of course VMware has an awesome solution for the VMware environment in SRM. However, this post is not just about SRM; I want to look a bit deeper into what DR and BC actually require from a planning perspective.

Let's take a typical event that I experienced recently. A customer of mine had a complete failure at their main data center. Like most of my customers, they have a mixture of physical and virtual servers in the data center, the majority being virtual. Their virtual servers are protected by SRM. For now they have chosen to protect only critical systems rather than everything, due to cost concerns around replicating large volumes of data. This will change when they move to SRM 5, which offers vSphere Replication as an alternative. This seems to be fairly typical, and it is definitely a point to note as SRM 5 and vSphere Replication increase our customers' options for DR.

Taking a step further, we of course need to look at the rest of the environment and how it is prepared for a DR event. Setting aside the networking requirements, and assuming that networking will be made available at the DR site as required, we need to analyse the physical systems. What seems to be normal in customers that are over 60% virtualised is that many of their most important workloads are still not virtualised. These may include databases, directory services and other extremely important systems. What is interesting when these are analysed is that they are often a dependency for a system that is already virtualised.

So what do we have here? Well, a number of observations thus far. Firstly, in a highly virtualised environment it is fairly easy to protect the virtualised part using SRM, but most often only a subset is chosen. This subset is the most critical of the virtualised systems, and these often have dependencies on other systems that are still physical. Those physical systems are where the issue comes in.

If we have a DR event it is fairly easy to fail over to the DR site with SRM (how this works is a posting for another time). The issue, however, is what happens to the dependencies on physical systems on the other side. And what about desktops: how are they maintained in a complete outage situation? Are all of these dependencies mapped out and available at the DR site as part of the DR plan? Once they are up, how do we know that they are available and ready to accept communication from the virtualised environment?

This then leads to the question: is our virtualised environment more available than the physical environment on which it may still depend for some of its resources? And if this is the case, are we even able to fail over to the DR site, and is there any value in doing so?

When an event of catastrophic proportions occurs we don't have time to mess around and fiddle with systems, trying to figure out what is virtualised, what is physical and how the DR event will play out at the DR site. What seems to happen most often is that, because of all of these questions and potential issues, the virtual environment is not simply failed over as it should be, and all that occurs is downtime at the primary site until the event is over.

This seems like a major waste, and to me it amounts to having DR just to tick a box to appease the corporate auditors.

But really we need to do more.

I would love for all of my customers to virtualise 100%, but this might not always be practical for a variety of technical or political reasons, so it is best to assume that a proportion of the systems will still be physical. Here are some of the steps I would consider in a DR plan.

Step 1 is to have a solution in place that automatically identifies all of the systems in your organisation and their dependencies. Mapping this out and keeping it up to date automatically is key. This could also feed into your CMDB for consistency.

Step 2 is to identify the services that you wish to protect in the case of a DR event. Identifying services rather than individual systems will force you to focus on the components that make up the service and therefore need to be available for it to operate.

Step 3 is to identify exactly which systems are physical and which are virtual, based on the services map, and put both into the DR plan. The dependencies will dictate the run order and any special requirements on the DR side when an event takes place.
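To make the point about run order concrete, here is a minimal sketch of how a dependency map could be turned into a recovery sequence for one service. Everything in it is an assumption for illustration: the inventory, the system names and the helper functions are hypothetical, and in practice the data would come from whatever discovery tool or CMDB you use. It simply walks the dependencies of a service and sorts them so that each system comes up after the things it depends on, while keeping track of which are physical and which are virtual.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: system -> (platform, systems it depends on).
INVENTORY = {
    "web-portal": ("virtual", ["app-server"]),
    "app-server": ("virtual", ["orders-db", "directory"]),
    "orders-db": ("physical", ["directory"]),
    "directory": ("physical", []),
    "file-archive": ("virtual", []),
}


def required_systems(entry_points, inventory):
    """Return every system a service needs, following dependencies."""
    needed, stack = set(), list(entry_points)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        if name not in inventory:
            # An unmapped dependency is exactly the gap a DR plan must not have.
            raise KeyError(f"{name} is a dependency but is not in the inventory")
        needed.add(name)
        stack.extend(inventory[name][1])
    return needed


def recovery_run_order(entry_points, inventory):
    """List (system, platform) pairs, dependencies first."""
    needed = required_systems(entry_points, inventory)
    graph = {name: inventory[name][1] for name in needed}
    # static_order() yields each node after all of its dependencies
    # and raises graphlib.CycleError if the map contains a loop.
    order = TopologicalSorter(graph).static_order()
    return [(name, inventory[name][0]) for name in order]


if __name__ == "__main__":
    for step, (name, platform) in enumerate(
        recovery_run_order(["web-portal"], INVENTORY), start=1
    ):
        print(f"{step}. {name} ({platform})")
```

In this made-up example the two physical systems end up at the front of the sequence, which is the situation described earlier: the virtualised service cannot usefully come up at the DR site until its physical dependencies are there.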

You also need to pay careful attention to desktops, as they are the entry point to most systems. How will users access the DR site? Will the desktops have connectivity to the DR site? Are they available at the DR site? Are they physical endpoints with a fat client, or thin clients with a centralised source that maintains desktop state so that users are able to connect and continue as before?

Finally, if you are protecting a complex environment that includes both physical and virtual systems, which I suspect most are, you need to perform regular functional testing at the DR site, with full reporting. The SRM test may not be good enough in these cases, as it may only prove that your virtual environment can successfully fail over.
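As a small illustration of what part of such a functional test could look like, the sketch below checks whether each dependency at the DR site is accepting TCP connections. The endpoint names, hosts and ports are hypothetical placeholders, and a successful connection only shows that something is listening on the port. It does not prove the service behind it is healthy, so a real test would follow this with application-level checks.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unready_dependencies(endpoints, timeout=3.0):
    """Return the names of endpoints that did not accept a connection."""
    return [
        name
        for name, (host, port) in endpoints.items()
        if not is_reachable(host, port, timeout)
    ]


if __name__ == "__main__":
    # Hypothetical DR-site endpoints; replace with values from your own plan.
    dr_endpoints = {
        "directory": ("dr-directory.example.internal", 389),
        "orders-db": ("dr-orders-db.example.internal", 1433),
    }
    failed = unready_dependencies(dr_endpoints)
    print("Not ready:", ", ".join(failed) if failed else "none")
```

Recording the output of a check like this on every test run is one simple way to get the reporting side of the exercise.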

I would love all of my customers to simply virtualise all of their systems; that way we could have a much simpler plan. But while this is not a reality for most at the moment, we need to be pragmatic about how we approach a DR event, ensuring that no system required by a service we are protecting is left behind.

I would love to hear your comments on how this has gone for you if you have suffered such an event.