Let’s start off with a cheery fact: the U.S. Department of Labor estimates that over 40% of businesses never reopen following a disaster, and of those that do, at least 25% will close within two years. Over 60% of businesses confronted by a major disaster close within two years, according to the Association of Records Managers and Administrators.
A question I’m asked a lot is: do I really need DR? Reading the statistics above, I hope the answer is obvious, but in reality the actual answer is: it depends. OK, that is probably the ‘woolliest’ thing anyone in IT can say; as engineers we like hard and fast, black and white rules, dammit!
For example, you may work for a company that has no on-premise IT: you use cloud-based platforms for your accounts, CRM and HR packages, and hosted Exchange, SharePoint and Lync for your communications. Would you need DR? Well, the answer is probably not.
What about if you work for a company with a vSphere environment that can tolerate two host failures and has redundancy at every level? This is housed in a Tier IV datacentre offering 99.999% uptime, with the usual battery-backed generators, diverse internet links, fire suppression systems and environmental monitoring, and connectivity is provided by diverse links to the datacentre. Would you need DR then? Possibly; it depends on how the company views risk, but if I were a betting man, I would say in most scenarios DR wouldn’t be necessary.
Both of the above are extremes. Most SMBs require on-premise solutions to facilitate how they work, and often they don’t have the budget for datacentres, preferring to use remote offices for any DR activity.
So what does DR begin with? Well, two terms you often hear bandied about: RTO and RPO (great, I hear you say, just what we need, more abbreviations in IT).
- Recovery Point Objective (RPO) is the term used to describe how much data loss can be accepted. Let’s imagine you have been working on an awesome vSphere design for the past week and you had finally nailed it, but before you hit save you get a BSOD. Being a good boy/girl scout, you perform backups on a daily basis, which means your RPO is 24 hours: a day’s work is the most you stand to lose.
- Recovery Time Objective (RTO) is the term used to describe how long you are prepared to wait to restore data. So in the above scenario, you probably swear a bit, then go back to your backup, perhaps a USB hard drive, find the file and restore it. The RTO is the time taken to perform this procedure.
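To make the distinction concrete, here is a minimal sketch of the two measurements for the backup scenario above; the timestamps are made up purely for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for the scenario above (illustrative only).
last_backup = datetime(2013, 6, 10, 2, 0)     # nightly backup finished at 02:00
failure_time = datetime(2013, 6, 10, 16, 30)  # BSOD strikes mid-afternoon
restore_done = datetime(2013, 6, 10, 17, 15)  # file recovered from the USB drive

# RPO exposure: the window of work lost, i.e. time since the last good copy.
data_lost = failure_time - last_backup
# RTO: how long it took to get the data back after the failure.
time_to_recover = restore_done - failure_time

print(f"Data lost (RPO exposure): {data_lost}")     # 14:30:00
print(f"Time to recover (RTO): {time_to_recover}")  # 0:45:00
```

The RPO you set (daily backups, so up to 24 hours of loss) bounds the worst case; the actual loss on the day depends on when the failure lands relative to the last backup.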
Now when it comes to DR for a business, it’s just the same on a larger scale. And don’t assume a DR event means a natural disaster; more often than not, DR is invoked because of an outage of some type of connectivity, whether that’s an inter-site link between two offices or the main internet feed.
So what choices do we have as engineers to help us facilitate DR?
Traditional disaster recovery plans leave organizations exposed to significant risk of extended downtime because they are laborious to set up, time-consuming to maintain (they often require manual duplication of changes) and, most importantly, extremely difficult to test.
Let’s look at this scenario: Bob needs to update his ten front-end web servers with a new patch released by the application vendor which resolves countless issues, and he does this at the Production site. He then needs to do the same for the ten front-end web servers in the DR location; however, Bob thinks, you know what, I’ll do that after lunch. Bob gets back from lunch to a critical issue, which takes the rest of the day. Bob resolves the critical issue, but forgets to patch the ten front-end web servers in DR.
What’s the net result? An imbalance between what’s at Production and what’s at DR.
Also, how do we get the data from our Production site to our DR site? This brings us to the next option: software replication.
Software replication and clustering technologies are great; would I use them for DR? Nope. Why’s that, you ask, Craig? Well, the simple answer is overhead and cost.
I think this needs a little more explanation. Microsoft provides some great technology right out of the box: DFS-R, SQL Replication and Exchange 2010 DAGs. However, this means you have to license multiple copies of the software and, perhaps more importantly, configure, manage and maintain multiple copies.
Let’s look at Exchange 2010, a great bit of kit; however, we sometimes forget what’s required to facilitate mail flow at a DR site. First of all we need an anti-spam provider with a secondary route pointing to our DR site, unless you are going to rely on a secondary MX record. We then need a second CAS/Hub server at the DR site. But hold on a minute: we cannot automatically fail over mail flow, as client connections are specific to a CAS server, so we need to introduce a load balancer. And it can’t be a single-site load balancer, because we need to use different internet breakouts in DR, so we need to ‘up the ante’ and go for a global server load balancer. I think this has now entered the realms of being slightly complicated.
So what would my RPO be? The honest answer is you don’t know. The current wave of software replication and clustering technologies work on a best-endeavours basis: you are unable to confirm how much data has actually been transferred to the DR site, so in the event of a DR, you don’t know how much data you could lose.
Perhaps a better question to ask is: how do I even know it’s working correctly? Most of the time the answer is to check it manually: open up SQL to check replication, or drop a file on a DFS share and check it replicates.
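That manual “drop a file and check it replicates” test can at least be scripted. Here is a rough sketch; the share paths, timeout and function name are all my own placeholders, not anything Microsoft provides. It drops a canary file on the production share and polls the DR replica until it appears or the timeout expires:

```python
import os
import time
import uuid

def check_replication(prod_share, dr_share, timeout=300, poll=10):
    """Drop a canary file on the production share and wait for it to
    appear on the DR replica. Paths are illustrative; point them at
    your own DFS shares."""
    name = f"repl-canary-{uuid.uuid4().hex}.txt"
    src = os.path.join(prod_share, name)
    with open(src, "w") as f:
        f.write("replication check")
    deadline = time.time() + timeout
    try:
        while time.time() < deadline:
            if os.path.exists(os.path.join(dr_share, name)):
                return True   # file replicated within the window
            time.sleep(poll)
        return False          # replication lagging or broken
    finally:
        os.remove(src)        # tidy up the canary file
```

Even scripted, though, this only tells you the pipe is moving; it still doesn’t tell you how far behind the replica is, which is the crux of the RPO problem.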
Another question is: how do I protect servers that are not able to use Microsoft replication technologies? The general answer is to use a software replication application, which essentially replicates a virtual machine to the DR site on a pre-set schedule.
Software replication is susceptible to application and service failures when updates or changes are applied. These tools often work on snapshots, which have the rather annoying habit of not being committed, leaving you to monitor your datastores on a daily basis for rogue snapshots.
What about failing back once we have failed over to DR? Is it possible? Yes; however, depending on the reason for the outage, failback might not be possible for weeks or even months, and you might have to manually rebuild your entire Production infrastructure: Domain Controllers, Exchange DAG, SQL Replication, DFS-R file replication.
With all of this in mind, I was introduced to VMware Site Recovery Manager just under two years ago, and, without wishing to sound biased, what an awesome product.
VMware Site Recovery Manager
So why is Site Recovery Manager (SRM) so different from traditional and software-based replication? The answer lies in its simplicity, and in its ability to let you report on your disaster recovery effectiveness to management and directors.
You can perform a test failover whenever you like, with whatever services you choose (as long as you have designed your storage layer correctly). Let’s expand on this a bit.
SRM utilises either vSphere Replication (included with most vSphere licenses) or storage-based replication. Essentially, SRM allows you to take a replicated volume, for example your file server, which is read/write at Production and read-only at DR. When you perform a test failover, SRM sends commands to the storage layer via the Storage Replication Adapter (SRA) to initiate a snapshot of the read-only volume in DR. It then presents this snapshot as read/write to allow the VM to boot, and, perhaps most importantly, you as the vSphere administrator choose which network the VM connects to in DR. This means you can access the VM and make sure it works. That’s pretty awesome!
So what else does SRM allow us to do?
- Change the IP address of virtual servers on failover and failback.
- Start VMs in priority order, ensuring that subsequent VMs do not start until the higher-priority VMs’ VMware Tools have started.
- Pause workflows to allow for manual user intervention.
- Run custom scripts or executables during failover or failback.
- Allow re-protection and failback with ease.
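The priority-ordering behaviour is worth a quick sketch. Again, the VM interface below is invented purely for illustration (the real work is done by SRM against vCenter); the point is that each priority group boots only after every VM in the previous group reports VMware Tools running:

```python
import time

def start_in_priority_order(vms, tools_timeout=600):
    """Boot VMs group by group: the lowest priority number boots first,
    and the next group waits until every VM in the current group has
    VMware Tools running. `vms` maps priority -> list of VM objects."""
    for priority in sorted(vms):
        group = vms[priority]
        for vm in group:
            vm.power_on()
        deadline = time.time() + tools_timeout
        # Block until the whole group is up before moving to the next tier.
        while not all(vm.tools_running() for vm in group):
            if time.time() > deadline:
                raise TimeoutError(f"priority {priority} group failed to start")
            time.sleep(5)
```

This is why you can safely put your Domain Controllers at priority 1, SQL at priority 2 and the front-end web servers last: nothing boots before its dependencies are answering.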
This last point, ‘allow re-protection and failback with ease’, is key. Why’s that? Well, in many SMB environments we don’t have the luxury of a dedicated SAN engineer, a dedicated vSphere engineer and a dedicated Exchange engineer; more often than not, all of those roles belong to you, the dedicated ‘IT Engineer’.
Depending on your experience, you may or may not feel comfortable taking your entire storage infrastructure and reversing its replication; I know that I, for one, would definitely be earning my ‘bacon’ if I had to do this.
SRM uses the SRA to send all of these commands, taking the onus away from us as the ‘IT Engineer’ and automating it. I mean, how cool is it to be able to re-protect your environment from DR back to Production in three mouse clicks?
If you want to find out more about how to implement SRM and perhaps more importantly some of the key things to take into consideration, hop over to http://vmfocus.com where I have a section dedicated to this subject. Or get in touch via Twitter @vmfcraig.
Get to know Craig - Read 10 Questions With...Craig Kilborn