
Optimizing BCDR Plans in SRM: Leaving Unnecessary Disk and Data Behind at the Protected Site

Let’s talk about Flotsam. And Jetsam. No? Okay. Let’s talk about something important instead. For a change 🙂

In the “Cold” Recovery Site Topology section of our Protecting and Recovering Mission-Critical Applications in a VMware Hybrid Cloud with Site Recovery Manager Guide, we mentioned cost minimization as one of the top drivers in designing and maintaining optimal Business Continuity and Disaster Recovery (BCDR) Plans for large enterprises. We described how enterprises use a “Cold DR Site” as an option for achieving this cost-reduction objective.

An astute reader would have likely asked after reading and successfully following the steps in the Guide in their own design effort: But, what about the cost of continuous data replication? Yeah, I know… “you” didn’t ask that question, but it’s OK. We will address this salient question in this appendix to the Guide.

While a “Cold DR Site” does, indeed, help enterprises minimize utilization and consumption costs by not using compute resources on an on-going basis for their protected workloads, the need for continuous synchronization/replication of data from the Protected Site to the Recovery Site (even in a “Cold DR Site” configuration) means that this design option does not completely remove the cost associated with a sound and optimal BCDR Plan. Technically, a truly “no-cost BCDR Plan” is a pink Unicorn: desirable, but unattainable. The next best thing is for enterprises to be able to reduce these associated costs to as negligible an amount as technically and technologically possible, all without diminishing the effectiveness of their Plans.

That data must be replicated from the Protected Site to the Recovery Site is an unavoidable reality, if the main objective is to be able to recover and restore services as fully and quickly as possible in the event of a disaster. In looking for cost savings, therefore, the question we have to ask is “what is the least amount of data that needs to be replicated while still achieving the desired optimal recovery outcome?” The logical answer to this question is simple: Everything you need and nothing you don’t.

Microsoft SQL Server (hereinafter simply referred to as “SQL Server”) is a very popular Business Critical Application in the enterprise. It is considered a 1st-tier candidate for BCDR in most enterprises. A common design for SQL Server in the enterprise is to separate the various file types onto separate disk volumes in production (OS files go here, Transaction Logs go here, TempDB goes over there, etc). A SQL Server instance comprises various files and file types which, together, are required for the functionality, stability and health of the Server, Database, Scripts, Jobs and other configuration information held by that instance. In steady state operations, a loss or corruption of any of these files can create operational challenges for the infrastructure and the administrators. So, protecting and recovering them is essential when designing a BCDR Plan.

The TempDB is one of the most critical of these files. Its files are arguably the busiest of the file types, and they usually grow large. TempDB is a global resource for the entire SQL Server because it is where (among other things) user and system objects are created, and where results of operations and uncommitted transactions are temporarily stored. The fact that information held in TempDB files is never stored there permanently and that these files are recreated every time the Server or Services are restarted makes the TempDB drive a very attractive place to look in our search for cost-saving and cost-minimization opportunities in a BCDR Plan design.

The TempDB files are important. Very. But they are also disposable. When they are not available, Microsoft SQL Server recreates them and carries on with its business. So, given this, why would we need to replicate a Protected SQL Server’s TempDB data to the Recovery Site in a BCDR Plan? We don’t. Well, to be technically accurate, we sort of don’t. We don’t NEED the data in the TempDB disk/volume, but we do need the TempDB disk/volume itself when we are recovering a SQL Server in the event of a disaster.

Let me uncouple that a bit. A Microsoft SQL Server has knowledge of its configuration. When an Admin tells SQL Server where to store its files, SQL Server expects to find that path when it needs to write something to the file. If it doesn’t, it complains. A Microsoft SQL Server that cannot find the path to the disk/volume where its TempDB files are supposed to be written will refuse to function or provide services until an Administrator manually or programmatically intervenes and corrects the error. SQL Server doesn’t mind that the files themselves are gone. It minds that it can’t find the path in which it can recreate the necessary files.
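To see which paths SQL Server will go looking for, you can ask the instance where its TempDB files live. What follows is a minimal sketch and not part of the original procedure. It assumes the SqlServer PowerShell module is installed, that you run it on the SQL Server VM itself, that `localhost` (a placeholder) reaches your instance, and that your Windows account is allowed to query it.

```powershell
# Sketch only: list TempDB file locations and check that their folders exist.
# Assumptions: the SqlServer module is installed; 'localhost' is a placeholder
# for your instance name; the script runs locally on the SQL Server VM.
Import-Module SqlServer

$query = @"
SELECT name, physical_name
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');
"@

$files = Invoke-Sqlcmd -ServerInstance 'localhost' -Query $query

foreach ($file in $files) {
    # SQL Server recreates the files, but the folder that holds them must exist.
    $folder = Split-Path -Path $file.physical_name -Parent
    [pscustomobject]@{
        LogicalName  = $file.name
        PhysicalName = $file.physical_name
        FolderExists = Test-Path -Path $folder
    }
}
```

Depending on the module version, you may need extra connection options (for example, around encryption). If the files sit in a subfolder rather than at the root of the drive, that folder most likely has to exist on any blank replacement disk as well, so it is worth noting the output before you go further.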

In VMware Site Recovery Manager, we have an option to protect a VM and exclude some of its disks from the protection. This feature enables enterprises to achieve the “Everything you need and nothing you don’t” principle mentioned above. Under normal circumstances, you protect a VM and all the contents of its disks. In our SQL Server case, we have a disk (or several) in which we have disposable data. Replicating this disposable data costs money and storage space. Worse, the replicated data is actually useless in a DR scenario. If you recall, SRM recovers a VM in a powered-off state. SRM powers on the VM during the recovery. For SQL Server, the contents of the TempDB are discarded and recreated on every power-on. So, why waste money replicating unnecessary data into an environment that charges by “consumption”?

In the following paragraphs, we will walk through the process of optimizing an SRM-based BCDR Plan for a production Microsoft SQL Server in order to achieve the desired cost-minimization objective.

We’re presenting a very simplified use case to avoid confusion.

  • In this scenario, we have one Microsoft SQL Server VM in our Production (Protected) Site.
  • This Server has a disk dedicated to TempDB files. Our objective is to protect and recover this important VM with SRM into our VMware vSphere-based Hybrid Cloud, at the least cost possible.
  • We will achieve this objective by excluding the TempDB disk from replication.
  • Because our SQL Server expects to find a disk for TempDB, we will provide one for it – at the Recovery Site, during recovery.
  • The disk we are providing to our recovered SQL Server is blank and takes up only a very small fraction of its provisioned size, so it helps us to considerably minimize the on-going operating costs associated with replicating unnecessary data during steady state operation.

Here is the TempDB volume in our reference SQL Server VM. It is located in the Protected Site.

 

  • Here is what it looks like in our vSphere Client

 

We will create a temporary VM and add a similarly-configured disk to it on the Recovery Site.

NOTE: You can name the temp VM anything you want. We just need to make sure that it runs the same OS type as the protected VM. We then create a VMDK to represent the TempDB disk and attach it to the temp VM. The VMDK must be the same size as the original, attached to the same SCSI controller slot, formatted the same way, and given the same volume label and drive letter inside Windows.
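Before building the stand-in disk, it helps to write down the attributes you have to match. The sketch below is one way to collect them and is not part of the original walkthrough. The first half runs inside the production SQL Server VM. The second half assumes VMware PowerCLI is installed on a workstation. The vCenter name `vcenter-prod.example.com` and the VM name `SQL-PROD-01` are hypothetical.

```powershell
# --- Part 1: run inside the production SQL Server VM ---
# 'J:' is the TempDB drive letter used in this walkthrough.
Get-CimInstance -ClassName Win32_Volume -Filter "DriveLetter='J:'" |
    Select-Object DriveLetter, Label, FileSystem, BlockSize,
        @{ Name = 'CapacityGB'; Expression = { [math]::Round($_.Capacity / 1GB, 2) } }

# --- Part 2: run from a workstation with VMware PowerCLI ---
# Hypothetical names: replace the vCenter and VM names with your own.
Connect-VIServer -Server 'vcenter-prod.example.com'

Get-VM -Name 'SQL-PROD-01' | Get-HardDisk | ForEach-Object {
    [pscustomobject]@{
        Disk       = $_.Name
        CapacityGB = $_.CapacityGB
        Format     = $_.StorageFormat
        Controller = (Get-ScsiController -HardDisk $_).Name
        UnitNumber = $_.ExtensionData.UnitNumber   # slot on the SCSI controller
        File       = $_.Filename
    }
}
```

The in-guest half gives you the label, file system, allocation unit size and capacity. The PowerCLI half gives you the controller and slot to reuse on the temp VM.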

  • This is what our Temp VM looks like in the vSphere Client

  • We give it the same Drive letter in Windows

  • And the same label

  • This is what it looks like in Windows Explorer (we don’t need to write anything to it)

And, that’s all we need from this Temp VM. We will now power it down for the next steps.

We now need to go back to vCenter and DETACH (not DELETE) this disk from the VM.

  • Do not check the “Delete files from datastore” box (we don’t need the Temp VM, but we want the disk to remain available in the datastore for our purposes)

  • We are ready to get rid of the Temp VM.

  • Here is what is left behind – the 120GB VMDK file that used to be our J: in the Temp VM. Of course, because we didn’t write anything to it, it is actually just about 200MB. This is what we were after.
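If you would rather script the detach than click through it, the same steps can be sketched in PowerCLI. This is an illustration under assumptions, not the method used for the screenshots. It assumes an existing `Connect-VIServer` session to the Recovery Site vCenter. `TempDB-Helper` and `Hard disk 2` are hypothetical names for the Temp VM and its placeholder disk.

```powershell
# Sketch only. Assumes VMware PowerCLI and an existing session to the
# Recovery Site vCenter. 'TempDB-Helper' and 'Hard disk 2' are hypothetical.
$tempVm = Get-VM -Name 'TempDB-Helper'

# Hard power-off if the VM is still running; shut the guest down cleanly
# first if you prefer.
if ($tempVm.PowerState -eq 'PoweredOn') {
    Stop-VM -VM $tempVm -Confirm:$false | Out-Null
}

$disk = Get-HardDisk -VM $tempVm -Name 'Hard disk 2'
$disk.Filename   # note this datastore path; you will browse to it from SRM later

# Detach only: omitting -DeletePermanently leaves the VMDK on the datastore.
Remove-HardDisk -HardDisk $disk -Confirm:$false

# Remove the Temp VM from the inventory. Without -DeletePermanently this
# touches no files, which is the most conservative choice.
Remove-VM -VM $tempVm -Confirm:$false
```

Whichever way you do it, confirm in the datastore browser afterwards that the VMDK is still where you expect it.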

  • Now, we go back to our Production vCenter and configure our Production SQL Server for SRM protection. The first step in that process is to configure it to be replicated to the Recovery Site.

 

If you hadn’t been paying attention before, you want to sit up for this. This is where the cost-saving magic happens.

  • We go through the menu until we get to this screen (slow down, don’t get click-happy now…:)

We need to:

  1. Toggle that button that says “Configure datastore per disks”
  2. Unselect the VMDK representing our TempDB volume. It contains data that we don’t need to replicate

Now, we can proceed with the rest of the SRM protection configuration. We will skip the rest of the steps because they are not relevant for our purposes.

Until we get to… where SRM tells us that it’s unhappy with us because we….say it with me… are trying to save too much money?

This is expected. Remember, SRM protects a VM (and everything it contains) as a unit. But, this is what we want. This is what we set out to do. We want to leave that unwanted disk/volume behind at the Protected Site and not have to pay the cost of replicating the useless data it contains.

Let’s make the error go away, make SRM happy and ensure that our protected SQL Server VM will have its J: drive when it’s recovered.

NOTE: For the next several steps, everything we are doing will be done at the Recovery side of SRM.

  • Click on the Protection Group with the error, to bring up its configuration screen

  • Click on “Virtual Machines”
  • Check the box next to the name of the VM in question
  • Then, click “Configure Protection”

This is what it’s complaining about. If it weren’t complaining, we would know that we had done something wrong.

Now, we go in and perform the switcheroo…

  • Click on “Browse” (remember that we are doing all of this at the Recovery Site)

  • Navigate through the Datastore and select the VMDK that we detached from our Temp VM in previous steps and left behind for this purpose.
  • Click OK

Voila!

  • Click OK

Now, our Protection Group is healthy, happy, and ready to be used.

That’s it. Really. That’s all there is to it.

We have protected a production Business Critical Application workload (in our case, a Microsoft SQL Server instance) to a “Cold” recovery site and have avoided the cost associated with replicating data that we have no use for at the Recovery Site. We have achieved our goals for an optimal, cost-efficient BCDR Plan.

When this VM is recovered, it will come up at the Recovery Site with its J: drive attached to it. When SQL Server starts, it will find all its disks attached, and it will start providing the Services it used to provide before the disaster – all things being equal.
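A quick check after a test recovery can confirm both halves of that claim. The sketch below is meant to be run inside the recovered VM. It assumes a default SQL Server instance, whose Windows service is named `MSSQLSERVER`. A named instance uses a different service name.

```powershell
# Sketch only: run inside the recovered VM after a test recovery.
# Assumes the default instance (service name MSSQLSERVER); a named
# instance uses a service name of the form MSSQL$<InstanceName>.
$volume  = Get-CimInstance -ClassName Win32_Volume -Filter "DriveLetter='J:'"
$service = Get-Service -Name 'MSSQLSERVER'

[pscustomobject]@{
    TempDbDrivePresent = [bool]$volume     # True when a J: volume was found
    VolumeLabel        = $volume.Label
    SqlServiceStatus   = $service.Status
}
```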

As with most things related to technologies, especially Servers and Computer-ish stuff, the unexpected can happen. One unexpected thing that could possibly go wrong in the solution we have described above is that when the VM is recovered, Windows may assign a random drive letter to the disk that we just attached. This will result in the failure of SQL Server to start. If it is just one VM, then, yeah, we could just go in and manually reconfigure the drive letter in Windows.

But, we are talking “Enterprise” here. And we’re talking about “Disaster Recovery”. Large-scale manual reconfiguration of recovered VMs is very high on the list of things we don’t want to be doing in the middle of a disaster.

So, how can we account for this?

We will call on our SRM, of course. Remember that SRM gives us the ability to run Scripts inside a VM as part of the recovery process? Yes, we can call a one-line PowerShell Script that assigns our desired drive letter to the disk in Windows, as part of our VM recovery steps.

In our sample script (which we have stored in a replicated volume on the source VM), we identify the disk we want by searching for it by its Label, then we attach to it and change its letter to what we want it to be.

  • Here is what the Script looks like:

Get-CimInstance -ClassName Win32_Volume -Filter "Label='TempDB-Clone'" | Set-CimInstance -Property @{DriveLetter='J:'}
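For a recovery plan that covers many VMs, you may want the script to fail loudly when the volume is not there rather than do nothing. Here is a slightly more defensive variant, offered as a sketch. It uses the same label and drive letter as above. The explicit exit codes assume that the calling step treats a non-zero exit as a failure, which is worth verifying in your own environment.

```powershell
# Sketch only: a more defensive variant of the one-liner.
# 'TempDB-Clone' and 'J:' are the label and letter used in this walkthrough.
$label  = 'TempDB-Clone'
$letter = 'J:'

# Wrap in @() so that zero, one or many matches all come back as an array.
$volumes = @(Get-CimInstance -ClassName Win32_Volume -Filter "Label='$label'")

if ($volumes.Count -ne 1) {
    Write-Error "Expected exactly one volume labeled '$label', found $($volumes.Count)."
    exit 1
}

if ($volumes[0].DriveLetter -ne $letter) {
    # Reassign the letter only when Windows picked a different one.
    $volumes[0] | Set-CimInstance -Property @{ DriveLetter = $letter }
}

exit 0
```

Searching by label rather than by disk number keeps the script independent of the order in which Windows enumerates the disks.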

Here is how we configure SRM to call the script during the recovery steps.

  • Click on “Recovery Plans”
  • Click on the recovery plan containing our protected VM
  • Click on “Virtual Machines”
  • Select the checkbox next to the VM we want to modify
  • Click on “Configure Recovery”

  • Click on “Post Power On Steps”
  • Click “New”

  • Click “Command on Recovered VM” (this is an in-Guest operation)
  • Give the Step a name
  • Type in the command you want to execute in the Guest (in our case, we are calling a PowerShell Script stored in the VM)
  • Then, click “Add”

  • Click “OK” to save our settings
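The command typed into the step might look something like the line below. Treat it as a sketch: the script path `C:\Scripts\Set-TempDbDriveLetter.ps1` is hypothetical, and whether `-ExecutionPolicy Bypass` is acceptable depends on your own security policy. Spelling out the full path to `powershell.exe` avoids relying on the guest's search path.

```powershell
# Hypothetical script path: point -File at wherever your script lives on a replicated volume.
# (This comment is for the reader and is not part of the command itself.)
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -NoProfile -ExecutionPolicy Bypass -File "C:\Scripts\Set-TempDbDriveLetter.ps1"
```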

Here is the Script in our Source (protected) VM

Now, go ahead. Perform a Recovery operation and observe the awesomeness of the VMware Site Recovery Manager’s feature sets and capabilities. Don’t want to replicate unnecessary data, but want the disks to be available after a recovery event? No problem. There’s an App for THAT.

Here are the two VMs with their J: drives, side-by-side. If you look closely, you will notice that there is no “Do-Not-Replicate” folder in the J: drive on the Recovered VM. Well, that’s because we didn’t replicate the J: drive in the first place.