The purpose of the exercise was to demonstrate use cases for disaster recovery of real business-critical applications (BCA) using VMware solutions such as VMware Site Recovery Manager (SRM). Techniques to protect common business-critical applications such as Microsoft Exchange, Microsoft SQL Server, SAP and Oracle Database against disaster are discussed.
Some Background on Site Recovery Manager (SRM)
With the advent of virtualization and the encapsulation of virtual machines, replicating entire business-critical workloads has been greatly simplified. Virtual machines are represented as a set of files that can easily be replicated to the recovery site and re-instantiated on different hardware. VMware SRM combines these unique aspects of virtual machines with replication management and workflows to automate disaster recovery for BCA.
- Automated disaster recovery failover
  - Initiate recovery plan execution from the vSphere Web Client with a single click
  - Halt replication and promote replicated virtual machines for the fastest possible recovery
  - Execute user-defined scripts and pauses during recovery
- Planned migration and disaster avoidance
  - Graceful shutdown of protected virtual machines at the original site
  - Replication synchronization of protected virtual machines prior to migration to avoid data loss
  - Restart of protected virtual machines in an application-consistent state
- Seamless workflow automation with centralized recovery plans
  - Create and manage recovery plans directly from the next-generation user interface of the vSphere Web Client
  - Pre-define the boot sequence of virtual machines for automated recovery
  - Reconfigure IP addresses upon failover at the subnet or individual address level
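The last two capabilities in the list above, a pre-defined boot sequence and subnet-level IP reconfiguration, can be illustrated with a short Python sketch. This is only a conceptual model and not SRM's implementation or API: the VM names, priority numbers, subnets and the helper functions `boot_order` and `remap_ip` are all hypothetical.

```python
import ipaddress


def boot_order(vm_priorities):
    """Return VM names sorted by priority (lower number starts first).

    Ties are broken alphabetically so the order is deterministic.
    """
    ranked = sorted(vm_priorities.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _priority in ranked]


def remap_ip(address, protected_subnet, recovery_subnet):
    """Map an address to the same host offset in the recovery-site subnet."""
    src = ipaddress.ip_network(protected_subnet)
    dst = ipaddress.ip_network(recovery_subnet)
    ip = ipaddress.ip_address(address)
    if src.prefixlen != dst.prefixlen:
        raise ValueError("protected and recovery subnets must be the same size")
    if ip not in src:
        raise ValueError(f"{address} is not inside {protected_subnet}")
    # Keep the host part of the address and swap only the network part.
    offset = int(ip) - int(src.network_address)
    return str(dst.network_address + offset)


if __name__ == "__main__":
    # Hypothetical VMs: infrastructure first, then databases, then mail.
    vms = {"exchange01": 3, "dc01": 1, "sql01": 2, "oracle01": 2}
    print(boot_order(vms))
    print(remap_ip("10.10.1.25", "10.10.1.0/24", "172.16.5.0/24"))
```

Starting domain controllers before databases, and databases before the applications that depend on them, is the usual reason for defining such an order. The subnet rule keeps each host's offset, so a single mapping covers every virtual machine on that network.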
The primary site is located in Wenatchee, Washington State, and the recovery site is located in Cambridge, Massachusetts. Subject matter experts in SAP, Oracle, SQL Server and Exchange have created instances of these applications in the primary site. In addition, infrastructure components such as domain controllers and SRM servers are also deployed. The environment and the virtual machines representing these applications are shown in Figure 1 below:
The recovery site has three vSphere servers hosting local infrastructure such as domain controllers, SRM servers and other local applications. The recovery site also has placeholder virtual machines created by SRM for all the protected virtual machines from the primary site.
The backbone of disaster recovery is replication of the protected workloads from the primary site to the recovery site. Replication comes in multiple flavors.
Synchronous replication is used in active-active environments with zero RPO requirements. Its scope is limited to metro areas because latencies must stay below 10 ms. Every write in the primary site is acknowledged only when it has been written to both sites. This solution is typically very expensive and is used by organizations that require zero RPO. Examples of synchronous replication are EMC VPLEX and NetApp MetroCluster.
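The acknowledgement rule for synchronous replication can be shown with a toy Python model. It is a sketch under simplifying assumptions and does not reflect how any particular array implements it: the `Site` class, the `reachable` flag and `synchronous_write` are hypothetical names, and real products handle link failures in more nuanced ways.

```python
class Site:
    """A toy storage site that keeps blocks in memory."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.blocks = {}

    def commit(self, block_id, data):
        self.blocks[block_id] = data


def synchronous_write(primary, secondary, block_id, data):
    """Acknowledge a write only if it has been committed at both sites.

    Returns True when the write is acknowledged, False otherwise.
    In this simplified model a write is refused outright if either
    site is unreachable, so the two copies never diverge.
    """
    if not (primary.reachable and secondary.reachable):
        return False
    primary.commit(block_id, data)
    secondary.commit(block_id, data)
    return True


if __name__ == "__main__":
    primary, recovery = Site("primary"), Site("recovery")
    print(synchronous_write(primary, recovery, 1, b"order-1001"))
    print(primary.blocks == recovery.blocks)
```

Because every acknowledged write already exists at both sites, no acknowledged data is lost on failover, which is what zero RPO means. The cost is that each write waits on the inter-site round trip, hence the latency limit.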
Asynchronous replication is used for the majority of disaster recovery deployments. The RPO for asynchronous replication can range from a few minutes to hours depending on customer requirements. This replication is usually constrained by the bandwidth between the primary and recovery sites. SRM is typically used with asynchronous replication. Two types of replication are used in SRM deployments:
- Storage Replication: This is array-based replication provided by the storage vendor. It requires the same type and vendor of storage solution at both the primary and recovery sites. Examples are EMC SRDF and NetApp SnapMirror. These solutions are very mature, have been in use for decades, and provide granular features and robust recovery mechanisms. Storage replication with SRM relies on a storage replication adapter (SRA) provided by the storage vendor. SRM communicates with the storage array through this adapter and uses it as a proxy to manage replication on the array.
- vSphere Replication: vSphere Replication is a VMware solution that replicates at the individual virtual machine level and can replicate all of a virtual machine's disks or a chosen subset. The storage backing the virtual machines can be of any type, including local storage, and the primary and recovery sites can use different types of storage. vSphere Replication operates at the VMware kernel level, where any changes to storage are captured and replicated. The RPO for vSphere Replication can range from 15 minutes to 24 hours based on customer requirements. vSphere Replication requires replication appliances that are deployed by SRM. These appliances coordinate the replication of changes between the primary and recovery sites.
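The per-VM settings described above, together with the bandwidth constraint on asynchronous replication, can be sketched in Python. This is a back-of-the-envelope model and not the vSphere Replication API: `ReplicationSettings`, `min_bandwidth_mbps`, the VM name and the disk labels are hypothetical, and the estimate ignores compression, protocol overhead and bursts in the change rate.

```python
from dataclasses import dataclass

# RPO range stated in the text for vSphere Replication.
MIN_RPO_MINUTES = 15
MAX_RPO_MINUTES = 24 * 60


@dataclass(frozen=True)
class ReplicationSettings:
    """Hypothetical per-VM replication settings."""

    vm_name: str
    rpo_minutes: int
    disks: tuple  # the subset of virtual disks chosen for replication

    def __post_init__(self):
        if not MIN_RPO_MINUTES <= self.rpo_minutes <= MAX_RPO_MINUTES:
            raise ValueError("RPO must be between 15 minutes and 24 hours")
        if not self.disks:
            raise ValueError("at least one disk must be selected")


def min_bandwidth_mbps(changed_mb_per_window, rpo_minutes):
    """Lower bound on link speed (megabits per second) needed to ship
    one RPO window's worth of changed data within that window."""
    return changed_mb_per_window * 8 / (rpo_minutes * 60)


if __name__ == "__main__":
    settings = ReplicationSettings("oracle01", 15, ("disk0", "disk1"))
    # Assume 9000 MB of changed data per 15-minute window.
    print(min_bandwidth_mbps(9000, settings.rpo_minutes))
```

If the link cannot sustain this rate, changes accumulate faster than they can be shipped and the configured RPO cannot be met. The options are then a longer RPO, fewer replicated disks, or more bandwidth.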
Both of these replication methodologies provide crash-consistent recovery. The application has not been quiesced, so recovery is akin to that of a machine following a power outage. There is a small probability of data corruption for database-like workloads. In a later section, where we discuss individual business-critical applications, we shall look at how application-level replication can be used to protect against the risk of data corruption.
vSphere Replication is used in this DR deployment exercise and is configured to replicate each of the virtual machines individually. The example below shows the replication configuration for the Oracle virtual machine. The RPO chosen is 15 minutes, and the target location in the DR site is specified.
Similarly, replication is set up for all virtual machines belonging to applications that need to be protected. As part of the vSphere Replication process, a relationship is established between the vSphere Replication servers in the primary and recovery sites. As a first step, an initial sync is run for all the virtual machines and their disks. Once the initial sync is complete, replication happens periodically based on the RPO setting. Through the SRM interface, one can observe the status of all replicated virtual machines in the recovery site as shown below:
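The initial sync followed by periodic replication of changes can be pictured with a small Python model of changed-block tracking. It is a conceptual sketch and not how vSphere Replication is implemented internally: the `ReplicatedDisk` class and its methods are hypothetical, and a real system would trigger each sync cycle on a schedule derived from the RPO.

```python
class ReplicatedDisk:
    """Toy model of a disk replicated by shipping only changed blocks."""

    def __init__(self, blocks):
        self.source = dict(blocks)   # blocks at the primary site
        self.replica = {}            # blocks at the recovery site
        # Before the initial sync, every block counts as changed.
        self.dirty = set(self.source)

    def write(self, block_id, data):
        """Apply a write at the primary site and remember the block."""
        self.source[block_id] = data
        self.dirty.add(block_id)

    def sync(self):
        """Ship all changed blocks to the replica; return what was sent."""
        sent = sorted(self.dirty)
        for block_id in sent:
            self.replica[block_id] = self.source[block_id]
        self.dirty.clear()
        return sent


if __name__ == "__main__":
    disk = ReplicatedDisk({0: "boot", 1: "data", 2: "logs"})
    print(disk.sync())        # initial sync copies every block
    disk.write(1, "data-v2")
    print(disk.sync())        # later cycles copy only changed blocks
```

The first cycle transfers the whole disk, which is why the initial sync takes the longest. Later cycles move only the blocks written since the previous cycle, so their size depends on the change rate and not on the disk size.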
End of Part 1: Access Part 2 here.