Why vSAN for Disaster Recovery?
VMware vSAN is VMware’s radically simple storage solution for hyperconverged infrastructure (HCI). vSAN and VMware vSphere provide a complete, natively integrated platform consisting of compute, network, and storage resources for a wide variety of use cases including disaster recovery. Deploy on inexpensive industry-standard x86 server components to remove large, upfront investments. Since disks internal to the vSphere hosts are used to create a vSAN datastore, there is no dependency on external shared storage hardware. This helps reduce the total cost of the solution while providing sufficient capacity, reliability, and performance.
vSAN is built on an optimized I/O data path in the vSphere hypervisor for exceptional performance. It is managed as a core component of a vSphere environment, so separate administration tools and connections are not required. This simplifies management, particularly in locations that have little or no local IT staff, such as a disaster recovery site.
VMware vSphere Replication™ provides asynchronous virtual machine replication with recovery point objectives (RPOs) as low as five minutes. Replication is configured on a per-virtual machine basis, enabling precise control over which workloads are protected. This approach avoids the need to provide excess capacity at a disaster recovery site to accommodate an all-or-nothing replication solution. Furthermore, there is no requirement to have the same type of storage at both sites, which enables more deployment options.
As an example, consider four 200GB virtual machines on a single LUN at the production site. Disaster recovery protection is needed only for two of the virtual machines. With array replication, the entire LUN (all virtual machine data) is replicated. vSphere Replication can replicate just the two virtual machines needing protection, which reduces capacity requirements at the disaster recovery site and wide area network (WAN) bandwidth consumption.
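The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration: the virtual machine names are hypothetical, and the calculation assumes each virtual machine consumes its full 200GB, ignoring thin provisioning, compression, and replication overhead.

```python
# Illustrative capacity comparison for the four-VM example above.
# VM names are hypothetical; each VM is assumed to consume its full 200 GB.
VM_SIZE_GB = 200
vms_on_lun = ["vm-a", "vm-b", "vm-c", "vm-d"]
vms_needing_protection = {"vm-a", "vm-b"}


def replicated_capacity_gb(vm_names, size_gb=VM_SIZE_GB):
    """Capacity needed at the recovery site to hold the given VMs."""
    return len(vm_names) * size_gb


# Array replication copies the entire LUN, so every VM on it is replicated.
array_replication_gb = replicated_capacity_gb(vms_on_lun)

# Per-VM replication copies only the VMs selected for protection.
per_vm_replication_gb = replicated_capacity_gb(
    [vm for vm in vms_on_lun if vm in vms_needing_protection]
)

savings_gb = array_replication_gb - per_vm_replication_gb

print(f"Array replication:  {array_replication_gb} GB")
print(f"Per-VM replication: {per_vm_replication_gb} GB")
print(f"Capacity avoided:   {savings_gb} GB")
```

Under these assumptions, per-virtual machine replication halves the capacity required at the disaster recovery site, and the initial data transfer over the WAN shrinks accordingly.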
Virtual machine-centric storage policies can be created and assigned for various workload types. Policies are based on the availability and performance services provided by vSAN. These policies can be modified and reassigned, as needed, with no downtime. vSphere Replication supports storage policies. When configuring replication, a storage policy is selected, and the configured storage policy is automatically assigned to the virtual machine when it is recovered.
A variety of data protection solutions are available to back up and recover virtual machines and applications in a vSAN cluster. Many of these solutions include the capability to replicate backup data to a disaster recovery site. Having the backup data at the production and disaster recovery sites facilitates recovery from a variety of downtime scenarios.
Backup and recovery solutions can be used in the same environment as vSphere Replication. Depending on business requirements, some virtual machines could be protected from disaster by vSphere Replication and others by replicating backup data to a remote site. This approach provides flexibility in recovery times and capacity consumption at the disaster recovery site. For example, Tier-1 workloads can be replicated with vSphere Replication, which offers faster recovery times than restoring from backup but consumes more capacity at the disaster recovery site. Tier-2 workloads can be backed up locally and the backup data replicated to the disaster recovery site. It will take longer to restore a virtual machine from backup data, but the backup data will likely consume less storage capacity due to the deduplication and compression features that are built into some data protection solutions. Note that the capacity consumed by vSphere Replication replicas can also be reduced by using vSAN deduplication and compression in all-flash configurations.
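The tiered approach described above can be expressed as a simple planning rule. The sketch below is hypothetical: the `Workload` class, the workload names, and the method labels are invented for illustration and do not correspond to any vSphere Replication or backup product API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """A virtual machine workload and its business tier (hypothetical model)."""
    name: str
    tier: int


def protection_method(workload: Workload) -> str:
    """Pick a protection method by tier, following the example in the text."""
    if workload.tier == 1:
        # Faster recovery, but more capacity used at the recovery site.
        return "vsphere_replication"
    # Slower restore, but backup data is often deduplicated and compressed.
    return "replicated_backup"


# Hypothetical inventory used only to demonstrate the rule.
workloads = [
    Workload("orders-db", tier=1),
    Workload("web-frontend", tier=1),
    Workload("file-share", tier=2),
]

plan = {w.name: protection_method(w) for w in workloads}

for name, method in plan.items():
    print(f"{name}: {method}")
```

In practice the decision would also weigh recovery time objectives and available capacity. The point here is only that both methods can coexist in one protection plan.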
Automation with Site Recovery Manager
VMware Site Recovery Manager™ can be used with vSAN and vSphere Replication to orchestrate the recovery of multiple virtual machines. Automation further reduces recovery times and minimizes risk by eliminating manual, error-prone processes. Site Recovery Manager provides precise control over the startup order of virtual machines, and it automates IP address changes when virtual machines are failed over. Testing recovery plans with Site Recovery Manager is non-disruptive, which enables frequent testing. Frequent testing leads to higher levels of confidence that recovery will work as planned when needed. History reports are generated with every test and failover event, providing documentation to satisfy organizational and regulatory requirements.
vSAN Stretched Clusters and Site Recovery Manager
For higher levels of resiliency across three sites, consider the use of a vSAN stretched cluster with Site Recovery Manager. For example, two production locations 100 kilometers apart could each house one half of a stretched cluster to protect against the failure of either location. A third location farther away hosts a second vSAN cluster to supply compute, storage, and network resources for recovered virtual machines, as well as any workloads that run on a regular basis at the disaster recovery site.
A vSAN stretched cluster requires a “witness host”, which can be deployed as a virtual appliance (vSphere running in a virtual machine). The witness host serves as a tie-breaker in certain situations, such as loss of network connectivity between the two locations that make up a stretched cluster. The witness host cannot be located at either of the two sites that make up the stretched cluster, so the disaster recovery site is the natural place to host this virtual appliance. Other workloads running at a disaster recovery site might include test and development, virtual desktops, email, directory services, and DNS.
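The tie-breaker role of the witness host can be illustrated with a simplified majority-vote model. This is a conceptual sketch only: it assumes one vote for each of the two sites and one for the witness, and it is not a description of how vSAN actually assigns or counts votes. The names `VOTES` and `has_quorum` are hypothetical.

```python
# Simplified quorum model: two data sites plus a witness, one vote each.
# This is an assumption for illustration, not vSAN's internal mechanism.
VOTES = {"site_a": 1, "site_b": 1, "witness": 1}


def has_quorum(reachable_members):
    """Return True if the reachable members hold more than half of all votes."""
    total_votes = sum(VOTES.values())
    held_votes = sum(VOTES[member] for member in reachable_members)
    return held_votes * 2 > total_votes


# The link between the two sites fails; the witness can still reach site_a.
print(has_quorum({"site_a", "witness"}))  # site_a side keeps running
print(has_quorum({"site_b"}))             # isolated site_b cannot win the tie

# The witness alone is lost; the two sites together still hold a majority.
print(has_quorum({"site_a", "site_b"}))
```

Without the third vote, a split between the two sites would leave each side holding exactly half, and neither could safely continue. That is why the witness belongs at a location separate from both sites.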
Since stretched clusters essentially utilize synchronous replication between the two locations, an RPO of zero is achieved. That means no loss of data if one of the locations in the stretched cluster is offline. vSphere HA automates the recovery of virtual machines affected by an outage at either location in the stretched cluster. Recovery time for these virtual machines is typically measured in minutes.
Replication from the stretched cluster to the disaster recovery site is facilitated by vSphere Replication. As mentioned previously, per-virtual machine RPOs for replication between two vSAN datastores can be as low as five minutes. Site Recovery Manager automates the failover and failback processes between the stretched cluster and the disaster recovery site.
vSAN Performance
vSAN is uniquely embedded in the vSphere hypervisor kernel. It is able to deliver the highest levels of performance without taxing the CPU or consuming high amounts of memory resources, as compared to other solutions requiring storage virtual machine appliances that run separately on top of the hypervisor. An all-flash vSAN configuration will naturally provide the highest performance, which translates to lower recovery times. This video demonstration shows a 4-node all-flash vSAN cluster recovering 1000 virtual machines in just under 30 minutes:
Recover 1000 VMs in 26 mins with SRM & VR on vSAN
Summary
VMware vCenter Server™, vSphere, and vSAN collectively create the best platform for running and managing virtual machine workloads requiring predictable performance and rapid recovery in the event of a disaster. The integration of vSAN with vSphere simplifies administration through storage policy-based management. Business-critical workloads such as websites, e-commerce applications, databases, employee remote access, and communications can benefit from shared storage without the cost and complexity of dedicated storage hardware. Site Recovery Manager can automate virtual machine migration and disaster recovery through tight integration with vSphere Replication. This includes precise virtual machine startup orders, IP address changes, and the generation of history reports that document testing, failover, and failback operations. The health and performance levels of a vSAN datastore are constantly monitored to lower risk before, during, and after a disaster recovery event. If more capacity is needed, it is simple to add using a scale-up or scale-out approach without incurring downtime.