This is a repost from Hany Michael’s personal blog, Hypervizor.com.
Nearly six years ago, when VMware first introduced Site Recovery Manager (SRM), it was quite a big hit in the world of enterprise customers. Being a VMware customer myself at that time, I remember how this represented a significant improvement in how we handle our DR scenarios. Specifically, we shifted from long and exhausting runbooks full of manual instructions to simply constructed recovery plans. In fact, I was so grateful for this solution that I immediately started evangelizing about it and created a full series of videos, covering everything from installation all the way to configuration.
Fast-forward six years, and the conversations that I have with my customers are completely different now. Disaster recovery for them is already a given. SRM is a part of almost every environment that I have seen over my past 5 years at VMware, and customers use it on a regular basis for planned or unplanned VM recovery. What has changed since then, and what customers require today, are the questions that this blog post and the associated reference architecture will hopefully answer.
Business Drivers
The greatest challenge that I see in customer environments today is around the operational part after a disaster strikes. Recovering virtual machines from one site to another could not be simpler today with the great enhancements of SRM, especially when you combine it with vSphere Replication (VR). However, the problems that almost all my customers face today are primarily around IP address changes and how that breaks, to a large extent, how the applications work all together.
Another challenge faced by customers with private cloud models is the day-two operations through their cloud management platform (CMP). If you have a CMP like vRealize Automation (formerly vCloud Automation Center), and your end users – like developers, QEs, or application owners – access it on a daily basis, how do you get that platform up and running after a disaster recovery, and most importantly, what is your recovery time objective (RTO) for that?
Think of it this way: If you are a large bank, service provider, or telco that has already adopted cloud computing and achieved all the great benefits of it, how do you ensure a fast RTO for both your applications and the cloud platform itself, so you can continue your day-to-day business operations after a disaster? What is the use of your BCDR plan and architecture if you cannot resume your normal business at the recovery site exactly as you used to do in your protected site? Let’s take a practical example here to put things into perspective. You have a development department in your business and your developers use vRealize Automation (vRA) on a daily basis to provision applications (not just VMs). If this department is of great importance to your business, how will you be able to provide them with the same platform to resume their work the very next day after a disaster? Are you going to recover the same environment on your DR site? Or are you going to build another vRA instance? If it’s the former, how fast can you recover your vRA platform, and if it is the latter, how will you be able to maintain the same configuration across the two sites? Now take that example and try to apply it to different departments in your organization, like IT Ops responding to business requests for new apps through vRA, or your NOC that has already established specific monitoring dashboards through vRealize Operations, and so forth.
Objectives and Outcomes
If you haven’t already done so, I would highly recommend checking out VMware’s IT Outcomes at this link. In a nutshell, VMware has grouped a number of IT outcomes and mapped them to how organizations can deliver more business value. Our solution here, in turn, maps to the following three IT outcomes:
– High Availability and Resilient Infrastructure
– Streamlined and Automated Data Center Operations
– Application and Infrastructure Delivery Automation
To translate that into clear and concise objectives, and to set the stage for the following sections in the article, here is what we are trying to achieve the very same day, or the next day at the latest, after a disaster recovery is complete:
1) Enable our application owners to access the same CMP (vRA portal in our case) to resume their day-to-day operations, like accessing their VM consoles, creating snapshots, adding more disk space, etc.
2) Enable our developers to continue their test/dev VM provisioning the exact same way they did before the failover. That includes, but is not limited to, using the very same Blueprints, Multi-machine templates, NSX automated network provisioning, etc.
3) Enable our IT-Ops to provision VMs (IaaS) or applications (PaaS) through the vRA portal without altering any configurations like changing IP addresses or hostnames, etc.
4) Enable our NOC/Helpdesk to monitor the applications, services and servers through vRealize Operations Manager (vROM) the same way they did in the original site, with no changes in how they access the portal or the default and custom dashboards previously created, and no loss of historical data.
5) Enable our higher management to access vRealize Business to see financial reports on the environment and cost analysis the same way they did before the disaster and failover took place.
If you are motivated at this point but you feel that it’s all marketing talk, it’s time to dig deeper into the proposed architecture, because from this point onwards, it cannot get more technical.
The Architecture
First and foremost, I always like to stress the fact that all of my reference architectures are validated before I publish them. As a matter of fact, in order to create this architecture, I spent nearly two weeks on testing and validation on our internal VMware cloud, and the end result was 40+ VMs. These ranged from physically simulated devices (like edge routers) to management components (like vCenter Servers and SRM), all the way to nested VMs to simulate the actual applications being failed over and failed back across sites. Now, let’s examine this architecture in detail.
Starting from the top to bottom, the following are the major components:
1) Datacenters: This represents a typical two-DC environment in remote locations. The first is the protected site in Cairo, and the second is the recovery site in Luxor. This could still be two data centers in a metro/campus location, but that would be just too easy to architect a solution for. It is also worth mentioning that, in my current cloud lab, I have this environment in a three-datacenter architecture, which works pretty much the same. I just didn’t want to overcomplicate the architecture or the blog post, but know that this solution works perfectly well with 2+ datacenters.
2) The vRealize Suite Infrastructure: These are the vRS virtual appliances/machines that should typically be running in your management cluster. That’s your vRA front-end appliances, your vRA SSO, the IaaS components, and the database, plus the other components in the vRealize Suite like vR-Operations and Business. What is different here is that we are connecting these VMs to a logical switch created by NSX, and it is represented in the diagram by (VXLAN 5060). You will see in a minute why this platform is abstracted from the physical network. Another important note to point out here is that this vRS environment could still be distributed and highly available; in fact, I have the vRA appliances load-balanced in my lab with a one-arm Edge Services Gateway (ESG). For detailed information about architecting a distributed vRA solution, you can check my blog post here. For simplicity, I included the vRA nodes in a standalone mode, but the distributed architecture will work just fine in our scenario here.
3) The Virtual & Network Infrastructure: In the third layer of this architecture come the management components of the virtual infrastructure, such as vCenter Server, SRM, NSX Manager, and so forth. Everything you see in this layer is relevant to each datacenter independently. For example, we will never fail over the vCenter Server from one site to another. The same thing holds true for the infrastructure services like DNS or Active Directory domain controllers. I will be talking in detail about the networking and routing subjects in another section, but for now, I would like to point out that we have a traditional IP fabric for the management workloads, represented here by VLAN 110 and subnet 192.168.110.0/24 in the Cairo datacenter, while we have VLAN 210 and subnet 192.168.210.0/24 in the Luxor datacenter. The two networks are routed using a traditional MPLS connection or any L3 WAN cloud, depending on how your environments are designed.
4) Resources: In this layer, you can see our vSphere clusters. We have here a management cluster for running your VMware-related or infrastructure services workloads. The second cluster is your production workloads cluster, which runs your business applications. The third and last cluster here is your test/dev environment. This three-cluster architecture doesn’t have to be exactly the same for you. For example, some of my customers run their management workloads along with the production applications. Other customers separate the management from production clusters but run their UAT environment along with their production workloads. These are all valid design scenarios, with pros and cons for each choice that are beyond our scope here. What I just want to point out is that this solution will work just fine with any of those three architecture choices.
5) SRM: This is just an illustration layer showing the SRM constructs in terms of Protection Groups and Recovery plans and their association with the operations layer beneath.
6) Operations: This layer details the various operations related to this architecture, from application owner provisioning all the way to the SRM admin recovery of workloads. We will come to these operational subjects in detail later in the article.
Now that we have had an overview of the architecture, it is time to discuss these subjects in detail.
It’s all about ‘Abstraction’
To build on what I have mentioned at the beginning of this article, our main goal here is to abstract as much infrastructure as possible in order to achieve the required flexibility in our design.
What you see here is that the entire vRealize Suite is abstracted from the traditional management network/portgroup (VLAN-backed) to an NSX Logical Switch (VXLAN-backed). This includes vRealize Automation, Operations, Business and Log Insight.
Traditionally, you would connect these components to a VLAN-backed portgroup, which is most likely the same as your vCenter Server network. We are not doing this here because we want to maintain the same IP addressing for all these components and, in turn, avoid the requirement to change them when we fail over to the second site.
If you look closely at the architecture, you will see that the NSX Logical Switch (or VXLAN) on each site has the same IP subnet, which is 172.16.0.0/24. The internal interface (LIF) of the DLRs at each site also has the same IP address, which is 172.16.0.1. This is the default gateway of all the vRealize Suite components. When you fail over those VMs from one site to another, firstly, they will maintain the same IP addresses, and secondly, they will still have the same default gateway (the DLR LIF interface, that is). Furthermore, they will already be configured with two DNS servers: 192.168.110.10 and 192.168.210.11. These are the two existing Active Directory domain controllers sitting at each site. In case of a disaster in the first site, 192.168.110.10 will be gone but 192.168.210.11 will still be alive on the second site that the vRS VMs are being failed over to. And guess what, all the DNS records are already the same since they are replicated across the two domain controllers. That’s basic AD functionality. For example, vra-portal.hypervizor.com resolves to 172.16.0.12, which is the vRA appliance. As another example, vra-sso.hypervizor.com resolves to 172.16.0.11, and that entry already exists on both DNS servers.
With that, we are achieving the maximum flexibility and the very least amount of changes required after a failover takes place. We are maintaining the same IP addresses of the vRS nodes, we are maintaining the same default gateway, and we are also maintaining the name server configuration along with all the DNS records between sites.
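To make the addressing scheme concrete, here is a minimal Python sketch that models the two sites and checks the invariants described above: an identical vRS subnet, an identical DLR internal gateway, and a DNS server that survives the loss of either site. The dictionary layout and the function names are illustrative assumptions, not part of any VMware API, and the sketch assumes each domain controller sits on its site's management subnet. The addresses are the ones used in this architecture.

```python
import ipaddress

# Addresses come from the architecture; the data structure itself is hypothetical.
SITES = {
    "cairo": {
        "mgmt_subnet": "192.168.110.0/24",
        "vrs_subnet": "172.16.0.0/24",
        "dlr_internal_lif": "172.16.0.1",
    },
    "luxor": {
        "mgmt_subnet": "192.168.210.0/24",
        "vrs_subnet": "172.16.0.0/24",
        "dlr_internal_lif": "172.16.0.1",
    },
}

# DNS servers configured on every vRS VM (one domain controller per site).
VM_DNS_SERVERS = ["192.168.110.10", "192.168.210.11"]

# DNS records mentioned in the article, replicated across both domain controllers.
DNS_RECORDS = {
    "vra-sso.hypervizor.com": "172.16.0.11",
    "vra-portal.hypervizor.com": "172.16.0.12",
}


def abstraction_is_consistent(sites):
    """True if all sites share one vRS subnet and one gateway inside that subnet."""
    subnets = {site["vrs_subnet"] for site in sites.values()}
    gateways = {site["dlr_internal_lif"] for site in sites.values()}
    if len(subnets) != 1 or len(gateways) != 1:
        return False
    subnet = ipaddress.ip_network(subnets.pop())
    gateway = ipaddress.ip_address(gateways.pop())
    return gateway in subnet


def surviving_dns_servers(sites, failed_site, vm_dns_servers):
    """Return the configured DNS servers that are not located in the failed site."""
    failed_net = ipaddress.ip_network(sites[failed_site]["mgmt_subnet"])
    return [ip for ip in vm_dns_servers if ipaddress.ip_address(ip) not in failed_net]


def records_in_subnet(records, subnet):
    """True if every DNS record points at an address inside the given subnet."""
    net = ipaddress.ip_network(subnet)
    return all(ipaddress.ip_address(ip) in net for ip in records.values())
```

Calling surviving_dns_servers(SITES, "cairo", VM_DNS_SERVERS) leaves only the Luxor domain controller, which is exactly the situation the failed-over VMs find themselves in.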
Routing and Switch-over
So how does the routing work here and how do you switch-over from one site to another after a disaster? Let’s have a look.
You will see in the architecture that the external (uplink) interface of the DLR is connected to the traditional management VLAN in each site. In the case of the Cairo datacenter, it is 192.168.110.5. The default gateway of the latter is the actual L3 router/switch in that site (like a Cisco Nexus 7K in a typical core/aggregation/access design, or a Nexus 9K in a spine/leaf architecture). That DLR needs no further routing to be configured to go out to the physical network, but it needs a route back from that physical network to the abstracted virtual networks (172.16.0.0/24 in our case). This is exactly why we need to have a static route on that L3 device to say: in order to reach the 172.16.0.0/24 network, you have to go through the next hop router, which is the DLR external interface (192.168.110.5). Easy enough, that’s basic networking. Of course, you can configure and use dynamic routing (like OSPF) to exchange the routing information dynamically between the DLR and its upstream L3 device. We do not really need that here since it’s just one network that is static in nature and does not change.
Now, as long as the vRS is “living” in the Cairo datacenter, the static route on your L3 device will be active there. But what happens when we fail over to a second site? The answer is easy. At that point, the Cairo datacenter will be out of the picture, so you can adjust the routing on your DR site, which is Luxor in our case here. This is done simply by adding a static route entry like the one above: in order to reach the 172.16.0.0/24 network, your next hop router is 192.168.210.5. This is the DLR external interface that is already sitting idle there in the Luxor site. That very step could quite easily be part of your physical network team’s switch-over procedures when a disaster recovery is declared.
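The switch-over logic can be illustrated with a toy routing table in Python. This is only a sketch of the idea, not a device configuration: the RoutingTable class, the switch_over_to_luxor function, and the "wan-uplink" default route on the Luxor device are hypothetical names introduced for illustration, while the prefix and next-hop addresses are the ones from the architecture.

```python
import ipaddress


class RoutingTable:
    """A toy routing table with longest-prefix-match lookup."""

    def __init__(self):
        self._routes = {}  # maps ip_network -> next hop (string)

    def add_static_route(self, prefix, next_hop):
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def remove_route(self, prefix):
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def next_hop(self, destination):
        """Return the next hop of the most specific matching route, or None."""
        dest = ipaddress.ip_address(destination)
        matches = [net for net in self._routes if dest in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


VRS_NETWORK = "172.16.0.0/24"
CAIRO_DLR_UPLINK = "192.168.110.5"
LUXOR_DLR_UPLINK = "192.168.210.5"

cairo_l3 = RoutingTable()
luxor_l3 = RoutingTable()

# Normal operations: only the Cairo L3 device points at its local DLR.
cairo_l3.add_static_route(VRS_NETWORK, CAIRO_DLR_UPLINK)
# Hypothetical default route standing in for the MPLS/L3 WAN path out of Luxor.
luxor_l3.add_static_route("0.0.0.0/0", "wan-uplink")


def switch_over_to_luxor():
    """Model the network team's step once disaster recovery is declared."""
    cairo_l3.remove_route(VRS_NETWORK)
    luxor_l3.add_static_route(VRS_NETWORK, LUXOR_DLR_UPLINK)
```

Before the switch-over, a lookup for the vRA appliance (172.16.0.12) on the Luxor device falls through to the WAN default route; afterwards, the more specific /24 entry wins and traffic is handed to the local DLR uplink.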
Replication, Protection and Recovery
We are, of course, using vCenter Site Recovery Manager here as the engine for orchestrating and automating our workload failover. There are a few things to examine here:
1) Replication:
I am leveraging vSphere Replication 5.8 here, mainly for the incredible flexibility it gives us, not to mention the great enhancements and performance improvements in the latest 5.8 release. We basically need to set up the replication of those vRS VMs from Cairo to Luxor only once, and then set up our Protection Groups. If you already have array-based replication, and you are quite comfortable with it, then by all means you can still use it here. A typical configuration in this case would be to gather all your vRS VMs into one LUN and set your replication to the secondary site. The same Protection Group configuration can follow that.
2) Infrastructure Mapping:
It is important to set your infrastructure mapping between the two sites before you proceed to the Protection Groups configuration. Failing to do so could lead to some generic errors that you might not troubleshoot easily. For example, if you do not map your NSX Logical Switches together, the Protection Group configuration (in the next step) will fail with a generic error to check your settings.
3) Protection Groups and Recovery Plans:
After you set your replication for the VMs and map your infrastructure items, the next step is to set up your Protection Groups. In our case here, we are configuring two protection groups. The first is for the vRA nodes and will consist of the vRA Virtual Appliance, vRA SSO Appliance, Windows IaaS VM and the vRA Database. We are including the DB VM since it is a vital component of the vRA instance and it has to move along with the other VMs. You have the option here to do DB-based replication (like SQL Log Shipping) if you feel more comfortable with that. Needless to say, by replicating the entire VM, you guarantee a faster and automated recovery of the vRA instance.
The second Protection Group will contain the other vRealize Suite components like vRealize Operations, Business and Log Insight. You could combine those virtual appliances with the previous protection group, but for better and easier management, we are segregating them into two.
Next come the Recovery Plans. The configuration is fairly simple here: you point your recovery plan to the protection group. You could have one Recovery Plan here containing both the vRA and vRS Protection Groups mentioned above.
All the above is relevant to the replication, protection and recovery of the vRS infrastructure. With the production workloads, we will be adopting a different configuration mechanism that will be, to some extent, automated via the vRA end-user portal. I will explain that in the next part of this blog post.
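As a rough illustration of how the mapping and grouping steps above fit together, the following Python sketch models the two Protection Groups, a single Recovery Plan, and a pre-flight check for unmapped networks. It does not use the SRM API; the class names, VM names, and logical switch names are all hypothetical and chosen only for readability.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    network: str  # port group or logical switch name at the protected site


@dataclass
class ProtectionGroup:
    name: str
    vms: list = field(default_factory=list)


@dataclass
class RecoveryPlan:
    name: str
    protection_groups: list = field(default_factory=list)

    def vms(self):
        """All VMs covered by this plan, across its protection groups."""
        return [vm for pg in self.protection_groups for vm in pg.vms]


def unmapped_networks(plan, network_mappings):
    """Protected-site networks used by the plan's VMs that lack a recovery-site mapping."""
    return sorted({vm.network for vm in plan.vms() if vm.network not in network_mappings})


# Hypothetical logical switch names for the vRS network at each site.
NETWORK_MAPPINGS = {"vrs-ls-cairo": "vrs-ls-luxor"}

vra_group = ProtectionGroup("vRA", [
    VM("vra-appliance", "vrs-ls-cairo"),
    VM("vra-sso", "vrs-ls-cairo"),
    VM("vra-iaas", "vrs-ls-cairo"),
    VM("vra-db", "vrs-ls-cairo"),
])
vrs_group = ProtectionGroup("vRS", [
    VM("vr-operations", "vrs-ls-cairo"),
    VM("vr-business", "vrs-ls-cairo"),
    VM("log-insight", "vrs-ls-cairo"),
])

plan = RecoveryPlan("vRealize Suite", [vra_group, vrs_group])
```

Running unmapped_networks(plan, NETWORK_MAPPINGS) before configuring the groups returns an empty list when every logical switch has a counterpart at the recovery site, which is the condition that avoids the generic mapping error mentioned earlier.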
Static vs. Dynamic net/sec configurations
You may have already noticed that we have two types of clusters, one designated as “NSX: Automated” and the other not. To explain what that means, we have to look at the function of each cluster first.
The Production cluster in this architecture is designated to host workloads that are dynamically provisioned through vRA; however, we do not need to automate the underlying networking for it. In other words, if we have a typical multi-tier application with Web, DB and App VMs, those tiers will already have pre-provisioned VXLANs (or VLANs). In the case of VXLANs, or Logical Switches as we call them in the context of NSX, you can simply pre-configure them on the other side as well. This is pretty much what we did for the vRA infrastructure itself. If you look closely at that vRA app, you can more or less consider it a typical enterprise application with Web (vRA Appliance), DB (MS-SQL) and Application (Manager Service) tiers. These are static components that do not and will not change in nature. With that said, pre-configuring your networking on both sides is done only one time and that is it.
On the other hand, in the world of test/dev dynamic provisioning, you would need to dynamically provision the networking and services around your applications. Let’s say you are developing a SharePoint application. You would need not just to provision multiple instances of this app with the same networking requirements (e.g. the same IP addressing in isolated networks), but also to provision NSX Edge devices to load balance its web tier, as an example.
Now, since we cannot auto-sync the NSX configurations across sites (yet!), the test/dev workloads will not be failed over to the second site. Yes, you are losing those VMs in the case of a disaster, but how many customers currently do VM recovery for their test/dev workloads? At the same time, we still have an advantage here when you look at the overall architecture. Your developers will still be able to provision test/dev workloads in the very same way they always do after a disaster recovery. The reason is that you will always have the spare capacity sitting in the DR site and your blueprints ready to provision workloads from the vRA portal that has shifted to that site after the recovery.
Nevertheless, there is an internal effort at VMware currently in the works to allow this type of net/sec configuration synchronization across sites. If this is a hard requirement for you, you will still be able to do it once this mechanism (which will be driven by vRealize Orchestrator) is available. The key point here is that you do not have to change anything in this architecture; it will remain your foundation and can support net/sec sync across sites later (should you require that).
Interoperability Matrix
Before I conclude this first article of two, I would like to go through the interoperability matrix of the products in this architecture.
We have vSphere 5.5 as the foundation of everything, which translates to vCenter Server 5.5 and ESXi 5.5. We have vCenter Site Recovery Manager 5.8 along with vSphere Replication 5.8 for the DR automation, orchestration and replication. We then have the vRealize Suite 6.0, which consists of vRealize Automation 6.2, vRealize Operations Manager 6.0 and vRealize Business. Everything just mentioned is part of the vCloud Suite 5.8. The last component, and the most important I would say, is NSX 6.1 for vSphere.
One of the common questions I have received internally at VMware when I showcased this solution is whether it will work with vSphere 6.0 or not. The answer is absolutely yes! In fact, with vSphere 6.0 you would be able to take this to the next level and start live-migrating the entire vRealize Suite across sites without any service interruption. Think about situations like datacenter-level maintenance or DC migrations/consolidation and how efficient that would be in terms of uptime and business continuance.
Conclusion
In this article I’ve explained how advantageous it is to abstract your vRealize Suite components into NSX-driven virtual networks. By doing so, we have demonstrated how fast and reliable it is to recover your entire cloud infrastructure and the operational model around it in a matter of hours rather than days or weeks. We have done that without the need to change any settings, execute runbooks or stand up a new stack of software.
In the next part of this article, I will go into detail on the vRA configuration and the recovery operations of the cloud workloads. To complete the picture for you, I will also list the frequently asked questions that I have been receiving from colleagues and customers when presenting this solution to them.