Deploying a Resilient Infrastructure for Core Services at Remote Offices

by: Senior Manager – Cloud Infra Zahari Slavov

Read about the strategy for a resilient infrastructure across VMware’s remote offices in part one of this blog, Managing a Remote Office.

Providing resiliency for the core infrastructure services for VMware remote offices across the globe is critical to our company operations. We had reached a point where a global standard was required due to the growing number of remote sites. To establish this standard, we looked at three main issues:

Defining core infrastructure services. Active Directory, DNS, DHCP, printing, proxy services, load balancing, and security were obvious core services. Deciding what else counted as core was a challenge and prone to scope creep, so the final list was winnowed down to reflect the shared business-critical needs of all remote sites.
Selecting a common delivery platform. Hard drives, servers, the network, and power are part of a common platform. They had to be able to sustain multiple failures and continue to operate until the issue was remediated. Hyper-converged infrastructure (HCI) technology was chosen as the foundational technology for its ease of deployment, management, and support and its compact footprint.
Reducing rack space. Another key consideration was shrinking the physical hardware footprint down from our current 10 rack units (RUs). Less equipment would translate to lower temperatures in the rack, which would require less cooling. We hoped to open up rack space to other teams.

HCI: The Solution

Our plan combined two HCI solutions: the Dell EMC VxRail HCI appliance, which combines Dell EMC technology with VMware software, and the Dell EMC VMware vSAN Ready Nodes. This technology merges servers, storage, and networking into a single solution, which is easier to deploy and manage. Our footprint was reduced from 10 RUs to 4 RUs, freeing up 6 RUs of rack space for other uses.

Under our global architecture, each site has its own management stack with 24×7 monitoring. Consistent procedures were launched across all sites, regardless of the appliance or location. This approach provides resiliency; failures can be contained to a site and not affect an entire region.

In addition, hardware redundancy ensures that another office can automatically take over service control in the event of a failure. A set of load balancers, along with DNS, DHCP, and IPAM (DDI) management software, automatically redirects services from the office with the failure to a working location. The customer impact is minimal in most cases.
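
The redirect behavior described above can be pictured with a small sketch. It is only an illustration of health-based failover selection, not the logic of the load balancers or DDI software we run; the Site class, the priority field, and the site names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str        # hypothetical site label
    healthy: bool    # result of the most recent health check
    priority: int    # lower value = preferred failover target (assumption)


def pick_serving_site(home, sites):
    """Return the site that should serve users whose home office is `home`.

    The home site is used while it is healthy. Otherwise the healthy site
    with the lowest priority value takes over. Returns None if no site
    is healthy.
    """
    by_name = {s.name: s for s in sites}
    if by_name[home].healthy:
        return home
    candidates = [s for s in sites if s.healthy and s.name != home]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.priority).name
```

In this sketch the failover order is a static priority; a real deployment might instead weigh latency or regional proximity.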

Remote Office Core Services Configuration

Resiliency is also an integral part of our switch cabling standard. Two converged switches operating as a single network appliance provide the resiliency and network capacity required to power this solution. The DMZ and internal trusted traffic are isolated both logically and physically. Multiple physical uplinks are incorporated into the solution to route the traffic, thus preventing the traffic from “bleeding” into trusted or untrusted spaces. The solution is redundant because each appliance splits its physical connections across the two converged switches.
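
A cabling standard like this lends itself to a simple automated check. The sketch below assumes a hypothetical inventory that maps each node to the switches its uplinks terminate on; the switch and node names are placeholders, not our actual naming.

```python
def nodes_without_redundancy(cabling, switches=("switch-a", "switch-b")):
    """Return the nodes that are not cabled to every switch in the pair.

    `cabling` maps a node name to the list of switch names its physical
    uplinks terminate on (hypothetical inventory format).
    """
    required = set(switches)
    return sorted(
        node for node, links in cabling.items()
        if not required <= set(links)  # missing at least one switch
    )
```

An empty result means every node would keep a path to the network if either switch failed.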

In the event of a switch failure, all traffic is routed to the second switch in the stack. This switchover occurs within microseconds. Host/node traffic is then handled by the second 10G connection. When a host/node fails, VMware High Availability (HA) and VMware Distributed Resource Scheduler (DRS) take action to perform an automatic failover with load balancing; the impact of the host/node failure is the same as on any traditional cluster. VMs restart on other healthy hosts/nodes within the cluster. The advantage is that the converged platform reduces both the cost and the time to recover.
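
To make the host-failure case concrete, here is a deliberately simplified sketch of restarting displaced VMs on the surviving nodes. It is not how vSphere HA or DRS actually place workloads; it is a greedy illustration that considers memory only, and every name and size in it is hypothetical.

```python
def restart_vms(hosts, vms, failed):
    """Greedy placement of VMs from a failed host onto healthy hosts.

    hosts:  dict of host name -> memory capacity in GB
    vms:    dict of VM name -> (current host, memory in GB)
    failed: name of the failed host
    Returns a dict of displaced VM -> new host, or None if the VM
    cannot be placed anywhere.
    """
    # Free memory on each surviving host before any restarts.
    free = {h: cap for h, cap in hosts.items() if h != failed}
    for host, mem in vms.values():
        if host != failed:
            free[host] -= mem

    # Place the largest displaced VMs first.
    displaced = sorted(
        (vm for vm, (host, _) in vms.items() if host == failed),
        key=lambda vm: -vms[vm][1],
    )
    placement = {}
    for vm in displaced:
        mem = vms[vm][1]
        target = max(free, key=free.get) if free else None
        if target is None or free[target] < mem:
            placement[vm] = None  # no surviving host has enough memory
        else:
            free[target] -= mem
            placement[vm] = target
    return placement
```

A None entry in the result signals that the cluster lacks the spare capacity to absorb the failure, which is the situation capacity planning is meant to prevent.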

Lessons Learned

We will complete our first wave of installations this year; in 18 months we will begin the update process again. We’ve seen several benefits:

With five years of support instead of the previous three, we were able to implement a formal lifecycle management cadence.
By planning on 20% YOY capacity growth, we have been able to avoid exceeding our disk/memory/CPU capacity and the resulting painful upgrades.
Rigorous testing and scheduled rollouts ensure all sites are running on the same software versions, making centralized support much simpler.
By using the latest supported operating environment, we have been able to streamline patching against known threats and vulnerabilities.
Our strategy is flexible enough to handle adjustments, such as an expansion in core infrastructure services at one of our R&D hubs.
Our architecture will make it easy to deploy NSX for application-level security through micro-segmentation.
When we encounter HCI technology issues, we share them with Dell EMC and VMware for future product improvement.
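
The 20% YOY planning figure mentioned above is easy to turn into arithmetic. The sketch below assumes growth compounds annually, which is our reading of "YOY" rather than a documented formula, and the function names and sample numbers are illustrative only.

```python
def projected_usage(current, rate, years):
    """Usage after `years` of compound growth at `rate` (0.20 = 20%)."""
    return current * (1 + rate) ** years


def years_until_exceeded(current, capacity, rate):
    """Number of whole years until usage first exceeds capacity."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    years = 0
    usage = current
    while usage <= capacity:
        usage *= 1 + rate
        years += 1
    return years
```

Under this assumption, 20% annual growth sustained over a five-year support window multiplies usage by 1.2^5, roughly 2.5 times, so a cluster sized for today's load alone would run out well before the hardware is refreshed.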

Resiliency is the ability to protect our business colleagues from work disruptions while adapting to the constant change of IT. A global approach means that the core services at our remote sites are now much more robust and reliable. Our team can maintain acceptable service levels so that when issues inevitably arise, they are invisible to our users. More importantly, the user experience is consistent across different sites.

To learn why a resilient infrastructure is so important to VMware’s remote offices, read the blog Managing a Remote Office.

VMware on VMware blogs are written by IT subject matter experts sharing stories about our digital transformation using VMware products and services in a global production environment. Contact your sales rep or vmwonvmw@vmware.com to schedule a briefing on this topic. Visit the VMware on VMware microsite and follow us on Twitter.
