By: Chanh Chi (Senior Director, IT), Charlene Huang (Database Architect), Jino Jose (Lead Applications Database Administrator), Grant Nowell (Director, IT Infrastructure PMO & Compliance), Donald Philpott (Senior Manager, Global DC Metro & Backbone), Stephen Sheen (Manager, IT Infrastructure PMO) and Zaigui Wang (Senior Manager, Cloud Infrastructure)
VMware’s production workloads, including the majority of its mission-critical applications, ran out of a leased data center with low power density that was very expensive to maintain. Fortunately, a major technology refresh cycle provided an unprecedented opportunity to perform VMware’s largest migration of live workloads (4,000+ virtual machines supporting 400+ lines of business) to a completely different physical data center—an undertaking fraught with complexity and potentially significant negative impact if the move failed. It also required intricate coordination between application teams, infrastructure maintenance teams, and those handling the physical move of hardware.
Architected guidelines led the way
From the start, clear and actionable guiding principles were put in place. Ensuring minimal downtime required implementing an L2 network extension that leveraged VMware vSphere® vMotion® to move virtual machines (VMs) across VMware vCenter® instances, and storage array replication to move data between data centers.
A separate network link used exclusively for migrations was created to prevent throttling of existing production traffic. Parallel tracks for capacity and physical moves were also put in place, as was a network cutover implementation plan. Application moves ran on discrete parallel tracks to enable efficient IT planning and execution and to minimize business impact.
A program everyone tuned into
The program approach consisted of creating a leadership and execution committee, monitoring and control protocols, and associated strategies. The committee was tasked with ensuring any and all issues were dealt with in real time, with buy-in from all applicable stakeholders. Monitoring and control were tracked by an in-house configuration management database (CMDB) tightly integrated with the production application and infrastructure landscapes. The communication strategy maintained consistent and cohesive communications, as well as monitoring of major reports (network utilization, vCenter VM migration vs. decommissioned VMs, stretched-network migration progress, and percentage-complete stages). Finally, the execution strategy encompassed management of all tracks in parallel in order to enable all teams to work as independently as feasible.
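The progress reports described above lend themselves to a simple script. Below is a minimal sketch, in Python, of how a percentage-complete figure and the migrated-vs.-decommissioned counts could be derived from a CMDB export; the `VMRecord` class and its status values are illustrative assumptions, not the actual CMDB schema.

```python
from dataclasses import dataclass

@dataclass
class VMRecord:
    name: str
    status: str  # illustrative states: "pending", "migrated", "decommissioned"

def progress_report(vms):
    """Summarize migration progress the way the weekly reports did:
    migrated vs. decommissioned counts plus a percentage-complete figure."""
    migrated = sum(1 for v in vms if v.status == "migrated")
    decommissioned = sum(1 for v in vms if v.status == "decommissioned")
    done = migrated + decommissioned  # either way, the VM leaves the old DC
    pct = 100.0 * done / len(vms) if vms else 0.0
    return {"migrated": migrated,
            "decommissioned": decommissioned,
            "percent_complete": round(pct, 1)}
```

Feeding a report like this from the CMDB on a schedule is one way to keep the leadership committee and all stakeholders looking at the same numbers.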
Infrastructure preparation involved auditing and matching existing compute capacity in the new data center (to ensure ever-changing demands could be easily met), refreshing old hardware with updated versions (this represented the majority of hardware employed), and repurposing existing hardware from the old data center.
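A capacity audit like the one above can be sketched as a comparison of aggregate resources, with headroom for the ever-changing demands mentioned. The field names and the 20% headroom factor below are assumptions for illustration, not figures from the actual migration:

```python
def capacity_gap(old_clusters, new_clusters, headroom=1.2):
    """Compare aggregate capacity of the old data center's clusters against
    the new one's, requiring extra headroom for growth. Each cluster is a
    dict with hypothetical cpu_ghz / mem_gb totals; a positive result means
    a shortfall in the new data center."""
    need = {k: sum(c[k] for c in old_clusters) * headroom
            for k in ("cpu_ghz", "mem_gb")}
    have = {k: sum(c[k] for c in new_clusters)
            for k in ("cpu_ghz", "mem_gb")}
    return {k: round(need[k] - have[k], 1) for k in need}
```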
As mentioned, data migration was accomplished using network extension (mandatory for any live workload), storage vMotion, and replication. Some workloads could not be migrated live, although workarounds with minimal impact were implemented. SAN-based replication was employed to replicate Raw Device Mapping (RDM) hardware.
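The per-workload decision just described can be illustrated as a small decision function. The field names and method labels below are hypothetical, not VMware API values:

```python
def pick_method(vm: dict) -> str:
    """Choose a migration path per workload, following the rules described
    above: RDM disks go through SAN-based replication, live workloads ride
    the L2 extension via vMotion, and the rest take a low-impact workaround."""
    if vm.get("has_rdm"):
        return "san-replication"        # RDM replicated at the array level
    if vm.get("must_stay_live"):
        return "cross-vcenter-vmotion"  # live move over the network extension
    return "scheduled-workaround"       # brief, minimal-impact cutover
```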
We also used cross-vCenter vMotion and HCX (Hybrid Cloud Extension) to assist with the migrations.
Since we were migrating to a new data center that was managed by a new set of vCenter Servers and NSX Managers, we had to ensure our NSX micro-segmentation firewall rules would not be lost through the migration process. To achieve this, we exported the existing rules from the old environment and imported them into the new environment before migrating the virtual machines. Prior to migrating, we also updated existing rules to ensure the applications would not break when VMs were incrementally migrated over to the new data center.
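The rule-update step can be illustrated with a small sketch. Real NSX rule exports are much richer than this; the JSON shape and the `applied_to` field below are simplified assumptions used only to show the idea of widening a rule's scope so it keeps working while VMs straddle both data centers:

```python
import json

def widen_rules(rules_json: str, old_scope: str, both_scope: str) -> str:
    """Rewrite exported firewall rules (plain JSON here; the real export came
    from NSX Manager) so each rule scoped to the old data center applies to
    both sites during the incremental migration."""
    rules = json.loads(rules_json)
    for rule in rules:
        if rule.get("applied_to") == old_scope:
            rule["applied_to"] = both_scope
    return json.dumps(rules)
```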
Disaster recovery and business continuity—the most important considerations—involved adherence to an extensive checklist for each application migrated.
Network infrastructure challenges
Challenges in this area included maintaining the IP addressing scheme, providing L2 network stretched access between data centers, not changing the existing L2 VLAN numbering design for migrated workloads, and offering backup for the primary L2 extension. These were remedied through various hardware and software solutions that will be discussed in another blog.
Application and database migration
vMotion ensured that mission-critical applications and databases would not suffer any downtime, although the size of an individual app or database affected response times. The overall migration was grouped into five batches, each containing a different set of applications. Less critical and independent applications were migrated first to reduce network chatter between data centers. The goal was to migrate all application components—vMotion of databases and applications, cutover of dependent VIPs, plus network-attached storage (NAS), Network File System (NFS) and Common Internet File System (CIFS) shares—with minimal downtime, if any. A staging environment allowed the application stack's behavior and performance to be tested before migrating production, ensuring the least possible impact.
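The batching strategy above (least critical, least connected applications first, in five waves) can be sketched as a simple planner. The `criticality` and `dependencies` scores below are hypothetical inputs, standing in for whatever ranking the application teams actually used:

```python
def plan_batches(apps, n_batches=5):
    """Group applications into migration batches, lowest criticality and
    fewest cross-DC dependencies first, so early waves carry the least risk
    and generate the least inter-datacenter chatter."""
    if not apps:
        return []
    ranked = sorted(apps, key=lambda a: (a["criticality"], a["dependencies"]))
    size = -(-len(ranked) // n_batches)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```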
We decided to move to 25 Gbps NICs in the new data center to increase network density. This worked well, but we began finding packet drops that affected some sensitive applications. Updating the NIC firmware and drivers to the newest versions stabilized the situation.
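Spotting packet drops like these usually starts with NIC statistics such as the output of `ethtool -S`. Below is a hedged sketch that scans such output for non-zero drop or discard counters; counter names vary by NIC driver, so the substring match is an assumption rather than a universal rule:

```python
def dropped_counters(ethtool_stats: str) -> dict:
    """Parse `ethtool -S <nic>`-style text and return any non-zero counters
    whose names suggest dropped or discarded packets."""
    drops = {}
    for line in ethtool_stats.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        if ("drop" in name or "discard" in name) and value.strip():
            drops[name] = int(value.strip())
    return {k: v for k, v in drops.items() if v > 0}
```

Run periodically against each host's NICs, a check like this would have surfaced the drop counters before the sensitive applications noticed.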
The migration was an unqualified success, with a remarkably smooth production cutover. Credit goes to a thoroughly planned/designed migration approach, seamless team collaboration, and multiple testing runs in non-production environments to finalize the ultimate live approach.
VMware on VMware blogs are written by IT subject matter experts sharing stories about IT’s transformation journey using VMware products and services in a global production environment. Visit our portal to learn more.