IT Thought Leadership

How VMware IT Redefined Mission-Critical Day 2 Operations, and Consistently Delivered Exceptional Colleague Experiences

by: VMware Senior Director, IT Chanh Chi; VMware Lead Cloud Infrastructure Administrator Mohammed Kajamoideen; VMware Director, Cloud Infrastructure Operations Zaigui Wang

VMware IT employs VMware vSAN™, enterprise-class storage virtualization software, for lifecycle management of regional deployments such as VMware Horizon® 7 and branch office footprints, in addition to main production data centers.

While Day 0 and Day 1 are important, Day 2 operations are a consistent mission-critical concern. That’s why VMware IT developed a variety of vSAN best practices to ensure exceptional colleague (end user) experiences 24/7, anywhere and anytime.

We started by keeping hosts consistent: the same VMware ESXi™ version, firmware, drivers, and similar configurations on every host. Mixed versions were acceptable during upgrades, and everything was monitored via VMware vCenter®, VMware vSphere® PowerCLI™, and VMware vRealize® Operations™ dashboards.
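The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling VMware IT used: the host names, attribute keys, version strings, and the `find_drift` helper are all hypothetical, and the inventory is assumed to have been exported already (for example from vCenter or PowerCLI).

```python
from collections import Counter

def find_drift(hosts, keys=("esxi_version", "firmware", "driver")):
    """Report hosts whose attributes differ from the cluster majority.

    Returns {host: {key: (actual_value, majority_value)}}.
    """
    drift = {}
    for key in keys:
        # Treat the most common value in the cluster as the baseline.
        counts = Counter(attrs[key] for attrs in hosts.values())
        baseline = counts.most_common(1)[0][0]
        for name, attrs in hosts.items():
            if attrs[key] != baseline:
                drift.setdefault(name, {})[key] = (attrs[key], baseline)
    return drift

# Hypothetical inventory; real values would come from an export.
hosts = {
    "esx01": {"esxi_version": "6.7.0-U3", "firmware": "2.1", "driver": "1.8.5"},
    "esx02": {"esxi_version": "6.7.0-U3", "firmware": "2.1", "driver": "1.8.5"},
    "esx03": {"esxi_version": "6.7.0-U3", "firmware": "2.1", "driver": "1.8.1"},
    "esx04": {"esxi_version": "6.7.0-U3", "firmware": "2.1", "driver": "1.8.5"},
}

for host, diffs in find_drift(hosts).items():
    for key, (actual, expected) in diffs.items():
        print(f"{host}: {key} is {actual}, cluster baseline is {expected}")
```

Using the majority value as the baseline keeps the sketch independent of any particular target version. During an upgrade, when mixed versions are expected, the output would simply list the hosts that have not yet been moved.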

Figure: A sample dashboard (screenshot of a vSAN readout)

Superior networking

For networking, we implemented a variety of advanced measures. Dedicated and isolated VMkernel ports significantly improved performance, uptime, and security. This involved a dedicated virtual LAN (VLAN) and a separate network interface controller (NIC) for different traffic types. Jumbo frames were enabled where possible. A vSphere Distributed Switch™ (VDS), included with vSAN licenses, makes full use of Network I/O Control. VMXNET3 paravirtualized adapters (enabled with VMware Tools) increased performance and I/O throughput, and receive side scaling (RSS) can also be enabled, or its buffers increased, if required. And we only used NICs listed on the hardware compatibility list (HCL).
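Jumbo frames only help when every hop on the path carries the larger MTU, so it is worth verifying them end to end. The sketch below shows the arithmetic behind a common test, assuming IPv4 without header options: a don't-fragment ping must carry a payload of the MTU minus the IP and ICMP headers. The function names and the sample path are illustrative assumptions, not part of the original deployment.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header

def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} is too small to carry a payload")
    return mtu - overhead

def mtu_mismatches(path_mtus, expected=9000):
    """Return the hops whose MTU is below the expected value."""
    return {hop: mtu for hop, mtu in path_mtus.items() if mtu < expected}

# Hypothetical path: one switch port was left at the default MTU.
path = {"vmk1": 9000, "vds-uplink": 9000, "tor-switch-port": 1500}

print(max_ping_payload(9000))  # payload size for a jumbo-frame test
print(mtu_mismatches(path))
```

For a 9000-byte MTU the payload works out to 8972 bytes, which is the size typically passed to a don't-fragment ping (for example `vmkping -d -s 8972` against the remote VMkernel address) when checking a jumbo-frame path.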

State-of-the-art monitoring and troubleshooting

Monitoring and troubleshooting vSAN clusters became simpler and much more efficient. The system now offers real-time health status and guidance, checking for such things as hardware compatibility, network connectivity and settings, physical device health and firmware, and vSAN build recommendations. It also receives continuous updates without the need to upgrade vSAN itself.

vSAN Support Insight provides anonymized data to VMware Global Support Services (GSS), which enables real-time identification of conditions, reduced data-gathering effort, an improved reactive support experience, and proactive support capabilities. Non-performance-related issues are handled in a similar manner.

Challenges met (and remedied)

There were challenges along the way, notably when recovering virtual machines (VMs) that had virtual machine disks (VMDKs) excluded from vSphere Replication™. This can be overcome by manually mapping the excluded disk at the Site Recovery Manager (SRM) level. Note that VM replication is paused while a disk is added or removed. By leveraging vCenter alarms, admins can generate email alerts to capture and monitor state changes in VM replication.

Good lessons for today . . . and tomorrow

We also learned a number of lessons that we are applying to future deployments. Thin swap is enabled by default at the cluster level on vSphere 6.7 U1 and above, which saves substantial space on the vSAN datastore. (On vSAN clusters running versions earlier than 6.7 U1, the setting must be configured manually on each ESXi host.) Ensuring network stability is vital, because an unstable network impedes performance. Only HCL-compatible NICs should be used, and employing LACP can boost performance as well as improve bandwidth and network adapter redundancy.

We also discovered that maintenance windows grow, because vSAN increases ESXi maintenance time and ESXi reboots take longer after an upgrade. Because the preferred RAID-5/RAID-6 erasure coding is available only with all-flash configurations, all-flash is recommended over hybrid. It consumes less space and delivers improved, highly predictable, and uniform overall performance. If NSX-T is to be deployed, VxRail must be ordered with a minimum of four NICs. And at least 30% of capacity should be kept unused to handle any unforeseen failure.
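To see how the 30% free-space rule and the choice of erasure coding interact, a rough capacity estimate can help. The sketch below is an approximation under stated assumptions: it uses the commonly cited vSAN layouts (RAID-1 mirroring with one extra copy, RAID-5 as 3 data + 1 parity, RAID-6 as 4 data + 2 parity) and ignores deduplication, compression, and metadata overhead. The `SCHEMES` table and `usable_capacity_tb` function are illustrative names, not a VMware API.

```python
# (data components, redundancy components) per protection scheme.
# Assumed layouts; check the vSAN documentation for your release.
SCHEMES = {
    "RAID-1 (FTT=1)": (1, 1),
    "RAID-5 (FTT=1)": (3, 1),
    "RAID-6 (FTT=2)": (4, 2),
}

def usable_capacity_tb(raw_tb, scheme, slack_fraction=0.30):
    """Estimate VM-usable capacity after reserving slack space.

    raw_tb: raw datastore capacity in TB.
    slack_fraction: share of capacity kept unused (0.30 = 30%).
    """
    if not 0 <= slack_fraction < 1:
        raise ValueError("slack_fraction must be in [0, 1)")
    data, redundancy = SCHEMES[scheme]
    consumable = raw_tb * (1 - slack_fraction)
    return consumable * data / (data + redundancy)

for scheme in SCHEMES:
    print(f"{scheme}: {usable_capacity_tb(100, scheme):.1f} TB usable of 100 TB raw")
```

Under these assumptions, 100 TB of raw capacity yields roughly 35 TB of usable space with RAID-1, about 52.5 TB with RAID-5, and about 46.7 TB with RAID-6, which illustrates the space savings that erasure coding offers on all-flash clusters.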

Combined, the technologies and lessons learned have turned ongoing VMware Day 2 operations into a seamless, efficient, and remarkably hassle-free undertaking.

VMware on VMware blogs are written by IT subject matter experts sharing stories about our digital transformation using VMware products and services in a global production environment. Contact your sales rep or [email protected] to schedule a briefing on this topic. Visit the VMware on VMware microsite and follow us on Twitter.