Many organizations today have a ROBO environment with local IT infrastructure. These remote locations usually have anywhere from a few servers running a few workloads to support local needs, to numerous servers spanning a large-scale datacenter. The distributed and remote nature of this infrastructure makes it hard to manage, difficult to protect, and costly to maintain. Further, the remote nature of servers makes it more challenging to perform important VM/host-related operations.
vSphere is designed to address these ROBO use cases, including IT infrastructure located in remote, distributed sites. VMware vCenter Server provides a centralized way to control and monitor the virtual infrastructure, including ESXi hosts, virtual machines, storage, and networking resources. It has been widely deployed in a ROBO environment to manage ESXi hosts that are distributed over large geographical distances over a wide range of networks with different network characteristics, including low/high bandwidth, network latency, and packet error rates. In the paper, we test:
LAN with high-bandwidth and low-latency links.
WAN with low-bandwidth and high-latency links.
Various networks in between; for example, DSL, T1, 4G, 5G, …
We demonstrate that vCenter Server performs well in the ROBO environment for both network bandwidth use, as well as virtual machine and ESXi host task execution times. Instead of a bandwidth restriction, we observe that network latency has a bigger impact on the overall performance. As the network latency between vCenter Server and ESXi hosts increases, the average operation latency also increases. The experimental results also show how efficiently vCenter Server executes VM operations in high-latency networks: The average VM operation execution time increases much more slowly when network latency increases by several times.
vSphere 5.1 introduced an inventory tagging feature that has been available in all later versions of vSphere, including vSphere 6.7. Tags let datacenter administrators organize different vSphere objects like datastores, virtual machines, hosts, and so on. This makes it easier to sort and search for objects that share a tag, among other things. For example, you might use tags to track a group of VMs that all have the same operating system.
Writing code to use tags can be challenging in large-scale environments: a straightforward use of VMware PowerCLI cmdlets may result in poor performance, and while direct Tagging Service APIs are faster, the documentation can be difficult to understand. In this blog, we show some practical examples of using PowerCLI and Tagging Service APIs to perform tag-related operations. We include some simple measurements to show the performance improvements when using the Tagging Service vs. cmdlets. The sample performance numbers are for illustrative purposes only. We describe the test setup in the Appendix.
1. Connecting to PowerCLI and the Tagging Service
In this document, when we write “PowerCLI cmdlets,” we mean calls like Get-Tag, or Get-TagCategory. To access this API, simply open a PowerShell terminal and log in:
Connect-VIServer <vCenter server IP or FQDN> -User <username> -Pass <password>
Along with the recent release of VMware vSphere 6.7 U2, we published a new whitepaper that shows the performance of a new scheduler option that was included in the 6.7 U2 update. We referred to this new scheduler option internally as the “sibling” scheduler, but the official name is the side-channel aware scheduler version 2, or SCAv2. The whitepaper includes full details about SCAv1 and SCAv2, the L1TF security vulnerability that made them necessary, and the performance implications with several different workload types. This blog is a brief overview of the key points, but we recommend that you check out the full document.
In August of 2018, a security vulnerability known as L1TF, affecting systems using Intel processors, was revealed, and patches and remediations were also made available. Intel provided micro-code updates for its processors, operating system patches were made available, and VMware provided an update for vSphere. The full details of the vCenter and ESXi patches are in a VMware security advisory that links to individual KB articles.
VMware Cloud on AWS is a hybrid cloud service that runs the VMware software-defined data center (SDDC) stack in the Amazon Web Services (AWS) public cloud. The service automatically provisions and deploys a vSphere environment on a bare-metal AWS infrastructure, and lets you run your applications in a hybrid IT environment across your on-premises data centers and AWS global infrastructure. A key benefit of VMware Cloud on AWS is the ability to vMotion workloads back and forth from your on-premises data center to the AWS public cloud as capacity and data privacy require.
In this blog post, we share the results of our vMotion performance tests across our hybrid cloud environment that consisted of a vSphere on-premises data center located in Wenatchee, Washington and an SDDC hosted in an AWS cloud, in various scenarios including hybrid migration of a database server. We also describe the best practices to follow when migrating virtual machines by vMotion across hybrid cloud.
The vSAN Performance Diagnostics feature, which helps customers to optimize their benchmarks or their vSAN configurations to achieve the best possible performance, was first introduced in vSphere 6.5 U1. vSAN Performance Diagnostics is a “cloud connected” feature and requires participation in the VMware Customer Experience Improvement Program (CEIP). Performance metrics and data are collected from the vSAN cluster and are sent to the VMware Cloud. The data is analyzed and the results are sent back for display in the vCenter Client. These results are shown as performance issues, where each issue includes a problem with its description and a link to a KB article.
In this blog, we describe how vSAN Performance Diagnostics can be used with HCIBench and show the new feature in vSphere 6.7 U1 that provides HCIBench specific issues and recommendations.
What is HCIBench?
HCIBench (Hyper-converged Infrastructure Benchmark) is a standard benchmark that vSAN customers can use to evaluate the performance of their vSAN systems. HCIBench is an automation wrapper around the popular and proven VDbench open source benchmark tool that makes it easier to automate testing across an HCI cluster. HCIBench, available as a fling, simplifies and accelerates customer performance testing in a consistent and controlled way.
Virtual machine (VM) provisioning operations such as create, clone, and relocate involve the placement of storage resources. Storage DRS (sometimes seen as “SDRS”) is the resource management component in vSphere responsible for optimal storage placement and load balancing recommendations in the datastore cluster.
A key contributor to VM provisioning times in Storage DRS-enabled environments is the time it takes (latency) to receive placement recommendations for the VM disks (VMDKs). This latency particularly comes into play when multiple VM provisioning requests are issued concurrently.
Several changes were made in vSphere 6.7 to improve the time to generate placement recommendations for provisioning operations. Specifically, the level of parallelism was improved for the case where there are no storage reservations for VMDKs. This resulted in significant improvements in recommendation times when there are concurrent provisioning requests.
vRealize automation suite users who use blueprints to deploy large numbers of VMs quickly will notice the improvement in provisioning times for the case when no reservations are used.
Several performance optimizations were further made inside key steps of processing the Storage DRS recommendations. This improved the time to generate recommendations, even for standalone provisioning requests with or without reservations.
PbmCheckCompliance is automatically invoked soon after provisioning operations such as creating, cloning, and relocating a VM. It is also automatically triggered in the background once every 8 hours to help keep the compliance records up-to-date.
A new paper describes the DRS enhancements in vSphere 6.7, which include new initial placement, host maintenance mode enhancements, DRS support for non-volatile memory (NVM), and enhanced resource pool reservations.
Resource pool and VM entitlements—old and new models
A summary of the improvements follows:
DRS in vSphere 6.7 can now take advantage of the much faster placement and more accurate recommendations for all DRS configurations. vSphere 6.5 did not include support for some configurations like VMs that had fault tolerance (FT) enabled, among others.
Starting with vSphere 6.7, DRS uses the new initial placement algorithm to come up with the recommended list of hosts to be placed in maintenance mode. Further, when evacuating the hosts, DRS uses the new initial placement algorithm to find new destination hosts for outgoing VMs.
DRS in vSphere 6.7 can handle VMs running on next generation persistent memory devices, also known as Non-Volatile Memory (NVM) devices.
There is a new two-pass algorithm that allocates a resource pool’s resource reservation
to its children (also known as divvying).
PerfPsychic our AI-based performance analyzing tool, enhances its accuracy rate from 21% to 91% with more data and training when debugging vSAN performance issues. What is better, PerfPsychic can continuously improve itself and the tuning procedure is automated. Let’s examine how we achieve this in the following sections.
How to Improve AI Model Accuracy
Three elements have huge impacts on the training results for deep learning models: amount of high-quality training data, reasonably configured hyperparameters that are used to control the training process, and sufficient but acceptable training time. In the following examples, we use the same training and testing dataset as we presented in our previous blog.
We in VMware’s Performance team create and maintain various tools to help troubleshoot customer issues—of these, there is a new one that allows us to more quickly determine storage problems from vast log data using artificial intelligence. What used to take us days, now takes seconds. PerfPsychic analyzes storage system performance and finds performance bottlenecks using deep learning algorithms.
Let’s examine the benefit artificial intelligence (AI) models in PerfPsychic bring when we troubleshoot vSAN performance issues. It takes our trained AI module less than 1 second to analyze a vSAN log and to pinpoint performance bottlenecks at an accuracy rate of more than 91%. In contrast, when analyzed manually, an SR ticket on vSAN takes a seasoned performance engineer about one week to deescalate, while the durations range from 3 days to 14 days. Moreover, AI also wins over traditional analyzing algorithms by enhancing the accuracy rate from around 80% to more than 90%.