
Tag Archives: vsphere

VDI Capacity Planning with View Planner

By Mohit Mangal, Atul Pandey, and Alok Sen

Virtual desktop infrastructure (VDI) deployments are on the rise due to employees working from home all over the world. VDI users can work remotely from anywhere and on any device—like tablets, thin clients, and laptops—while accessing remote desktops hosted on private data centers and in the cloud.

Good user experience is a key requirement for a successful VDI deployment. With the typical deployment ranging from hundreds to thousands of remote desktops, measuring and optimizing the user experience is challenging because of complex architecture, hardware requirements, OS and application variations, various display protocols, and different performance requirements of various users.

View Planner Measures the Performance of VDI-User Workloads

VMware View Planner is a benchmark and capacity planning tool designed to simulate real-world workloads in a large-scale VDI environment. View Planner captures the user experience of simulated VDI users at the client side and measures latencies to identify infrastructure problems of those virtual desktops. The benchmark is scalable from a few virtual machines running on one host, to thousands of virtual machines distributed across a cluster of hosts.

Workloads and Work Profiles

View Planner application automations are called workloads. You can choose to measure the performance of applications like Google Chrome, MS Word, MS Excel, and Adobe Reader, as well as other user actions such as moving files. You can also develop your own workloads.

A group of workloads make up a work profile in View Planner. You can either select one of the standard work profiles or create your own based on the applications you want to test. You can find details about work profiles at Understanding VMware View Planner Work Profiles.

Run Modes

View Planner supports local, remote, and passive-remote run modes for testing. Remote and passive-remote run modes use client VMs to connect with multiple VDI desktops. Local mode generates the workload directly on the desktop VMs. You can find more details about run modes at VMware View Planner Operation.

Reports

View Planner generates a PDF report after completing a test. The report includes quality-of-service (QoS) data, application operation latencies, and resource usage of the hosts involved in the test. This data can then be used to optimize the VDI deployment. You can find more details about reports at Understanding VMware View Reports.

We Experiment to Find the Optimal Number of VMs to Run on the vSphere Host

By running the standard View Planner workload against sets of VMs on a single vSphere host, we found the maximum number of VMs the host could support—this is known as VM consolidation. The results showed that 130 VMs could run on our host while workload performance stayed within View Planner’s quality-of-service (QoS) rules and host resource usage (memory, CPU, and storage) remained at acceptable levels.

Testbed Setup

For this test, we used a 2-socket node with the following configuration:

  • Processor type: Intel Xeon Gold 6240 @ 2.60GHz
  • Logical processors: 72
  • Memory: 766.46 GB
  • Storage: 2.91 TB
  • vSphere/ESXi version: 6.7.0
  • View Planner: 4.4

We allocated the following resources to each VM:

  • vCPUs: 2
  • Memory: 4 GB
  • Storage: 50 GB

We decided to perform a remote-mode test because it simulates a real user experience. We used Blast as our display protocol (View Planner also supports PCoIP and RDP). We also went with the Standard_Profile work profile, which includes typical applications a knowledge worker would use: Adobe Reader, Google Chrome, Microsoft Office, and Windows Media Player.


We selected and monitored several key parameters to ensure good user experience and system performance with optimal resource usage.

  • CPU-sensitive operations (Group A): < 1 second
  • Storage-sensitive operations (Group B): < 6 seconds
  • Ratio of actual-to-expected operations (O/E ratio): > 0.9
  • Discarded desktop count: < 2%
  • Memory usage of any desktop host: < 90%
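
If you post-process run data outside the PDF report, these pass/fail checks are simple to encode. Below is a minimal Python sketch; the example values and field names are ours, not View Planner's output format:

  # Hypothetical summary of one View Planner run; View Planner reports these
  # metrics in its PDF report, not in this format.
  run = {
      "group_a_latency_s": 0.8,      # CPU-sensitive operations (Group A)
      "group_b_latency_s": 5.2,      # storage-sensitive operations (Group B)
      "oe_ratio": 0.95,              # actual-to-expected operations
      "discarded_desktop_pct": 1.0,  # desktops discarded from the run
      "host_memory_usage_pct": 72.0, # memory usage of the desktop host
  }

  checks = [
      ("Group A latency < 1 s",   run["group_a_latency_s"] < 1.0),
      ("Group B latency < 6 s",   run["group_b_latency_s"] < 6.0),
      ("O/E ratio > 0.9",         run["oe_ratio"] > 0.9),
      ("Discarded desktops < 2%", run["discarded_desktop_pct"] < 2.0),
      ("Host memory < 90%",       run["host_memory_usage_pct"] < 90.0),
  ]

  for name, passed in checks:
      print(f"{name}: {'PASS' if passed else 'FAIL'}")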


View Planner Installation and Configuration

In the following sections, we explain the steps we performed for the benchmark setup and experiments. For more details about this process, see the VMware View Planner Documentation.

Step 0: Prerequisites

We created a VDI environment on a host node that included vCenter, Horizon View, and Active Directory with SSL enabled.

We also added two hosts to our vCenter: one for the desktop VMs and another for the client VMs.

Step 1: We installed View Planner

View Planner has two main components:

  • The View Planner harness is an OVA file.
  • The View Planner agent is a Windows installer file.

The View Planner harness is the controller that drives the benchmark, while the View Planner agent is installed on the desktops to generate load. View Planner components can be downloaded from the product page.

Step 2: We set up and configured the View Planner harness

We deployed the View Planner harness OVA file using the vSphere Client of vCenter Server. Once the OVA was deployed and powered on, the View Planner UI was accessible at http://harness-ip/vp-ui. Below is a snapshot of the View Planner home page after a successful login. Notice the squares on the left side; we call these tabs.


Figure 1. This is what the View Planner home page looks like after you successfully log in. The RUN tab is selected.

Step 3: We configured the VDI environment

Next, we selected SERVERS on the left menu and configured the VDI environment details using the steps in Configuring the View Planner Harness. We added our vCenter as an Infra Server, Microsoft Active Directory as the Identity Server, and Horizon View as the VDI Server. Adding the Identity Server and VDI Server details is required only for remote and passive-remote mode tests. Once added, we used the test button to validate the server configuration.


Figure 2. We configured Add Infra Server.


Figure 3. We added the Identity Server.


Figure 4. We added the VDI Server.

Step 4: We created some vSphere virtual machines

We created 144 desktops and 144 client virtual machines. We used the desktop virtual machines to automate the application workload and the client virtual machines to simulate real users by initiating desktop sessions using the Horizon Client.

Step 5: We set up the desktop virtual machines

We created a desktop parent VM using the steps given in Setting Up the View Planner Windows Desktop Virtual Machines. We validated the setup by rebooting the VM and confirming that the View Planner agent command prompt was up after autologin and there was no error.
We used the desktop parent VM to create a Horizon pool of 144 VMs on the desktop host. We created instant clones for our tests. Note that full and linked clones are also supported.

Step 6: We set up the client virtual machines

We created a client-parent VM using steps given in Setting Up the View Planner Windows Client Virtual Machines. We validated the setup by rebooting the VM and confirming that the View Planner agent command prompt was up after autologin and there was no error.

Using the client-parent VM, we created 144 client VMs on the client host by following the steps in Creating Clones Using View Planner. Alternatively, you could create client VMs using VMware Horizon or manual cloning.

Step 7: We created a run profile

A run profile acts as a template for View Planner tests. We created a run profile with 5 iterations and used the Standard_Profile work profile because it includes all of the workloads required for our test. You can get a list of available work profiles in the WORK PROFILE tab on the left-side menu. You can create your own work profile if none of the available profiles matches your requirements.


Figure 5. We used the standard run profile.

Based on our experience, we started the first round of tests by allocating 4 VMs per core. Our host has 36 physical cores, so we came up with a VM count of 144. This number can be changed in subsequent tests based on the threshold criteria defined earlier.
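
As a quick sanity check before the first run, the starting VM count and its memory footprint can be worked out up front. The short sketch below just restates that arithmetic; the constant names are ours, not View Planner settings:

  PHYSICAL_CORES = 36    # 2 sockets x 18 cores (Intel Xeon Gold 6240)
  VMS_PER_CORE = 4       # starting density based on our experience
  VM_MEMORY_GB = 4       # memory allocated to each desktop VM
  HOST_MEMORY_GB = 766   # usable host memory

  vm_count = PHYSICAL_CORES * VMS_PER_CORE        # 144 desktops for the first run
  desktop_memory_gb = vm_count * VM_MEMORY_GB     # 576 GB committed to desktop VMs

  print(f"Starting VM count: {vm_count}")
  print(f"Desktop VM memory: {desktop_memory_gb} GB of {HOST_MEMORY_GB} GB host memory")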

You can find detailed steps for creating a new run profile at Creating a Run Profile.

Step 8. We ran View Planner

Finally, we started the 144 VM test run by selecting the RUN tab. A standard work-profile run takes approximately 120 minutes and can vary depending on the virtual machine count. You can also use the RUN tab to track live-run status during the test period. For details, see Starting a View Planner Run.


Figure 6. We ran the standard profile by clicking the RUN tab.

Run Results

When a run is complete, View Planner generates a PDF report, in which there are sections for quality of service (QoS) and host resource usage. QoS is reported separately for CPU-sensitive operations (Group A) and storage-sensitive operations (Group B). The upper acceptable threshold for Group A is 1 second and for Group B is 6 seconds.

The following screenshots show QoS and host resource usage from the report of our 144 VM run. From the Host Resource Usage section, we considered Average CPU Usage (85.06%) and Average Memory Usage (50.61%).


Table 1. Quality of service results for our run of 144 VMs


Table 2. Results of host resource usage for run of 144 VMs

The above screenshots show that the QoS of storage-sensitive operations exceeded the threshold limit, so we decreased the VM count to 130 in our run profile and performed another test run. The following screenshots show QoS and host resource usage from the report of our 130 VM run.


Table 3. View Planner run results of quality of service for 130 VMs


Table 4. Results of host resource usage for run of 130 VMs

Conclusion

Based on our tests, we can safely say that our single ESXi host can support 130 VMs when users simultaneously run and interact with typical office applications on these VMs. We kept a 17% margin in CPU Usage, which provided a buffer for peak times.

Contact Us

If you have any questions or want to know more, reach out to the VMware View Planner team at viewplanner-info@vmware.com. The View Planner team actively monitors this email address.

 

AMD 2nd Gen EPYC (Rome) Application Performance on vSphere Series: Part 3 – View Planner Virtual Desktops

By Rahul Garg, Milind Inamdar, and Qasim Ali

In the first two parts of this series, we looked at the performance of a SQL Server database using the DVD Store 3 benchmark and a mixed workload environment with the VMmark benchmark running on the AMD 2nd Gen EPYC processors and VMware vSphere. With the third part of the series, we look at virtual desktop performance on this platform with the VMware View Planner benchmark.

View Planner

View Planner is VMware’s internally developed benchmark for virtual desktops, or virtual desktop infrastructure (VDI). It simulates users doing common tasks with popular applications like word processors, spreadsheets, and web browsers.  View Planner measures how many virtual desktops can be run while maintaining response times within established limits.

Complete details about View Planner, including how to set up and run it, are documented in the VMware View Planner User Guide. We used version 4.3.


Figure 1. VMware View Planner 4.3 User Guide 

AMD EPYC

AMD 2nd Generation EPYC processors have an architecture and associated NUMA BIOS settings that we explored in the initial two posts in this series. Here, we look at how NPS and CCX as NUMA settings (defined below) affect the performance of a VDI environment with View Planner.

Recap: What are CCDs and CCXs?

In the previous AMD EPYC blogs, we included figures that show the AMD 2nd Generation EPYC processors have up to 8 Core Complex Dies (CCDs) connected via Infinity Fabric. Each CCD is made up of two Core Complexes (CCXs). For reference, we’ve included the following diagrams again.


Figure 2. Logical diagram of AMD 2nd Generation EPYC processor

Two Core Complexes (CCXs) comprise each CCD.  A CCX is up to four cores sharing an L3 cache, as shown in this additional logical diagram from AMD for a CCD, where the orange line separates the two CCXs.


Figure 3. Logical diagram of CCD/CCX

What is NUMA Per Socket?

NUMA Per Socket (NPS) is a BIOS setting on the ESXi host that allows the processor to be partitioned into multiple NUMA domains per socket. Each domain is a grouping of CCDs and their associated memory. In the default case of NPS1, memory is interleaved across all the CCDs as a single NUMA node. With NPS2, the CCDs and their memory are split into two groups. Finally, with NPS4, the processor is partitioned into 4 nodes with 2 CCDs each and their associated memory.
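
To make the grouping concrete, here is a small conceptual sketch (our own illustration, not a vendor tool) of how the 8 CCDs in one socket map to NUMA nodes for each NPS value:

  def nps_numa_nodes(nps, ccds_per_socket=8):
      """Group the CCD indices of one socket into NUMA nodes for NPS 1, 2, or 4."""
      per_node = ccds_per_socket // nps
      return [list(range(start, start + per_node))
              for start in range(0, ccds_per_socket, per_node)]

  print(nps_numa_nodes(1))  # [[0, 1, 2, 3, 4, 5, 6, 7]]       one node; memory interleaved across all CCDs
  print(nps_numa_nodes(2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]     two nodes of 4 CCDs each
  print(nps_numa_nodes(4))  # [[0, 1], [2, 3], [4, 5], [6, 7]] four nodes of 2 CCDs each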

What is CCX as NUMA?

CCX as NUMA presents each CCX as a NUMA domain. When enabled, this causes each AMD 2nd Generation EPYC processor used in our tests to have 16 NUMA nodes, for a total of 32 across both sockets.

Performance Tests

For this performance study, we used the following as the basis for all tests:

  • Two 2-socket systems with AMD EPYC 7742 processor, 2TB of memory, and 4.66TB of storage
  • Windows 10, version 1909: We installed this as the desktop VM and configured it with 2 vCPUs, 4GB of memory and 50GB of disk space.
  • VMware vSphere 6.7 U3
  • View Planner 4.3: Legacy Mode enabled.
  • Horizon View 7.11: We used Instant Clones for VDI setup. Refer to VMware Horizon 7 Instant-Clone Desktops and RDSH Servers for details.

For all tests, the number of desktop VMs was increased in successive test runs until we hit the quality-of-service (QoS) limit of View Planner.

We tested NPS1, 2, and 4, both with and without CCX as NUMA enabled. The default settings of NPS1 with CCX as NUMA disabled served as the baseline for the tests; results for all tests are shown relative to this baseline.

Because of the large amount of CPU capacity on this ESXi host (128 cores across two sockets), it could host a large number of Windows 10 desktops. With each desktop VM assigned 4 GB of RAM, the total amount of RAM assigned to VMs exceeded the 2 TB of physical RAM in the ESXi host (at 4 GB per VM, anything beyond 512 desktops overcommits the host). In other words, the system was memory overcommitted.

Which configurations achieve the most performant number of desktops within QoS?

The results below show (in green) that NPS1, NPS2, and NPS4 without CCX as NUMA all support a number of desktops within a couple of percent of each other. The results also show (in blue) about a 7% decline in the number of desktops supported when CCX as NUMA is enabled.


Figure 4. View Planner VM consolidation with AMD 2nd Gen EPYC with NPS and CCX as NUMA settings

We see this performance degradation when CCX as NUMA is enabled for the following reason: View Planner is a VDI workload that benefits from ESXi’s memory page sharing when the environment is memory overcommitted, as it was in these tests. Page sharing works in instant clones by sharing the memory of the running parent VM from which the instant clone is created. However, if the instant clone is running on a different NUMA node than the parent VM, page sharing may be lost. By default, ESXi shares pages only within a NUMA node because remote memory access is often costly. With CCX as NUMA, 32 NUMA nodes are exposed to ESXi, which restricts page sharing to much smaller nodes. The result is a significant loss of page-sharing savings, which leads to memory ballooning.

To confirm the reason for the ~7% performance degradation, we performed internal testing that allowed us to enable inter-NUMA page sharing. With this test configuration, there was no memory ballooning, and we gained the ~7% performance back.

So, what is the answer? The default is best.

The View Planner tests show that the ESXi hosts’ out-of-the-box settings of NPS 1 and CCX as NUMA disabled provide the best performance for VDI workloads on vSphere 6.7 U3 and AMD 2nd Generation EPYC processors.

Commentary about CCX as NUMA

CCX as NUMA exposes each CCX as its own NUMA domain to the hypervisor or operating system. This is a false representation of the NUMA architecture that is provided as an option to aid the hypervisor or OS scheduler in certain scenarios that are currently not highly optimized for the AMD architecture—that is, there are multiple Last Level Caches (LLCs) per NUMA node.

This false representation can skew certain assumptions the hypervisors or operating systems are built upon related to NUMA architecture (for example, memory page sharing as we described above). We believe once the hypervisor and OS schedulers are fully optimized, the option of CCX as NUMA may not be needed.


Performance Best Practices Guide for vSphere 7.0

We are pleased to announce the availability of Performance Best Practices for VMware vSphere 7.0. This is a comprehensive book designed to help system administrators obtain the best performance from their vSphere 7.0 deployments.

The book covers new features, and it updates and expands on topics covered in previous versions.

These topics include:
  • Persistent memory (PMem), including using PMem with NUMA and vNUMA
  • Getting the best performance from NVMe and NVMe-oF storage
  • AMD EPYC processor NUMA settings
  • Distributed Resource Scheduler (DRS) 2.0
  • Automatic space reclamation (UNMAP)
  • Host-Wide performance tuning (aka, “dense mode”)
  • Power management settings
  • Hardware-assisted virtualization
  • Storage hardware considerations
  • Network hardware considerations
  • Memory page sharing
  • Getting the best performance from iSCSI and NFS storage
  • vSphere virtual machine encryption recommendations
  • Running storage latency-sensitive workloads
  • Running network latency-sensitive workloads
  • Network I/O Control (NetIOC)
  • DirectPath I/O
  • Microsoft Virtualization-Based Security (VBS)
  • Selecting virtual network adapters
  • vCenter database considerations
  • The vSphere HTML5 Client
  • VMware vSphere Lifecycle Manager
  • VMware vSAN performance
The book can be found here.

 

AMD 2nd Gen EPYC (Rome) Application Performance on vSphere Series: Part 2 – VMmark

In recently published benchmarks with VMware VMmark, we’ve seen lots of great results with the AMD EPYC 7002 Series (known as 2nd Gen EPYC, or “Rome”) by several of our partners. These results show how well a mixed workload environment with many virtual machines and infrastructure operations like vMotion can perform with new server platforms.

This is the second part of our series covering application performance on VMware vSphere running with AMD 2nd Gen EPYC processors (also see Part 1 of this series). This post focuses on the VMmark benchmark used as an application workload.

We used the following hardware and software in our testbed:

  • AMD 2nd Gen EPYC processors (“Rome”)
  • Dell EMC XtremIO all-flash array
  • VMware vSphere 6.7 U3
  • VMware VMmark 3.1

VMmark

VMmark is a benchmark widely used to study virtualization performance. Many VMware partners have used this benchmark—from the initial release of VMmark 1.0 in 2006, up to the current release of VMmark 3.1 in 2019—to publish official results that you can find on the VMmark 3.x results site. This long history and large set of results can give you an understanding of performance on platforms you’re considering using in your datacenters. AMD EPYC 7002–based systems have done well and, in some cases, have established leadership results. At the publishing of this blog, VMmark results on AMD EPYC 7002 have been published by HPE, Dell | EMC, and Lenovo.

If you look through the details of these published results, you can find some interesting information, like disclosed configuration options or settings that might provide performance improvements to your environment. For example, in the details of a Dell | EMC result, you can find that the server BIOS setting NUMA Nodes Per Socket (NPS) was set to 4. And in the HPE submission, you can find that the server BIOS setting Last Level Cache (LLC) as NUMA Node was set to enabled. This is also referred to as CCX as NUMA because each CCX has its own LLC. (CCX is described in the following section.)

We put together a VMmark test setup using AMD EPYC Rome 2-socket servers to evaluate performance of these systems internally. This gave us the opportunity to see how the NPS and CCX as NUMA BIOS settings affected performance for this workload.

In the initial post of this series, we looked at the effect of the NPS BIOS settings on a database-specific workload and described the specifics of those settings. We also provided a brief overview and links to additional related resources. This post builds on that one, so we highly recommend reading the earlier post if you’re not already familiar with the BIOS NPS settings and the details for the AMD EPYC 7002 series.

AMD EPYC CCX as NUMA

Each EPYC processor is made up of up to 8 Core Complex Dies (CCDs) that are connected by AMD’s Infinity Fabric (figure 1). Inside each CCD, there are two Core Complexes (CCXs), each with its own LLC, shown as 16M L3 cache in figure 2. These diagrams (the same ones from the previous blog post) help illustrate these aspects of the chips.

Figure 1. Logical diagram of AMD EPYC Rome processor

Figure 2. Logical diagram of CCX

The CCX as NUMA or LLC as NUMA BIOS settings can be configured on most of the AMD EPYC 7002 Series processor–based servers. The specific name of the setting differs slightly between server vendors. When enabled, the server presents the four cores that share each L3 cache as a NUMA node. In the case of the EPYC 7742 (Rome) processors used in our testing, there are 8 CCDs that each have 2 CCXs, for a total of 16 CCXs per processor. With CCX as NUMA enabled, each processor is presented as 16 NUMA nodes, with 32 NUMA nodes total for our 2-socket server. This is quite different from the default setting of 1 NUMA node per socket, for a total of 2 NUMA nodes for the host.

This setting has some potential to improve performance by exposing the architecture of the processors as individual NUMA nodes, and one of the VMmark result details indicated that it might have been a factor in improving the performance for the benchmark.

VMmark 3 Testing

For this performance study, we used the following as the basis for all tests:

  • Two 2-socket systems with AMD EPYC 7742 processors and 1 TB of memory
  • 1 Dell EMC XtremIO all-flash array
  • VMware vSphere 6.7 U3 installed on a local NVMe disk
  • VMmark 3.1—we ran all tests with 14 VMmark tiles: that’s the maximum number of tiles that the default configuration could handle without failing quality of service (QoS). For more information about tiles, see “Unique Tile-Based Implementation” on the VMmark product page.

We tested the default cases first: CCX as NUMA disabled and NUMA Per Socket (NPS) set to 1. We then tested the configurations of NPS 2 and 4, both with and without CCX as NUMA enabled. Figure 3 below shows the results with the VMmark score and the average host CPU utilization. VMmark scores are reported relative to the score achieved by the default settings of NPS 1 and CCX as NUMA disabled.

At a high level, the VMmark score and the overall performance of the benchmark do not change much across the test cases. The differences fall within the 1-2% run-to-run variation we see with this benchmark and can be considered equivalent scores. We saw the lowest results with the NPS 2 and NPS 4 settings, which showed 3% and 5% reductions respectively, indicating that those settings don’t give good performance in our environment.

Figure 3. VMmark 3 performance on AMD 2nd Gen EPYC with NPS and CCX as NUMA Settings

We observed one clear trend in the results: a lower CPU utilization in the 6% to 8% range with CCX as NUMA enabled.  This shows that there are some small efficiency gains with CCX as NUMA in our test environment. We didn’t see a significant improvement in overall performance due to these efficiency gains; however, this might allow an additional tile to run on the cluster (we didn’t test this).  While some small gains in efficiency are possible with these settings, we don’t recommend moving away from the default settings for general-purpose, mixed-workload environments.  Instead, you should evaluate these advanced settings for specific applications before using them.

AMD EPYC Rome Application Performance on vSphere Series: Part 1 – SQL Server 2019

Exciting new server platforms based on the second generation of AMD EPYC processors (Rome) have become recently available from many of our hardware partners. The new Rome processors offer up to 64 cores per socket—that’s a big increase over the previous generation of AMD processors. This means that a two-socket server using these processors has 128 cores and 256 logical threads with simultaneous multi-threading (SMT) enabled, making two-socket servers look more like four-socket servers in terms of core counts.

This is the first blog in a series that will take a look at the performance of different workloads on the AMD EPYC Rome processor on VMware vSphere. Today we’re giving you the results of our tests on Microsoft SQL Server 2019.

The AMD EPYC Rome processor is built with Core Complex Dies (CCDs) connected via Infinity Fabric. In total, there are up to eight CCDs in the EPYC 7002 processor (Rome), as shown in figure 1.

Figure 1. Logical diagram of AMD EPYC Rome processor

Two Core Complexes (CCXs) comprise each CCD.  A CCX is up to four cores sharing an L3 cache, as shown in this additional logical diagram from AMD for a CCD, where the orange line separates the two CCXs.

Figure 2. Logical diagram of CCD

The AMD EPYC 7002 series processors in some ways simplify the architecture for many applications, including virtualized and private cloud deployment. There are more details on the EPYC Rome processor as well as a comparison to the previous generation AMD EPYC processors in a great article written by Anandtech.

AMD EPYC 7002 series (Rome) server processors are fully supported for vSphere 6.5 U3, vSphere 6.7 U3, and vSphere 7.0.  For all tests in this blog, vSphere 6.7 U3 was used.

The server used for testing here was a two-socket system with AMD EPYC 7742 processors and 1 TB of memory. Storage was an all-flash XtremIO Fibre Channel array with a 4TB LUN assigned to the test system. vSphere 6.7 U3 was installed on a local NVMe disk and used as the basis for all tests.

Testing with SQL Server 2019

Microsoft SQL Server 2019 is the current version of this popular relational database.  It’s widely used by VMware customers and is one of the most commonly used applications on the vSphere platform. It’s a good application to test the performance of both large- and medium-sized virtual machines.

For the test, we used the SQL Server workload of the DVD Store 3 benchmark. It’s an open-source online transaction processing (OLTP) workload that simulates an online store.  It uses many common database features such as indexes, foreign keys, stored procedures, and transactions.  The workload is measured in terms of orders per minute (OPM), where each order is made up of logging in, browsing the store, reading and rating reviews, adding items to the shopping cart, and purchasing them.

For all tests, the number of worker threads simulating users was increased in successive test runs until the maximum OPM was achieved and then began to decline, or stay the same, as additional threads were added. At this point, CPU utilization was between 90 and 100 percent.
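
That procedure amounts to a simple search for the throughput peak. The following sketch outlines the loop; run_dvdstore() is a hypothetical stand-in for launching one DVD Store test and returning its OPM, not part of the benchmark kit:

  def find_peak_opm(run_dvdstore, start_threads=8, step=8, max_threads=512):
      """Increase driver threads until orders per minute (OPM) stops improving."""
      best_threads, best_opm = start_threads, 0.0
      threads = start_threads
      while threads <= max_threads:
          opm = run_dvdstore(threads)   # hypothetical: run one test, return its OPM
          if opm <= best_opm:           # throughput flat or declining: past the peak
              break
          best_threads, best_opm = threads, opm
          threads += step
      return best_threads, best_opm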

We created a Windows Server 2019 VM and installed SQL Server 2019 on it. For the later tests, this VM was cloned multiple times so we could quickly scale out the test setup.

Scale Up Performance of a Monster VM

With such a large number of cores available, it was natural to test how much performance was possible when scaling a single VM up to the maximum number of vCPUs (a Monster VM). We configured the scaled-up VM with 512 GB of RAM and a DVD Store test database of about 400 GB.

We compared the maximum throughput for 64-vCPU and 128-vCPU VMs and found good scalability: the 128-vCPU VM achieved 1.86 times the throughput of the 64-vCPU VM (about 93% scaling efficiency). The small fall-off in scalability is due to the additional NUMA node, which introduces some extra memory latency, and to the slightly higher scheduling overhead the vSphere scheduler incurs with such a large number of cores.

Figure 3. Scale-up performance from 64 vCPUs to 128 vCPUs for a single VM.

 

Scale-Out Performance of Multiple VMs

To test the scale-out performance of a vSphere environment, we cloned the SQL Server 2019 VM until we had eight. We configured each VM with 16 vCPUs and 128 GB of RAM, which made the maximum number of active vCPUs in the test equal to the 128 cores in the server. Additionally, we configured the DVD Store test database to be about 100 GB to scale the workload to the size of the VM.

The results below show that the total throughput continues to increase as the number of VMs is increased to eight.  In total, the eight VMs were able to produce slightly over 6x what a single VM could achieve.

Figure 4. As we scaled out the 16-vCPU VM from 1 to 2, 4, and 8 VMs, we observed the eight VMs were able to produce slightly over 6x what a single VM could achieve.

Optimizing Performance Opportunities with AMD EPYC Rome

As mentioned at the beginning of this post, AMD EPYC Rome processors used in this test are made up of eight CCD modules, each with 8 cores.  Within each CCD there are two CCXs that share an L3 processor cache. Each CCD has an associated memory controller.  With default settings, all eight CCDs and their memory controllers act as one NUMA node with memory access interleaved across all memory controllers.

There is an option in the BIOS settings to partition the processor into multiple NUMA domains per socket.  This partitioning is based on grouping the CCDs and their associated memory controllers.  The option is referred to as NUMA per socket or NPS, and the default is 1.  This means that there is one NUMA node per socket.  The other options are to configure it to 2 or 4.  In the case where NPS is set to 4, there are 4 NUMA nodes per socket, with each NUMA node having 2 CCDs and their associated memory.

If the VM sizes align with the NPS setting in terms of cores and memory, then there is an opportunity for performance gains with some workloads. In the specific case of our scale-out testing above, there were 8 SQL Server VMs, each with 16 vCPUs and 128 GB of RAM. This lines up with an NPS 4 setting: 1 VM per NUMA node, with 16 vCPUs matching the 16 cores per NUMA node, and 128 GB of RAM per VM matching the 128 GB of RAM in each NPS 4–based NUMA node for our system with 1 TB of RAM.

When tested, this configuration of VMs with such nice alignment resulted in a 7.8% gain in throughput for the NPS 4 setting over the default of NPS 1.  NPS 2 showed only a negligible gain of 1%.

Figure 5. Because of good alignment, the NPS 4 setting gained 7.8% in throughput over the NPS 1 setting, compared to the NPS 2 setting, which showed only a 1% performance improvement.

It is important to note that not all workloads and VMs will gain 8%, or even any performance, just by using the NPS 4 setting. The performance gain in this case is due to the clean alignment of the VMs with NPS 4, compared to NPS 1, where multiple VMs were probably not confined to their own set of caches and were stepping on each other’s cache usage. In this specific scenario with NPS 4, each VM essentially has its own NUMA node with its own set of L3 processor caches and lower memory latency, because memory is interleaved across only the local memory of the CCDs being used. In circumstances where VM size is uniform and aligns nicely with one of these NPS settings, it is possible to obtain some modest performance gains. Please use these settings with caution and test their effect before using them in production.


Weathervane 2.0: An Application-Level Performance Benchmark for Kubernetes

Weathervane 2.0 lets you evaluate and compare the performance characteristics of on-premises and cloud-based Kubernetes clusters.

Kubernetes simplifies the deployment and management of applications on a wide range of infrastructure solutions. One cost of this simplification is a new set of decisions that you must make when acquiring or configuring Kubernetes clusters, including selecting the underlying infrastructure, configuring network and storage layers, sizing compute nodes, etc. Each choice can impact the performance of applications deployed on the cluster. Weathervane 2.0 helps you understand this impact, by:

  • Comparing the performance of Kubernetes clusters
  • Evaluating the impact of configuration decisions on cluster performance
  • Validating a new cluster before putting it into production

When using Weathervane, you only need to build a set of container images, edit a configuration file, and start the benchmark. Weathervane manages the deployment, execution, and tear-down of all benchmark components on your Kubernetes cluster.

Weathervane 2.0 is available at https://github.com/vmware/weathervane.


CPU Hot Add Performance in vSphere 6.7

Leaving CPU Hot Add at its default setting of disabled is one of the performance best practices that we have for large VMs. From the Performance Best Practices Guide for vSphere 6.7 U2:

CPU Hot Add is a feature that allows the addition of vCPUs to a running virtual machine. Enabling this feature, however, disables vNUMA for that virtual machine, resulting in the guest OS seeing a single vNUMA node. Without vNUMA support, the guest OS has no knowledge of the CPU and memory virtual topology of the ESXi host. This in turn could result in the guest OS making sub-optimal scheduling decisions, leading to reduced performance for applications running in large virtual machines. For this reason, enable CPU Hot Add only if you expect to use it. Alternatively, plan to power down the virtual machine before adding vCPUs, or configure the virtual machine with the maximum number of vCPUs that might be needed by the workload. If choosing the latter option, note that unused vCPUs incur a small amount of unnecessary overhead. Unused vCPUs could also cause the guest OS to make poor scheduling decisions within the virtual machine, again with the potential for reduced performance. For additional information see VMware KB article 2040375.

The reason for this is that enabling CPU Hot Add disables virtual NUMA. This means the VM is not aware of which of its vCPUs are on the same NUMA node, which might increase remote memory access. This removes the ability of the guest OS and applications to optimize based on NUMA and can result in reduced performance.

Virtual NUMA (vNUMA) exposes NUMA topology to the guest operating system, allowing NUMA-aware guest operating systems and applications to make the most efficient use of the underlying hardware’s NUMA architecture. (For more information about NUMA, see page 27 in the Performance Best Practices Guide for vSphere 6.7 U2.)

To get an idea of the performance impact of enabling CPU Hot Add, we ran a simple test in our lab environment. This test found that performance with the default setting of CPU Hot Add disabled was 2% to 8% better than with CPU Hot Add enabled.
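
If you want to audit which VMs in an environment have CPU Hot Add enabled before making changes, a short pyVmomi sketch like the following can list them. The vCenter address and credentials are placeholders; adjust them for your environment:

  import ssl
  from pyVim.connect import SmartConnect, Disconnect
  from pyVmomi import vim

  # Placeholder connection details; use your own vCenter and credentials.
  ctx = ssl.create_default_context()
  ctx.check_hostname = False
  ctx.verify_mode = ssl.CERT_NONE
  si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                    pwd="password", sslContext=ctx)

  content = si.RetrieveContent()
  view = content.viewManager.CreateContainerView(content.rootFolder,
                                                 [vim.VirtualMachine], True)

  # List powered-on VMs that have CPU Hot Add enabled (and therefore no vNUMA).
  for vm in view.view:
      cfg = vm.config
      if cfg and cfg.cpuHotAddEnabled and vm.runtime.powerState == "poweredOn":
          print(f"{vm.name}: {cfg.hardware.numCPU} vCPUs, CPU Hot Add enabled")

  view.Destroy()
  Disconnect(si)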


New White Paper: High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning

By Dave Jaffe, VMware Performance Engineering

A new white paper is available showing the advantages of running virtualized Spark Deep Learning workloads on Kubernetes.

Recent versions of Spark include support for Kubernetes. For Spark on Kubernetes, the Kubernetes scheduler provides the cluster manager capability provided by Yet Another Resource Negotiator (YARN) in typical Spark on Hadoop clusters. Upon receiving a spark-submit command to start an application, Kubernetes instantiates the requested number of Spark executor pods, each with one or more Spark executors.
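
As an illustration of what this looks like from the application side, the following PySpark sketch configures a session to run against a Kubernetes cluster. The API server address, container image, namespace, and executor sizing are placeholders, not the configuration used in the white paper:

  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("spark-on-k8s-example")
      .master("k8s://https://203.0.113.10:6443")                  # placeholder API server
      .config("spark.kubernetes.namespace", "spark")
      .config("spark.kubernetes.container.image",
              "registry.example.com/spark-py:3.0.0")              # placeholder image
      .config("spark.executor.instances", "16")                   # number of executor pods
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "16g")
      .getOrCreate()
  )

  # Kubernetes launches the requested executor pods for this application; with
  # spark-submit --deploy-mode cluster, the driver also runs in its own pod.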

The benefits of running Spark on Kubernetes are many: ease of deployment, resource sharing, simplified coordination between developer and cluster administrator, and enhanced security. A standalone Spark cluster on vSphere virtual machines and a Kubernetes-managed Spark cluster on vSphere virtual machines, running in the same configuration, were compared for performance using a heavy workload, and the difference imposed by Kubernetes was found to be insignificant.

Spark applications running in Standalone mode require that every Spark worker node be installed with the correct version of Spark, Python, Java, etc. This puts a burden on the IT administrator, who may be managing many Spark applications with different requirements, and it requires coordination between the administrator and the application developer. With Kubernetes, the developer only needs to create a container with the correct software, and the IT administrator just needs to manage the cluster using the fine-grained resource management tools to enable the different Spark workloads.

To compare Spark Standalone performance to Spark on Kubernetes performance, a Deep Learning workload, the Maximum Throughput Spark BigDL ResNet50 image classifier from VMware IoT Analytics Benchmark, was run on the same 16 worker nodes, first while configured as Spark worker nodes, then while configured as Kubernetes nodes. Then the number of nodes was reduced by four (by removing the four workers on host 4), and the same comparison was made using 12 nodes, then 8, then 4.

The relative results are shown below. The Spark Standalone and Spark on Kubernetes performance in terms of images per second classified was within ~1% of each other for all configurations. Performance scaled well for the Spark tests as the number of VMs increased from 4 (1 server) to 16 (4 servers).

All details are in the paper.

vSphere 6.7 Update 3 Supports AMD EPYC™ Generation 2 Processors, VMmark Showcases Its Leadership Performance

Two leadership VMmark benchmark results have been published with AMD EPYC™ Generation 2 processors running VMware vSphere 6.7 Update 3 on a two-node two-socket cluster and a four-node cluster. VMware worked closely with AMD to enable support for AMD EPYC™ Generation 2 in the VMware vSphere 6.7 U3 release.

The VMmark benchmark is a free tool used by hardware vendors and others to measure the performance, scalability, and power consumption of virtualization platforms and has become the standard by which the performance of virtualization platforms is evaluated.

The new AMD EPYC™ Generation 2 performance results can be found here and here.

View all VMmark results
Learn more about VMmark
These benchmark result claims are valid as of the date of writing.

Introducing VMmark ML

VMmark has been the go-to virtualization benchmark for over 12 years. It’s been used by partners, customers, and internally in a wide variety of technical applications. VMmark1, released in 2007, was the de-facto virtualization consolidation benchmark in a time when the overhead and feasibility of virtualization was still largely in question. In 2010, as server consolidation became less of an “if” and more of a “when,” VMmark2 introduced more of the rich vSphere feature set by incorporating infrastructure workloads (VMotion, Storage VMotion, and Clone & Deploy) alongside complex application workloads like DVD Store. Fast forward to 2017, and we released VMmark3, which builds on the previous versions by integrating an easy automation deployment service alongside complex multi-tier modern application workloads like Weathervane. To date, across all generations, we’ve had nearly 300 VMmark result publications (297 at the time of this writing) and countless internal performance studies.

Unsurprisingly, tech industry environments have continued to evolve, and so must the benchmarks we use to measure them. It’s in this vein that the VMware VMmark performance team has begun experimenting with other use cases that don’t quite fit the “traditional” VMmark benchmark. One example of a non-traditional use is Machine Learning and its execution within Kubernetes clusters. At the time of this writing, nearly 9% of the VMworld 2019 US sessions are about ML and Kubernetes. As such, we thought this might be a good time to provide an early teaser to VMmark ML and even point you at a couple of other performance-centric Machine Learning opportunities at VMworld 2019 US.

Although it’s very early in the VMmark ML development cycle, we understand that there’s a need for push-button-easy, vSphere-based Machine Learning performance analysis. As an added bonus, our prototype runs within Kubernetes, which we believe to be well-suited for this type of performance analysis.

Our internal-only VMmark ML prototype is currently streamlined to efficiently perform a limited number of operations very well as we work with partners, customers, and internal teams on how VMmark ML should be exercised. It is able to:

  1. Rapidly deploy Kubernetes within a vSphere environment.
  2. Deploy a variety of containerized ML workloads within our newly created VMmark ML Kubernetes cluster.
  3. Execute these ML workloads either in isolation or concurrently to determine the performance impact of architectural, hardware, and software design decisions.

VMmark ML development is still very fluid right now, but we decided to test some of these concepts/assumptions in a “real-world” situation. I’m fortunate to work alongside long-time DVD Store author and Big Data guru Dave Jaffe on VMmark ML.  As he and Sr. Technical Marketing Architect Justin Murray were preparing for their VMworld US talk, “High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning [BCA1563BU]“, we thought this would be a good opportunity to experiment with VMmark ML. Dave was able to use the VMmark ML prototype to deploy a 4-node Kubernetes cluster onto a single vSphere host with a 2nd-Generation Intel® Xeon® Scalable processor (“Cascade Lake”) CPU. VMmark ML then pulled a previously stored Docker container with several MLperf workloads contained within it. Finally, as a concurrent execution exercise, these workloads were run simultaneously, pushing the CPU utilization of the server above 80%. Additionally, Dave is speaking about vSphere Deep Learning performance in his talk “Optimize Virtualized Deep Learning Performance with New Intel Architectures [MLA1594BU],“ where he and Intel Principal Engineer Padma Apparao explore the benefits of Vector Neural Network Instructions (VNNI). I definitely recommend either of these talks if you want a deep dive into the details of VNNI or Spark analysis.

Another great opportunity to learn about VMware Performance team efforts within the Machine Learning space is to attend the Hands-on-Lab Expert Lead Workshop, “Launch Your Machine Learning Workloads in Minutes on VMware vSphere [ELW-2048-01-EMT_U],” or take the accompanying lab. This is being led by another VMmark ML team member Uday Kurkure along with Staff Global Solutions Consultant Kenyon Hensler. (Sign up for the Expert Lead using the VMworld 2019 mobile application or on my.vmworld.com.)

Our goal after VMworld 2019 US is to continue discussions with partners, customers, and internal teams about how a benchmark like VMmark ML would be most useful. We also hope to complete our integration of Spark within Kubernetes on vSphere and reproduce some of the performance analysis done to date. Stay tuned to the performance blog for additional posts and details as they become available.