Tag Archives: EPYC

AMD 2nd Gen EPYC (Rome) Application Performance on vSphere Series: Part 3 – View Planner Virtual Desktops

By Rahul Garg, Milind Inamdar, and Qasim Ali

In the first two parts of this series, we looked at the performance of a SQL Server database using the DVD Store 3 benchmark and of a mixed workload environment with the VMmark benchmark, both running on AMD 2nd Gen EPYC processors and VMware vSphere. In this third part of the series, we look at virtual desktop performance on this platform with the VMware View Planner benchmark.

View Planner

View Planner is VMware’s internally developed benchmark for virtual desktops, or virtual desktop infrastructure (VDI). It simulates users doing common tasks with popular applications like word processors, spreadsheets, and web browsers.  View Planner measures how many virtual desktops can be run while maintaining response times within established limits.

Complete details about View Planner, including how to set up and run it, are documented in the VMware View Planner User Guide. We used version 4.3.


Figure 1. VMware View Planner 4.3 User Guide 

AMD EPYC

AMD 2nd Generation EPYC processors have an architecture and associated NUMA BIOS settings that we explored in the initial two posts in this series. Here, we look at how NPS and CCX as NUMA settings (defined below) affect the performance of a VDI environment with View Planner.

Recap: What are CCDs and CCXs?

In the previous AMD EPYC blogs, we included figures that show the AMD 2nd Generation EPYC processors have up to 8 Core Complex Dies (CCDs) connected via Infinity Fabric. Each CCD is made up of two Core Complexes (CCXs). For reference, we’ve included the following diagrams again.


Figure 2. Logical diagram of AMD 2nd Generation EPYC processor

Each CCD comprises two Core Complexes (CCXs). A CCX is up to four cores sharing an L3 cache, as shown in the following logical diagram from AMD of a CCD, where the orange line separates the two CCXs.


Figure 3. Logical diagram of CCD/CCX

What is NUMA Per Socket?

NUMA Per Socket (NPS) is a BIOS setting on the ESXi host that allows the processor to be partitioned into multiple NUMA domains per socket. Each domain is a grouping of CCDs and their associated memory. In the default case of NPS1, memory is interleaved across all the CCDs as a single NUMA node. With NPS2, the CCDs and their memory are split into two groups. Finally, with NPS4, the processor is partitioned into 4 nodes with 2 CCDs each and their associated memory.
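To make the grouping concrete, here is a minimal Python sketch of how the 8 CCDs of one socket and their interleaved memory split into NUMA nodes under NPS1, NPS2, and NPS4. The per-socket memory size is only an illustrative parameter, not a claim about any particular server.

```python
# Minimal sketch: how NPS1/NPS2/NPS4 partition one 2nd Gen EPYC socket.
# The 1024 GB per-socket memory value is only an illustrative assumption.

def nps_layout(nps, ccds_per_socket=8, socket_memory_gb=1024):
    """Group the socket's CCDs and their interleaved memory into `nps` NUMA nodes."""
    ccds_per_node = ccds_per_socket // nps
    return [
        {
            "node": node,
            "ccds": list(range(node * ccds_per_node, (node + 1) * ccds_per_node)),
            "memory_gb": socket_memory_gb // nps,
        }
        for node in range(nps)
    ]

for nps in (1, 2, 4):
    print(f"NPS{nps}:", nps_layout(nps))
```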

What is CCX as NUMA?

CCX as NUMA presents each CCX as a NUMA domain. When enabled, this causes each AMD 2nd Generation EPYC processor used in our tests to have 16 NUMA nodes, for a total of 32 across both sockets.

Performance Tests

For this performance study, we used the following as the basis for all tests:

  • Two 2-socket systems with AMD EPYC 7742 processors, 2TB of memory, and 4.66TB of storage
  • Windows 10, version 1909: We installed this as the desktop VM and configured it with 2 vCPUs, 4GB of memory, and 50GB of disk space.
  • VMware vSphere 6.7 U3
  • View Planner 4.3: Legacy Mode enabled.
  • Horizon View 7.11: We used Instant Clones for VDI setup. Refer to VMware Horizon 7 Instant-Clone Desktops and RDSH Servers for details.

For all tests, the number of desktop VMs was increased in successive test runs until we hit the quality-of-service (QoS) limit of View Planner.
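Conceptually, that procedure is the simple ramp sketched below; run_view_planner and passes_qos are hypothetical placeholders standing in for an actual View Planner run and its QoS check, not real View Planner interfaces.

```python
# Conceptual sketch of the sizing loop: grow the desktop count in steps and
# rerun the workload until View Planner's QoS limit is exceeded.
# run_view_planner() and passes_qos() are hypothetical placeholders,
# not real View Planner interfaces.

def find_max_desktops(run_view_planner, passes_qos, start=100, step=25):
    desktops = start
    best = None
    while True:
        results = run_view_planner(num_desktops=desktops)
        if not passes_qos(results):
            return best          # highest desktop count that stayed within QoS
        best = desktops
        desktops += step
```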

We tested NPS1, NPS2, and NPS4, both with and without CCX as NUMA enabled. The default configuration of NPS1 with CCX as NUMA disabled served as the baseline, and all results in the comparison are shown relative to it.

Because each ESXi host has a large amount of CPU capacity (128 cores across two sockets), it could host a large number of Windows 10 desktops. With each desktop VM assigned 4GB of RAM, the total RAM assigned to VMs exceeded the 2TB of physical RAM in the host, meaning the system was memory overcommitted.
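The overcommitment itself is simple arithmetic, sketched below with a hypothetical desktop count (the actual counts are reported only relative to the baseline).

```python
# Back-of-the-envelope memory overcommitment check for one ESXi host.
# The desktop count is a hypothetical example, not a measured result.
physical_ram_gb = 2 * 1024          # 2TB of physical RAM in the host
ram_per_desktop_gb = 4              # each Windows 10 desktop VM gets 4GB

desktops = 600                      # hypothetical count for illustration
assigned_gb = desktops * ram_per_desktop_gb
print(f"{assigned_gb} GB assigned vs {physical_ram_gb} GB physical: "
      f"{assigned_gb / physical_ram_gb:.2f}x overcommitted")
```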

Which configurations support the most desktops within QoS?

The results below show (in green) that NPS1, NPS2, and NPS4 without CCX as NUMA all support a number of desktops within a couple of percent of each other. They also show (in blue) a decline of about 7% in the number of desktops supported when CCX as NUMA is enabled.

Figure 4. View Planner VM consolidation with AMD 2nd Gen EPYC with NPS and CCX as NUMA settings

We see this performance degradation when CCX as NUMA is enabled for the following reason: View Planner is a VDI workload that benefits from ESXi's memory page sharing when the environment is memory overcommitted, as it was in these tests. With instant clones, page sharing works by sharing the memory of the running parent VM from which each instant clone is created. However, if an instant clone runs on a different NUMA node than the parent VM, that page sharing may be lost: by default, ESXi shares pages only within a NUMA node because remote memory access is often costly. With CCX as NUMA, 32 NUMA nodes are exposed to ESXi, which confines page sharing to much smaller nodes. The result is a significant loss of page sharing savings, which in turn leads to memory ballooning.
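A toy model helps show why exposing more NUMA nodes shrinks the page sharing opportunity. It assumes one running parent VM per host, clones spread evenly across the NUMA nodes, and a fixed fraction of each clone's memory that could be shared with the parent; all three assumptions are simplifications for illustration, not a description of ESXi or Horizon internals.

```python
# Toy model: with intra-node page sharing only, just the clones that land on
# the parent VM's NUMA node can share pages with it. One parent per host, an
# even spread of clones, and a 50% sharable fraction are illustrative
# assumptions, not ESXi or Horizon internals.

def sharable_gb(numa_nodes, clones=600, clone_mem_gb=4, sharable_fraction=0.5):
    clones_on_parent_node = clones / numa_nodes
    return clones_on_parent_node * clone_mem_gb * sharable_fraction

for nodes in (2, 32):   # 2 = default (NPS1), 32 = CCX as NUMA on this host
    print(f"{nodes:>2} NUMA nodes: ~{sharable_gb(nodes):.0f} GB sharable with the parent")
```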

To confirm the reason for the ~7% performance degradation, we performed internal testing that allowed us to enable inter-NUMA page sharing. With this test configuration, there was no memory ballooning, and we gained the ~7% performance back.

So, what is the answer? The default is best.

The View Planner tests show that the ESXi hosts’ out-of-the-box settings of NPS1 and CCX as NUMA disabled provide the best performance for VDI workloads on vSphere 6.7 U3 and AMD 2nd Generation EPYC processors.

Commentary about CCX as NUMA

CCX as NUMA exposes each CCX as its own NUMA domain to the hypervisor or operating system. This does not reflect the actual NUMA architecture; it is provided as an option to aid hypervisor or OS schedulers in scenarios that are not yet highly optimized for the AMD architecture, in which there are multiple Last Level Caches (LLCs) per NUMA node.

This misrepresentation can skew certain NUMA-related assumptions that hypervisors and operating systems are built upon (for example, memory page sharing, as described above). We believe that once hypervisor and OS schedulers are fully optimized for this architecture, the CCX as NUMA option may no longer be needed.

AMD 2nd Gen EPYC (Rome) Application Performance on vSphere Series: Part 2 – VMmark

In recently published benchmarks with VMware VMmark, we’ve seen lots of great results with the AMD EPYC 7002 Series (known as 2nd Gen EPYC, or “Rome”) by several of our partners. These results show how well a mixed workload environment with many virtual machines and infrastructure operations like vMotion can perform with new server platforms.

This is the second part of our series covering application performance on VMware vSphere running with AMD 2nd Gen EPYC processors (also see Part 1 of this series). This post focuses on the VMmark benchmark used as an application workload.

We used the following hardware and software in our testbed:

  • AMD 2nd Gen EPYC processors (“Rome”)
  • Dell EMC XtremIO all-flash array
  • VMware vSphere 6.7 U3
  • VMware VMmark 3.1

VMmark

VMmark is a benchmark widely used to study virtualization performance. Many VMware partners have used this benchmark, from the initial release of VMmark 1.0 in 2006 up to the current release of VMmark 3.1 in 2019, to publish official results that you can find on the VMmark 3.x results site. This long history and large set of results can give you an understanding of performance on platforms you’re considering using in your datacenters. AMD EPYC 7002–based systems have done well and, in some cases, have established leadership results. As of the publishing of this blog, VMmark results on AMD EPYC 7002 have been published by HPE, Dell EMC, and Lenovo.

If you look through the details of these published results, you can find some interesting information, like disclosed configuration options or settings that might provide performance improvements in your environment. For example, in the details of a Dell EMC result, you can find that the server BIOS setting NUMA Nodes Per Socket (NPS) was set to 4. And in the HPE submission, you can find that the server BIOS setting Last Level Cache (LLC) as NUMA Node was set to enabled. This is also referred to as CCX as NUMA because each CCX has its own LLC. (CCX is described in the following section.)

We put together a VMmark test setup using AMD EPYC Rome 2-socket servers to evaluate performance of these systems internally. This gave us the opportunity to see how the NPS and CCX as NUMA BIOS settings affected performance for this workload.

In the initial post of this series, we looked at the effect of the NPS BIOS settings on a database-specific workload and described the specifics of those settings. We also provided a brief overview and links to additional related resources. This post builds on that one, so we highly recommend reading the earlier post if you’re not already familiar with the BIOS NPS settings and the details for the AMD EPYC 7002 series.

AMD EPYC CCX as NUMA

Each EPYC processor is made up of up to 8 Core Complex Dies (CCDs) that are connected by AMD’s Infinity Fabric (figure 1). Inside each CCD, there are two Core Complexes (CCXs), each with its own last-level cache (LLC), shown as the 16M L3 cache in figure 2. These diagrams (the same ones from the previous blog post) help illustrate these aspects of the chips.

Figure 1. Logical diagram of AMD EPYC Rome processor

Figure 2. Logical diagram of CCX

The CCX as NUMA or LLC as NUMA BIOS setting can be configured on most AMD EPYC 7002 Series processor–based servers; the specific name of the setting varies by server vendor. When enabled, the server presents the four cores that share each L3 cache as a NUMA node. In the case of the EPYC 7742 (Rome) processors used in our testing, there are 8 CCDs that each have 2 CCXs, for a total of 16 CCXs per processor. With CCX as NUMA enabled, each processor is presented as 16 NUMA nodes, for 32 NUMA nodes total on our 2-socket server. This is quite different from the default of 1 NUMA node per socket, for a total of 2 NUMA nodes on the host.
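The node counts follow directly from the topology; here is a quick sketch of the arithmetic for our 2-socket EPYC 7742 hosts.

```python
# NUMA node counts exposed by a 2-socket EPYC 7742 host.
sockets = 2
ccds_per_socket = 8
ccxs_per_ccd = 2                                       # each CCX has its own L3 (LLC)

ccxs_per_socket = ccds_per_socket * ccxs_per_ccd       # 16
default_nodes = sockets                                # NPS1, CCX as NUMA disabled -> 2
ccx_as_numa_nodes = sockets * ccxs_per_socket          # CCX as NUMA enabled -> 32
print(default_nodes, ccx_as_numa_nodes)                # 2 32
```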

This setting has some potential to improve performance by exposing the architecture of the processors as individual NUMA nodes, and one of the VMmark result details indicated that it might have been a factor in improving the performance for the benchmark.

VMmark 3 Testing

For this performance study, we used the following as the basis for all tests:

  • Two 2-socket systems with AMD EPYC 7742 processors and 1 TB of memory
  • 1 Dell EMC XtremIO all-flash array
  • VMware vSphere 6.7 U3 installed on a local NVMe disk
  • VMmark 3.1—we ran all tests with 14 VMmark tiles: that’s the maximum number of tiles that the default configuration could handle without failing quality of service (QoS). For more information about tiles, see “Unique Tile-Based Implementation” on the VMmark product page.

We tested the default case first: CCX as NUMA disabled and NUMA Per Socket (NPS) set to 1. We then tested the configurations of NPS 2 and NPS 4, both with and without CCX as NUMA enabled. Figure 3 below shows the results with the VMmark score and the average host CPU utilization. VMmark scores are reported relative to the score achieved by the default settings of NPS 1 and CCX as NUMA disabled.

At a high level, the VMmark score and the overall performance of the benchmark do not change by a large amount across the test cases. Most of the differences fall within the 1-2% run-to-run variation we see with this benchmark and can be considered equivalent scores. The lowest results were with the NPS 2 and NPS 4 settings, which showed 3% and 5% reductions respectively, indicating that those settings don’t give the best performance in our environment.

Figure 3. VMmark 3 performance on AMD 2nd Gen EPYC with NPS and CCX as NUMA Settings

We observed one clear trend in the results: a lower CPU utilization in the 6% to 8% range with CCX as NUMA enabled.  This shows that there are some small efficiency gains with CCX as NUMA in our test environment. We didn’t see a significant improvement in overall performance due to these efficiency gains; however, this might allow an additional tile to run on the cluster (we didn’t test this).  While some small gains in efficiency are possible with these settings, we don’t recommend moving away from the default settings for general-purpose, mixed-workload environments.  Instead, you should evaluate these advanced settings for specific applications before using them.

vSphere 6.7 Update 3 Supports AMD EPYC™ Generation 2 Processors, VMmark Showcases Its Leadership Performance

Two leadership VMmark benchmark results have been published with AMD EPYC™ Generation 2 processors running VMware vSphere 6.7 Update 3 on a two-node two-socket cluster and a four-node cluster. VMware worked closely with AMD to enable support for AMD EPYC™ Generation 2 in the VMware vSphere 6.7 U3 release.

The VMmark benchmark is a free tool used by hardware vendors and others to measure the performance, scalability, and power consumption of virtualization platforms, and it has become the standard by which virtualization platform performance is evaluated.

The new AMD EPYC™ Generation 2 performance results can be found here and here.

View all VMmark results
Learn more about VMmark
These benchmark result claims are valid as of the date of writing.