Virtualization Performance

AMD 2nd Gen EPYC (Rome) Application Performance on vSphere Series: Part 2 – VMmark

In recently published benchmarks with VMware VMmark, we’ve seen lots of great results with the AMD EPYC 7002 Series (known as 2nd Gen EPYC, or “Rome”) by several of our partners. These results show how well a mixed workload environment with many virtual machines and infrastructure operations like vMotion can perform with new server platforms.

This is the second part of our series covering application performance on VMware vSphere running with AMD 2nd Gen EPYC processors (also see Part 1 of this series). This post focuses on the VMmark benchmark used as an application workload.

We used the following hardware and software in our testbed:

  • AMD 2nd Gen EPYC processors (“Rome”)
  • Dell EMC XtremIO all-flash array
  • VMware vSphere 6.7 U3
  • VMware VMmark 3.1

VMmark

VMmark is a benchmark widely used to study virtualization performance. Many VMware partners have used this benchmark—from the initial release of VMmark 1.0 in 2006, up to the current release of VMmark 3.1 in 2019—to publish official results that you can find on the VMmark 3.x results site. This long history and large set of results can give you an understanding of performance on platforms you’re considering using in your datacenters. AMD EPYC 7002–based systems have done well and, in some cases, have established leadership results. At the publishing of this blog, VMmark results on AMD EPYC 7002 have been published by HPE, Dell | EMC, and Lenovo.

If you look though the details of these published results, you can find some interesting information like disclosed configuration options or settings that might provide performance improvements to your environment. For example, in the details of a Dell |EMC result, you can find that the server BIOS setting Numa Nodes Per Socket (NPS) was set to 4. And in the HPE submission, you can find that the server BIOS setting Last Level Cache (LLC) as NUMA Node was set to enabled. This is also referred to as CCX as NUMA because each CCX has its own LLC. (CCX is described in the following section.)

We put together a VMmark test setup using AMD EPYC Rome 2-socket servers to evaluate performance of these systems internally. This gave us the opportunity to see how the NPS and CCX as NUMA BIOS settings affected performance for this workload.

In the initial post of this series, we looked at the effect of the NPS BIOS settings on a database-specific workload and described the specifics of those settings. We also provided a brief overview and links to additional related resources. This post builds on that one, so we highly recommend reading the earlier post if you’re not already familiar with the BIOS NPS settings and the details for the AMD EPYC 7002 series.

AMD EPYC CCX as NUMA

Each EPYC processor is made up of up to 8 Core Complex Dies (CCDs) that are connected by AMD’s Infinity fabric (figure 1). Inside each CCD, there are two Core Complexes (CCXs) that each have their own LLC cache shown as 16M L3 cache in figure 2. These diagrams (the same ones from the previous blog post) are helpful in illustrating these aspects of the chips.

Figure 1. Logical diagram of AMD EPYC Rome processor

Figure 2. Logical diagram of CCX

The CCX as NUMA or LLC as NUMA BIOS settings can be configured on most of the AMD EPYC 7002 Series processor–based servers. The specific name of the setting will be slightly different for different server vendors. When enabled, the server will present the four cores that share each L3 cache as a NUMA node. In the case of the EPYC 7742 (Rome) processors used in our testing, there are 8 CCDs that each have 2 CCXs for a total 16 CCXs per processor. With CCX as NUMA enabled, each processor is presented as 16 NUMA nodes, with 32 NUMA nodes total for our 2-socket server. This is quite different from the default setting of 1 NUMA node per socket for a total of 2 NUMA nodes for the host.

This setting has some potential to improve performance by exposing the architecture of the processors as individual NUMA nodes, and one of the VMmark result details indicated that it might have been a factor in improving the performance for the benchmark.

VMmark 3 Testing

For this performance study, we used as the basis for all tests:

  • Two 2-socket systems with AMD EPYC 7742 processors and 1 TB of memory
  • 1 Dell EMC XtremIO all-flash array
  • VMware vSphere 6.7 U3 installed on a local NVMe disk
  • VMmark 3.1—we ran all tests with 14 VMmark tiles: that’s the maximum number of tiles that the default configuration could handle without failing quality of service (QoS). For more information about tiles, see “Unique Tile-Based Implementation” on the VMmark product page.

We tested the default cases first: CCX as NUMA turned off and Numa Per Socket (NPS) set to 1. We then tested the configurations of NPS 2 and 4, both with and without CCX as NUMA enabled.  Figure 3 below shows the results with the VMmark score and the average host CPU utilization.  VMmark scores are reported relative to the score achieved by the default settings of NPS 1 and CCX as NUMA turned off.

From a high level, the VMmark score and the overall performance of the benchmark does not change a large amount across all of the test cases. These fall within the 1-2% run-to-run variation we see with this benchmark and can be considered equivalent scores. We saw the lowest results with NPS 2 and NPS 4 settings. These results showed 3% and 5% reductions respectively, indicating that those settings don’t give good performance in our environment.

Figure 3. VMmark 3 performance on AMD 2nd Gen EPYC with NPS and CCX as NUMA Settings

We observed one clear trend in the results: a lower CPU utilization in the 6% to 8% range with CCX as NUMA enabled.  This shows that there are some small efficiency gains with CCX as NUMA in our test environment. We didn’t see a significant improvement in overall performance due to these efficiency gains; however, this might allow an additional tile to run on the cluster (we didn’t test this).  While some small gains in efficiency are possible with these settings, we don’t recommend moving away from the default settings for general-purpose, mixed-workload environments.  Instead, you should evaluate these advanced settings for specific applications before using them.