Exciting new server platforms based on the second generation of AMD EPYC processors (Rome) have recently become available from many of our hardware partners. The new Rome processors offer up to 64 cores per socket—that’s a big increase over the previous generation of AMD processors. This means that a two-socket server using these processors has 128 cores and 256 logical threads with simultaneous multi-threading (SMT) enabled, making two-socket servers look more like four-socket servers in terms of core counts.
This is the first blog in a series that will take a look at the performance of some different workloads on the AMD EPYC Rome processor on VMware vSphere. Today we’re giving you the results of our tests on Microsoft SQL Server 2019.
The AMD EPYC Rome processor is built with Core Complex Dies (CCDs) connected via Infinity Fabric. In total, there are up to eight CCDs in the EPYC 7002 processor (Rome), as shown in figure 1.
Figure 1. Logical diagram of AMD EPYC Rome processor
Each CCD comprises two Core Complexes (CCXs). A CCX contains up to four cores sharing an L3 cache, as shown in this additional logical diagram from AMD for a CCD, where the orange line separates the two CCXs.
Figure 2. Logical diagram of CCD
The AMD EPYC 7002 series processors in some ways simplify the architecture for many applications, including virtualized and private cloud deployments. There are more details on the EPYC Rome processor, as well as a comparison to the previous generation of AMD EPYC processors, in a great article written by AnandTech.
AMD EPYC 7002 series (Rome) server processors are fully supported for vSphere 6.5 U3, vSphere 6.7 U3, and vSphere 7.0. For all tests in this blog, vSphere 6.7 U3 was used.
The server used for testing here was a two-socket system with AMD EPYC 7742 processors and 1 TB of memory. Storage was an all-flash XtremIO Fibre Channel array with a 4 TB LUN assigned to the test system. vSphere 6.7 U3 was installed on a local NVMe disk and used as the basis for all tests.
Testing with SQL Server 2019
Microsoft SQL Server 2019 is the current version of this popular relational database. It’s widely used by VMware customers and is one of the most commonly used applications on the vSphere platform. It’s a good application to test the performance of both large- and medium-sized virtual machines.
For the test, we used the SQL Server workload of the DVD Store 3 benchmark. It’s an open-source online transaction processing (OLTP) workload that simulates an online store. It uses many common database features such as indexes, foreign keys, stored procedures, and transactions. The workload is measured in terms of orders per minute (OPM), where each order is made up of logging in, browsing the store, reading and rating reviews, adding items to the shopping cart, and purchasing them.
For all tests, the number of worker threads simulating users was increased in successive test runs until the maximum OPM was achieved and throughput began to decline, or stayed flat, as additional threads were added. At this point, CPU utilization was between 90 and 100 percent.
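The ramp methodology above can be sketched in a few lines. This is an illustrative outline only, not the actual DVD Store driver; `run_benchmark` is a hypothetical stand-in for launching a test run at a given thread count and collecting its OPM result.

```python
# Sketch of the ramp methodology: raise the simulated-user thread count in
# successive runs until orders per minute (OPM) stops improving.

def find_peak_opm(run_benchmark, start_threads=8, step=8, max_threads=512):
    """Ramp the thread count upward and return (best_threads, best_opm)."""
    best_threads, best_opm = 0, 0.0
    threads = start_threads
    while threads <= max_threads:
        opm = run_benchmark(threads)
        if opm <= best_opm:      # throughput flat or declining: peak found
            break
        best_threads, best_opm = threads, opm
        threads += step
    return best_threads, best_opm

# Synthetic throughput curve that saturates at 64 threads, for demonstration
def synthetic_run(threads):
    return min(threads, 64) * 1000 - max(0, threads - 64) * 50

print(find_peak_opm(synthetic_run))
```

A real harness would also average several runs per thread count and verify that CPU utilization is in the expected 90–100 percent range at the peak.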
We created a Windows Server 2019 VM and installed SQL Server 2019 on it. For the later tests, this VM was cloned multiple times so we could quickly scale out the test setup.
Scale Up Performance of a Monster VM
With such a large number of cores available, it was natural to test how much performance was possible when scaling up to the maximum number of vCPUs per VM (a Monster VM). We configured the scaled-up VM with 512 GB of RAM and a DVD Store test database of about 400 GB.
We compared the maximum throughput for 64- and 128-vCPU VMs and found good scalability. The 128-vCPU VM achieved 1.86 times the throughput of the 64-vCPU VM. The small fall-off in scalability is due to the additional NUMA node, which introduces some extra memory latency, and to the slightly higher overhead the vSphere scheduler incurs managing such a large number of cores.
Figure 3. Scale-up performance from 64 vCPUs to 128 vCPUs for a single VM.
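The scale-up result can be sanity-checked with simple arithmetic: doubling from 64 to 128 vCPUs would ideally double throughput, so the observed 1.86x speedup corresponds to roughly 93% scaling efficiency.

```python
# Scale-up efficiency: observed speedup divided by ideal (linear) speedup.
observed_speedup = 1.86   # 128-vCPU VM throughput relative to the 64-vCPU VM
ideal_speedup = 128 / 64  # doubling vCPUs would ideally double throughput

efficiency = observed_speedup / ideal_speedup
print(f"Scale-up efficiency: {efficiency:.0%}")  # 93%
```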
Scale-Out Performance of Multiple VMs
To test the scale-out performance of a vSphere environment, we cloned the SQL Server 2019 VM until we had eight. We configured each VM with 16 vCPUs and 128 GB of RAM, which made the maximum number of active vCPUs in the test equal to the 128 physical cores in the server. We also sized each DVD Store test database at about 100 GB to scale the workload to the size of the VM.
The results below show that the total throughput continues to increase as the number of VMs is increased to eight. In total, the eight VMs were able to produce slightly over 6x what a single VM could achieve.
Figure 4. As we scaled out the 16-vCPU VM from 1 to 2, 4, and 8 VMs, we observed the eight VMs were able to produce slightly over 6x what a single VM could achieve.
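Put in efficiency terms, eight VMs delivering slightly over 6x the single-VM throughput works out to at least 75% scale-out efficiency. The 6.0 below is taken as a lower bound from the reported "slightly over 6x", not an exact measured value.

```python
# Scale-out efficiency at 8 VMs, using the reported "slightly over 6x"
# aggregate throughput as a lower bound.
aggregate_speedup = 6.0  # assumed lower bound from the blog's result
vm_count = 8

efficiency = aggregate_speedup / vm_count
print(f"Scale-out efficiency at {vm_count} VMs: {efficiency:.0%}")  # 75%
```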
Optimizing Performance Opportunities with AMD EPYC Rome
As mentioned at the beginning of this post, AMD EPYC Rome processors used in this test are made up of eight CCD modules, each with 8 cores. Within each CCD there are two CCXs that share an L3 processor cache. Each CCD has an associated memory controller. With default settings, all eight CCDs and their memory controllers act as one NUMA node with memory access interleaved across all memory controllers.
There is an option in the BIOS settings to partition the processor into multiple NUMA domains per socket. This partitioning is based on grouping the CCDs and their associated memory controllers. The option is referred to as NUMA per socket or NPS, and the default is 1. This means that there is one NUMA node per socket. The other options are to configure it to 2 or 4. In the case where NPS is set to 4, there are 4 NUMA nodes per socket, with each NUMA node having 2 CCDs and their associated memory.
If VM sizes align with the NPS setting in terms of cores and memory, there is an opportunity for performance gains with some workloads. In the scale-out performance testing above, there were 8 SQL Server VMs, each with 16 vCPUs and 128 GB of RAM. This lines up with an NPS 4 setting: one VM per NUMA node, with the 16 vCPUs matching the 16 cores per NUMA node. Likewise, each VM's 128 GB of RAM matches the 128 GB of RAM in each NPS 4–based NUMA node on our system with 1 TB of RAM.
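The alignment arithmetic is easy to verify. All figures below come from the host and VM configuration described in this post; the calculation simply confirms that an NPS 4 split produces NUMA nodes that exactly match the VM sizing.

```python
# NPS 4 alignment check: a 2-socket, 64-core-per-socket, 1 TB host split
# into 4 NUMA nodes per socket yields 8 nodes that match the VM sizing.
sockets, cores_per_socket, host_ram_gb = 2, 64, 1024
nps = 4  # NUMA nodes per socket (BIOS setting)

numa_nodes = sockets * nps
cores_per_node = (sockets * cores_per_socket) // numa_nodes
ram_per_node_gb = host_ram_gb // numa_nodes

vm_vcpus, vm_ram_gb = 16, 128
assert cores_per_node == vm_vcpus     # 16 cores per node match 16 vCPUs
assert ram_per_node_gb == vm_ram_gb   # 128 GB per node matches VM memory
print(numa_nodes, cores_per_node, ram_per_node_gb)
```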
When tested, this well-aligned configuration of VMs resulted in a 7.8% gain in throughput for the NPS 4 setting over the default of NPS 1. NPS 2 showed only a negligible gain of 1%.
Figure 5. Because of good alignment, the NPS 4 setting gained 7.8% in throughput over the NPS 1 setting, compared to the NPS 2 setting, which showed only a 1% performance improvement.
It is important to note that not all workloads and VMs will see an 8% gain, or any gain at all, just from using the NPS 4 setting. The performance gain in this case comes from the clean alignment of the VMs with NPS 4: each VM essentially has its own NUMA node, with its own set of L3 processor caches and lower memory latency because interleaving occurs only across the local memory of the CCDs being used. Under NPS 1, by contrast, the VMs were likely not confined to their own set of caches and stepped on each other's cache usage. In circumstances where VM sizes are uniform and align nicely with one of these NPS settings, it is possible to obtain some modest performance gains. Please use these settings with caution and test their effect before using them in production.