By Rahul Garg, Milind Inamdar, and Qasim Ali
In the first two parts of this series, we looked at the performance of a SQL Server database using the DVD Store 3 benchmark and a mixed workload environment with the VMmark benchmark running on the AMD 2nd Gen EPYC processors and VMware vSphere. With the third part of the series, we look at virtual desktop performance on this platform with the VMware View Planner benchmark.
View Planner is VMware’s internally developed benchmark for virtual desktops, or virtual desktop infrastructure (VDI). It simulates users doing common tasks with popular applications like word processors, spreadsheets, and web browsers. View Planner measures how many virtual desktops can be run while maintaining response times within established limits.
Complete details about View Planner, including how to set up and run it, are documented in the VMware View Planner User Guide. We used version 4.3.
Figure 1. VMware View Planner 4.3 User Guide
AMD 2nd Generation EPYC processors have an architecture and associated NUMA BIOS settings that we explored in the first two posts in this series. Here, we look at how the NPS and CCX as NUMA settings (both defined below) affect the performance of a VDI environment with View Planner.
Recap: What are CCDs and CCXs?
In the previous AMD EPYC blogs, we included figures that show the AMD 2nd Generation EPYC processors have up to 8 Core Complex Dies (CCDs) connected via Infinity Fabric. Each CCD is made up of two Core Complexes (CCXs). For reference, we’ve included the following diagrams again.
Figure 2. Logical diagram of AMD 2nd Generation EPYC processor
Two Core Complexes (CCXs) comprise each CCD. A CCX is up to four cores sharing an L3 cache, as shown in this additional logical diagram from AMD for a CCD, where the orange line separates the two CCXs.
Figure 3. Logical diagram of CCD/CCX
What is NUMA Per Socket?
NUMA Per Socket (NPS) is a BIOS setting on the ESXi host that allows the processor to be partitioned into multiple NUMA domains per socket. Each domain is a grouping of CCDs and their associated memory. In the default case of NPS1, memory is interleaved across all the CCDs as a single NUMA node. With NPS2, the CCDs and their memory are split into two groups. Finally, with NPS4, the processor is partitioned into 4 nodes with 2 CCDs each and their associated memory.
What is CCX as NUMA?
CCX as NUMA presents each CCX as a NUMA domain. When enabled, this causes each AMD 2nd Generation EPYC processor used in our tests to have 16 NUMA nodes, for a total of 32 across both sockets.
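The node counts described above can be worked out with a short Python sketch. The topology constants come from the text (2 sockets, 8 CCDs per socket, 2 CCXs per CCD, 4 cores per CCX); the function names are hypothetical, and the sketch assumes that when CCX as NUMA is enabled the node count is driven by the CCX count regardless of the NPS value.

```python
# Topology of the EPYC 7742 hosts as described in the text.
SOCKETS = 2
CCDS_PER_SOCKET = 8
CCXS_PER_CCD = 2
CORES_PER_CCX = 4


def numa_nodes_per_socket(nps: int, ccx_as_numa: bool) -> int:
    """Return the number of NUMA nodes one socket presents to ESXi."""
    if nps not in (1, 2, 4):
        raise ValueError("NPS must be 1, 2, or 4")
    if ccx_as_numa:
        # Assumption: every CCX becomes its own node, whatever NPS is set to.
        return CCDS_PER_SOCKET * CCXS_PER_CCD
    return nps


def describe(nps: int, ccx_as_numa: bool) -> dict:
    """Summarize node count and cores per node for one BIOS configuration."""
    per_socket = numa_nodes_per_socket(nps, ccx_as_numa)
    cores_per_socket = CCDS_PER_SOCKET * CCXS_PER_CCD * CORES_PER_CCX
    return {
        "nps": nps,
        "ccx_as_numa": ccx_as_numa,
        "total_nodes": per_socket * SOCKETS,
        "cores_per_node": cores_per_socket // per_socket,
    }


if __name__ == "__main__":
    for ccx in (False, True):
        for nps in (1, 2, 4):
            print(describe(nps, ccx))
```

With CCX as NUMA enabled, each of the 32 nodes holds only 4 cores, compared with 64 cores per node in the default NPS1 configuration.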
For this performance study, we used the following as the basis for all tests:
- Two 2-socket systems with AMD EPYC 7742 processors, 2TB of memory, and 4.66TB of storage
- Windows 10, version 1909: We installed this as the desktop VM and configured it with 2 vCPUs, 4GB of memory and 50GB of disk space.
- VMware vSphere 6.7 U3
- View Planner 4.3: Legacy Mode enabled.
- Horizon View 7.11: We used Instant Clones for VDI setup. Refer to VMware Horizon 7 Instant-Clone Desktops and RDSH Servers for details.
For all tests, the number of desktop VMs was increased in successive test runs until we hit the quality-of-service (QoS) limit of View Planner.
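The ramp-up procedure can be sketched in Python as a simple loop. This is an illustration only: the post does not give the starting count or step size, and `run_passes_qos` is a hypothetical stand-in for a full View Planner run that reports whether QoS was met.

```python
from typing import Callable


def max_desktops_within_qos(
    run_passes_qos: Callable[[int], bool],
    start: int,
    step: int,
    limit: int,
) -> int:
    """Increase the desktop count until a run misses QoS.

    Returns the largest tested count that still passed, or 0 if none did.
    """
    best = 0
    count = start
    while count <= limit and run_passes_qos(count):
        best = count
        count += step
    return best


if __name__ == "__main__":
    # Hypothetical stand-in: pretend QoS holds up to 450 desktops.
    def fake_run(desktops: int) -> bool:
        return desktops <= 450

    print(max_desktops_within_qos(fake_run, start=100, step=50, limit=1000))
```

The 450-desktop threshold in the usage example is made up so that the loop has something to stop on; it is not a result from these tests.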
We tested NPS1, NPS2, and NPS4, both with and without CCX as NUMA enabled. The default settings of NPS1 and CCX as NUMA disabled were the baseline for the tests, and all results in the comparison are shown relative to that baseline.
Because this ESXi host has a large amount of CPU capacity, with 128 cores across two sockets, it was able to host a large number of Windows 10 desktops. With each desktop VM assigned 4GB of RAM, the total amount of RAM assigned to VMs exceeded the 2TB of physical RAM in the ESXi host. In other words, the system was memory overcommitted.
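The overcommit arithmetic can be checked with a minimal Python sketch. It assumes 2TB means 2,048GB and ignores hypervisor and per-VM overhead; the function names and the desktop count of 640 in the usage example are hypothetical, since the post reports only relative results.

```python
HOST_RAM_GB = 2 * 1024  # 2TB of physical RAM, assumed to be 2,048GB
VM_RAM_GB = 4           # RAM assigned to each Windows 10 desktop VM


def break_even_desktops(host_ram_gb: int = HOST_RAM_GB,
                        vm_ram_gb: int = VM_RAM_GB) -> int:
    """Number of desktops whose assigned RAM exactly fills physical RAM."""
    return host_ram_gb // vm_ram_gb


def overcommit_ratio(desktops: int,
                     host_ram_gb: int = HOST_RAM_GB,
                     vm_ram_gb: int = VM_RAM_GB) -> float:
    """Assigned VM RAM divided by physical RAM; above 1.0 means overcommitted."""
    return desktops * vm_ram_gb / host_ram_gb


if __name__ == "__main__":
    print(break_even_desktops())   # desktops that fill 2TB at 4GB each
    print(overcommit_ratio(640))   # hypothetical desktop count
```

Under these assumptions, any desktop count above 512 puts the host into memory overcommitment, which is where page sharing starts to matter.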
Which configurations support the most desktops within QoS?
The results below show (in green) that NPS1, NPS2, and NPS4 without CCX as NUMA are all within a couple of percent of one another in the number of desktops supported. The results also show (in blue) a decline of about 7% in the number of desktops supported when CCX as NUMA is enabled.
Figure 4. View Planner VM consolidation with AMD 2nd Gen EPYC with NPS and CCX as NUMA settings
We see this performance degradation when CCX as NUMA is enabled for the following reason: View Planner is a VDI workload that benefits from ESXi’s memory page sharing when the environment is memory overcommitted, as it was in these tests. With instant clones, page sharing works by sharing the memory of the running parent VM from which each clone is created. However, if an instant clone runs on a different NUMA node than the parent VM, page sharing may be lost: by default, ESXi shares pages only within a NUMA node because remote memory access is often costly. With CCX as NUMA, 32 NUMA nodes are exposed to ESXi, and page sharing is confined to each of these much smaller nodes. The result is a significant loss of page-sharing savings, which leads to memory ballooning.
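A toy Python model illustrates why more NUMA nodes reduce page-sharing savings. It is not ESXi's algorithm: it simply assumes that every desktop has some identical (shareable) memory and some private memory, that desktops are spread across all nodes, and that one physical copy of the shareable memory is kept per occupied node unless cross-node sharing is allowed. All sizes and counts are hypothetical.

```python
def resident_memory_gb(vms: int, nodes: int, shared_gb: float,
                       private_gb: float, cross_node_sharing: bool) -> float:
    """Estimate physical memory in use after page sharing (toy model)."""
    if vms == 0:
        return 0.0
    if cross_node_sharing:
        shared_copies = 1                # one copy serves the whole host
    else:
        shared_copies = min(vms, nodes)  # one copy per occupied NUMA node
    return vms * private_gb + shared_copies * shared_gb


if __name__ == "__main__":
    # Hypothetical workload: 600 desktops, 1.5GB shareable, 2.5GB private each.
    for nodes in (2, 4, 8, 32):
        used = resident_memory_gb(600, nodes, 1.5, 2.5, cross_node_sharing=False)
        print(f"{nodes:>2} nodes, sharing within node only: {used} GB")
    print("cross-node sharing:",
          resident_memory_gb(600, 32, 1.5, 2.5, cross_node_sharing=True), "GB")
```

The model captures only the direction of the effect. The real loss also depends on where the parent VM sits and on how ESXi places and scans pages, which this sketch does not attempt to represent.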
To confirm the reason for the ~7% performance degradation, we performed internal testing that allowed us to enable inter-NUMA page sharing. With this test configuration, there was no memory ballooning, and we gained the ~7% performance back.
So, what is the answer? The default is best.
The View Planner tests show that the ESXi host’s out-of-the-box settings of NPS1 and CCX as NUMA disabled provide the best performance for VDI workloads on vSphere 6.7 U3 and AMD 2nd Generation EPYC processors.
Commentary about CCX as NUMA
CCX as NUMA exposes each CCX as its own NUMA domain to the hypervisor or operating system. This is a false representation of the NUMA architecture that is provided as an option to aid the hypervisor or OS scheduler in certain scenarios that are currently not highly optimized for the AMD architecture—that is, there are multiple Last Level Caches (LLCs) per NUMA node.
This false representation can skew certain NUMA-related assumptions that hypervisors and operating systems are built upon (for example, memory page sharing, as we described above). We believe that once the hypervisor and OS schedulers are fully optimized, the CCX as NUMA option may no longer be needed.