Virtualization Performance

AMD 2nd Gen EPYC (Rome) Application Performance on vSphere Series: Part 4 – STREAM and Java EE

By Jim Hsu, Todd Muirhead, and Qasim Ali

As part of our continuing series on the performance of different workloads on AMD 2nd Generation EPYC (Rome) servers using VMware vSphere 6.7 U3, this article takes a look at memory bandwidth and latency performance using the STREAM benchmark and a Java Enterprise Edition (Java EE) server workload. Please see the previous blogs on database, VMmark, and virtual desktop workloads for lots of great info on what to expect when running those types of workloads in a virtualized environment using VMware vSphere and AMD Rome-based servers.

Memory bandwidth and latency performance are important for many applications and are key factors in overall system performance. This blog presents best practices on measuring and optimizing the memory performance for virtual machines running on the AMD Rome-based platform.

AMD Rome Key Architecture Features

AMD 2nd Generation EPYC (Rome) processors have an architecture and associated NUMA BIOS settings that we explored in the initial three posts in this series. Here, we look at how the NUMA per socket (NPS) setting affects the memory bandwidth performance.

The AMD Rome-based processors have up to 8 core complex dies (CCDs) connected via Infinity Fabric to a central I/O and memory die. Each CCD is made up of 2 core complexes (CCXs), and each CCX contains 4 cores that share an L3 cache. We have included diagrams of this layout in the previous entries in this blog series.

NUMA per socket (NPS) is a BIOS setting that allows the processor to be partitioned into multiple NUMA domains per socket. Each domain is a grouping of CCDs and their associated memory. In the default case of NPS 1, memory is interleaved across all the CCDs as a single NUMA node. With NPS 2, the CCDs and their memory are split into 2 groups. Finally, with NPS 4, the processor is partitioned into 4 nodes with 2 CCDs each and their associated memory.
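As a quick illustration of what these settings look like to software, the sketch below (a hypothetical numa_layout.c using libnuma, assuming a Linux system that sees the NUMA topology directly, such as bare metal or a VM whose virtual NUMA layout mirrors the host) lists the NUMA nodes that are exposed and the memory attached to each. This is an easy way to confirm the effect of the NPS setting.

```c
/* Illustrative sketch: list the NUMA nodes and per-node memory visible to the OS.
 * Build (for example): gcc numa_layout.c -o numa_layout -lnuma
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("NUMA is not available on this system\n");
        return 1;
    }

    printf("Configured NUMA nodes: %d\n", numa_num_configured_nodes());

    /* Walk every node and report its total and free memory. With NPS 1 on a
     * two-socket server you would expect 2 nodes; with NPS 4, 8 nodes. */
    for (int n = 0; n <= numa_max_node(); n++) {
        long long free_bytes = 0;
        long long size_bytes = numa_node_size64(n, &free_bytes);
        if (size_bytes > 0)
            printf("  node %d: %lld MB total, %lld MB free\n",
                   n, size_bytes >> 20, free_bytes >> 20);
    }
    return 0;
}
```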

STREAM

STREAM is a simple synthetic benchmark that measures sustainable memory bandwidth. It measures four different types of operations: Copy, Scale, Sum, and Triad. There is no arithmetic in the Copy test, which usually results in measuring the highest bandwidth. We used only the Copy bandwidth measurements in the following test results. For more details about STREAM, please see its documentation.  The Copy test consists of multiple repetitions of the Copy kernel, and we chose the best result of 20 trials.
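To make this concrete, the Copy kernel is simply a parallel array copy: 8 bytes read and 8 bytes written per element, with bandwidth derived from the bytes moved in the best-timed trial. The sketch below is a simplified illustration (a hypothetical stream_copy_sketch.c, not the official stream.c), using the 800-million-element array size and best-of-20-trials approach used in this study.

```c
/* Simplified sketch of the STREAM Copy measurement (not the official stream.c).
 * Build (for example): gcc -O3 -fopenmp stream_copy_sketch.c -o stream_copy_sketch
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define STREAM_ARRAY_SIZE 800000000LL   /* 800 million elements, as used in this study */
#define NTRIALS 20                      /* best of 20 trials, as described above */

int main(void)
{
    double *a = malloc(STREAM_ARRAY_SIZE * sizeof(double));
    double *c = malloc(STREAM_ARRAY_SIZE * sizeof(double));
    if (!a || !c) { fprintf(stderr, "allocation failed\n"); return 1; }

    /* First-touch initialization in parallel so pages land on each thread's NUMA node. */
    #pragma omp parallel for
    for (long long j = 0; j < STREAM_ARRAY_SIZE; j++) {
        a[j] = 1.0;
        c[j] = 0.0;
    }

    double best = 1e30;
    for (int k = 0; k < NTRIALS; k++) {
        double t = omp_get_wtime();

        /* The Copy kernel: a pure load/store stream with no arithmetic. */
        #pragma omp parallel for
        for (long long j = 0; j < STREAM_ARRAY_SIZE; j++)
            c[j] = a[j];

        t = omp_get_wtime() - t;
        if (t < best) best = t;
    }

    /* 16 bytes moved per element: 8 read (a) plus 8 written (c). */
    double gbytes = 2.0 * sizeof(double) * (double)STREAM_ARRAY_SIZE / 1e9;
    printf("Copy best rate: %.1f GB/s\n", gbytes / best);

    free(a);
    free(c);
    return 0;
}
```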

Memory-Intensive Enterprise Java Workload

To evaluate the impact of memory latency on workload performance, we chose a very memory-intensive Java EE workload that is highly sensitive to latency. The workload consists of multiple Java virtual machines (JVMs) that are divided into groups. There are 8 groups of JVMs in this study.

Testing and Results

For the vSphere host, we used a server with two AMD EPYC 7742 processors and 1TB of RAM (16 64GB DIMMs at 3200 MT/s) running vSphere 6.7 U3, with simultaneous multi-threading (SMT) enabled.

Tests with a single 128-vCPU VM on the 128-core host showed that NPS 4 provided the best memory bandwidth (shown as BW in the following chart) on the STREAM Copy test, at 339 gigabytes per second (GB/s).

Note: The cores per socket setting for the VM was set to 4 to match 4 cores per CCX in the AMD processor. STREAM_ARRAY_SIZE was set to 800 million.

In this chart, the orange bars represent the tests where the number of STREAM threads is set to 32, the same as the number of CCXs in the system; this works out to 2 threads per memory channel. We tested with different numbers of threads, and 1 thread per CCX achieved the highest bandwidth.

The blue bars represent the tests where the number of STREAM threads is set to 128—the same as the virtual CPU count for the VM—which also matches the number of cores in the server.
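For an OpenMP build of STREAM, the thread count typically comes from the environment (for example, OMP_NUM_THREADS=32 for 1 thread per CCX or OMP_NUM_THREADS=128 for 1 per vCPU, with OMP_PROC_BIND and OMP_PLACES steering placement). Purely as an illustration, and assuming a Linux guest with glibc, the small check below (a hypothetical omp_placement.c) prints where each thread lands so you can verify the thread count and spread before a run.

```c
/* Illustrative check of OpenMP thread count and placement before a STREAM run.
 * Build (for example): gcc -fopenmp omp_placement.c -o omp_placement
 * Run   (for example): OMP_NUM_THREADS=32 OMP_PROC_BIND=spread OMP_PLACES=cores ./omp_placement
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each thread reports its index and the logical CPU it is running on. */
        printf("thread %d of %d on CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```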

In both test cases, NPS 4 performs best because its average memory latency is the lowest: with NPS 4, each NUMA node interleaves memory across only its 2 nearest memory channels, compared with 4 channels per node with NPS 2 and all 8 channels with NPS 1.

The other key takeaway is that running more threads than there are CCXs can reduce memory throughput because the threads contend for the same memory resources.

The above tests were done with a single VM of 128 vCPUs. Additional tests showed that we can achieve similar total system bandwidth with multiple smaller VMs. For example, with 8 VMs of 16 vCPUs each, peak system-wide STREAM Copy bandwidth was 346 GB/s under the NPS 4 configuration.

We used a Java EE server workload (shown as Enterprise Java Workload in the following chart) to simulate memory-intensive application scenarios. The chart below shows the performance (throughput ratio) of the Java workload on a 128-vCPU VM, under different NPS configurations. The 8 groups of JVMs in this workload were assigned to the NUMA nodes on a round-robin basis. For example, in NPS 2, there were 4 NUMA nodes in the system, and each node contained 2 groups.

Note: The numactl command was used to assign each JVM group to a specific NUMA node and to force local memory allocation.
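The post doesn't show the exact invocation; numactl's --cpunodebind and --membind options are the usual way to apply this kind of binding from the command line. Purely as an illustration of what that binding does, the sketch below (a hypothetical bind_node.c using libnuma on Linux) applies the same CPU and memory restriction programmatically and then launches a command, such as a JVM, that inherits it.

```c
/* Illustrative launcher: bind to one NUMA node (CPUs and memory), then exec a command.
 * Build (for example): gcc bind_node.c -o bind_node -lnuma
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numa.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <numa-node> <command> [args...]\n", argv[0]);
        return 1;
    }
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = atoi(argv[1]);

    /* Restrict execution to the CPUs of the chosen node... */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return 1;
    }

    /* ...and restrict memory allocation to that node as well. */
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, (unsigned int)node);
    numa_set_membind(nodes);
    numa_free_nodemask(nodes);

    /* Replace this process with the target command; the binding is inherited. */
    execvp(argv[2], &argv[2]);
    perror("execvp");
    return 1;
}
```

For example, running ./bind_node 3 followed by a JVM command line would confine that JVM to NUMA node 3; the tests in this study achieved the same effect directly with numactl.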

The result shows that NPS 4 performance is best, and NPS 1 has the lowest throughput. The workload can take advantage of lower average memory latency and better L3 sharing with the NPS 4 setting.

For the NPS 4 configuration, there were 8 NUMA nodes and 8 JVM groups, so each group had a node to itself. The Java workload performance trend matches that of STREAM, which shows that NPS 4 achieves peak bandwidth and the best latency for such memory-hungry applications.

Important: If the vSphere host runs a mix of memory-hungry and other types of applications, make sure that NPS 4 does not negatively affect the other applications. There is always a balance to strike between interleaving memory across all channels to optimize per-core bandwidth and minimizing the channel interleaving to reach maximum peak system throughput.

Note: You must monitor and tune the NUMA node assignment to ensure optimal performance.

Conclusion

Using the NPS 4 setting provides the best performance on the STREAM benchmark because memory is interleaved across fewer channels per node, which lowers latency. The absolute peak performance is obtained by tuning the STREAM benchmark to run 1 thread per CCX.

For NUMA-aware applications that can take advantage of lower memory latency, NPS 4 is likely to offer better performance.
