
Author Archives: James Zubb

AMD 2nd Gen EPYC (Rome) Application Performance on vSphere Series: Part 2 – VMmark

In recently published benchmarks with VMware VMmark, we’ve seen lots of great results on the AMD EPYC 7002 Series (known as 2nd Gen EPYC, or “Rome”) from several of our partners. These results show how well a mixed workload environment with many virtual machines and infrastructure operations like vMotion can perform with new server platforms.

This is the second part of our series covering application performance on VMware vSphere running with AMD 2nd Gen EPYC processors (also see Part 1 of this series). This post focuses on the VMmark benchmark used as an application workload.

We used the following hardware and software in our testbed:

  • AMD 2nd Gen EPYC processors (“Rome”)
  • Dell EMC XtremIO all-flash array
  • VMware vSphere 6.7 U3
  • VMware VMmark 3.1

VMmark

VMmark is a benchmark widely used to study virtualization performance. Many VMware partners have used this benchmark—from the initial release of VMmark 1.0 in 2006, up to the current release of VMmark 3.1 in 2019—to publish official results that you can find on the VMmark 3.x results site. This long history and large set of results can give you an understanding of performance on platforms you’re considering using in your datacenters. AMD EPYC 7002–based systems have done well and, in some cases, have established leadership results. As of this writing, VMmark results on AMD EPYC 7002 have been published by HPE, Dell | EMC, and Lenovo.

If you look through the details of these published results, you can find some interesting information, like disclosed configuration options or settings that might provide performance improvements in your environment. For example, in the details of a Dell | EMC result, you can find that the server BIOS setting NUMA Nodes Per Socket (NPS) was set to 4. And in the HPE submission, you can find that the server BIOS setting Last Level Cache (LLC) as NUMA Node was set to enabled. This is also referred to as CCX as NUMA because each CCX has its own LLC. (CCX is described in the following section.)

We put together a VMmark test setup using AMD EPYC Rome 2-socket servers to evaluate performance of these systems internally. This gave us the opportunity to see how the NPS and CCX as NUMA BIOS settings affected performance for this workload.

In the initial post of this series, we looked at the effect of the NPS BIOS settings on a database-specific workload and described the specifics of those settings. We also provided a brief overview and links to additional related resources. This post builds on that one, so we highly recommend reading the earlier post if you’re not already familiar with the BIOS NPS settings and the details for the AMD EPYC 7002 series.

AMD EPYC CCX as NUMA

Each EPYC processor is made up of up to 8 Core Complex Dies (CCDs) that are connected by AMD’s Infinity Fabric (Figure 1). Inside each CCD there are two Core Complexes (CCXs), each with its own LLC, shown as the 16M L3 cache in Figure 2. These diagrams (the same ones from the previous blog post) are helpful in illustrating these aspects of the chips.

Figure 1. Logical diagram of AMD EPYC Rome processor

Figure 2. Logical diagram of CCX

The CCX as NUMA or LLC as NUMA BIOS setting can be configured on most AMD EPYC 7002 Series processor–based servers. The specific name of the setting will be slightly different for different server vendors. When it’s enabled, the server presents the four cores that share each L3 cache as a NUMA node. In the case of the EPYC 7742 (Rome) processors used in our testing, there are 8 CCDs that each have 2 CCXs, for a total of 16 CCXs per processor. With CCX as NUMA enabled, each processor is presented as 16 NUMA nodes, for 32 NUMA nodes total on our 2-socket server. This is quite different from the default setting of 1 NUMA node per socket, for a total of 2 NUMA nodes for the host.
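As a quick sanity check on that arithmetic, the short Python sketch below computes how many NUMA nodes a 2-socket host presents under the default, NPS, and CCX as NUMA settings. It's illustrative only; the CCD and CCX counts come from the EPYC 7742 description above.

    # Illustrative sketch: NUMA nodes presented by a 2-socket EPYC 7742 host
    # under different BIOS settings (counts taken from the text above).
    SOCKETS = 2
    CCDS_PER_SOCKET = 8       # EPYC 7742
    CCXS_PER_CCD = 2          # each CCX has its own L3 (LLC)

    def numa_nodes(nps=1, ccx_as_numa=False):
        """Return the number of NUMA nodes the host presents."""
        if ccx_as_numa:
            # Every CCX (L3 cache domain) becomes its own NUMA node.
            return SOCKETS * CCDS_PER_SOCKET * CCXS_PER_CCD
        return SOCKETS * nps  # NPS splits each socket into nps nodes

    print(numa_nodes())                  # default: 2 NUMA nodes
    print(numa_nodes(nps=4))             # NPS 4: 8 NUMA nodes
    print(numa_nodes(ccx_as_numa=True))  # CCX as NUMA: 32 NUMA nodes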

This setting has some potential to improve performance by exposing the architecture of the processors as individual NUMA nodes, and one of the VMmark result details indicated that it might have been a factor in improving the performance for the benchmark.

VMmark 3 Testing

For this performance study, we used the following as the basis for all tests:

  • Two 2-socket systems with AMD EPYC 7742 processors and 1 TB of memory
  • 1 Dell EMC XtremIO all-flash array
  • VMware vSphere 6.7 U3 installed on a local NVMe disk
  • VMmark 3.1—we ran all tests with 14 VMmark tiles: that’s the maximum number of tiles that the default configuration could handle without failing quality of service (QoS). For more information about tiles, see “Unique Tile-Based Implementation” on the VMmark product page.

We tested the default case first: CCX as NUMA disabled and NUMA Nodes Per Socket (NPS) set to 1. We then tested the NPS 2 and NPS 4 configurations, both with and without CCX as NUMA enabled. Figure 3 below shows the results with the VMmark score and the average host CPU utilization. VMmark scores are reported relative to the score achieved by the default settings of NPS 1 and CCX as NUMA disabled.
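To make the comparison concrete, here is a minimal sketch of how relative scores like these are derived. The raw scores in the example are placeholders, not our measured results.

    # Minimal sketch: normalizing VMmark scores to a baseline configuration.
    # The raw scores below are placeholders, not measured results.
    raw_scores = {
        "NPS 1 (default)": 14.0,
        "NPS 2":           13.6,
        "NPS 4":           13.3,
    }

    baseline = raw_scores["NPS 1 (default)"]
    RUN_TO_RUN_VARIATION = 0.02   # ~1-2% noise seen with this benchmark

    for config, score in raw_scores.items():
        relative = score / baseline
        within_noise = abs(relative - 1.0) <= RUN_TO_RUN_VARIATION
        print(f"{config}: {relative:.3f} relative score "
              f"({'within' if within_noise else 'outside'} run-to-run variation)")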

From a high level, the VMmark score and the overall performance of the benchmark do not change much across the test cases. Most of the differences fall within the 1-2% run-to-run variation we see with this benchmark and can be considered equivalent scores. The lowest results came from the NPS 2 and NPS 4 settings, which showed 3% and 5% reductions respectively, indicating that those settings don’t give good performance in our environment.

Figure 3. VMmark 3 performance on AMD 2nd Gen EPYC with NPS and CCX as NUMA Settings

We observed one clear trend in the results: CPU utilization was 6% to 8% lower with CCX as NUMA enabled. This shows that there are some small efficiency gains with CCX as NUMA in our test environment. We didn’t see a significant improvement in overall performance from these efficiency gains; however, they might allow an additional tile to run on the cluster (we didn’t test this). While some small gains in efficiency are possible with these settings, we don’t recommend moving away from the defaults for general-purpose, mixed-workload environments. Instead, you should evaluate these advanced settings for specific applications before using them.

Using VMmark 3 as a Performance Analysis Tool

VMmark was originally developed to fill the need for a server consolidation benchmark for a rapidly changing datacenter that was becoming increasingly dominated by virtualization.  The design of VMmark, which is a collection of workloads, gives us the ability to quickly change workload parameters to modify the behavior of the entire benchmark. This allows us to use VMmark to exercise technologies that were not available at the time the benchmark was designed. The VMmark 3 run rules provide for academic or research results publication using a modified version of the benchmark.

VMmark 3 was designed in 2015, when the memory size of a typical high-end 2-socket server was 768 GB. Each VMmark 3 tile was configured to use 156 GB of memory, allowing multiple tiles to be run on each server. A new technology, Intel Optane DC Persistent Memory, now allows up to 3 TB of memory in a 2-socket server, with plans to increase that even further. Testing the performance of this technology with an unmodified version of VMmark 3 wouldn’t be easy, as we’d saturate CPU resources long before we could fully exercise this large amount of memory. Thankfully, the flexible nature of VMmark allows us to modify it to consume significantly more memory with minimal changes in CPU usage.

The two primary VMmark workloads are Weathervane and DVD Store. Each can be modified to consume more memory. Weathervane, as configured for VMmark 3, uses 14 VMs, so while it would be possible to modify this application, doing so would be a time-consuming process. We therefore decided to look at DVD Store, which uses only four VMs. Most of the work is done in the DVD Store database VM, which was our target for modification.

Determining the best configuration for DVD Store to utilize a larger amount of memory required multiple iterations of testing. We modified one test parameter of the DVD Store workload at a time, then examined the results to determine the effect on the VMmark tile. We were looking for larger memory usage with a minimal increase in CPU usage, so that we could exercise the larger memory configuration without requiring additional CPUs. The following table lists the default configuration and the variables we changed:

Parameter             Default       Configurations Tried
VM Memory Size        32 GB         128, 250, and 385 GB
Think Time            1 second      0.5, 0.9, 1.25, and 1.5 seconds
Number of Threads     24            36 and 48
Number of Searches    3             5, 7, and 9
Batch Search Size     3             5, 7, and 9
Database Size         100 GB        300 and 500 GB

The final configuration we settled on, which gave the largest increase in memory usage while keeping CPU usage moderate, was a 250 GB DS3DB VM memory size, 1.5 seconds of think time, and a 300 GB database size. All other parameters were kept at their defaults.
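A rough sketch of how a one-parameter-at-a-time sweep like this can be organized is shown below. The parameter names are descriptive labels for the table above, not the exact option names used by the DVD Store driver, and the tile-run function is left as a placeholder.

    # Rough sketch of the one-parameter-at-a-time sweep described above.
    # Parameter names are descriptive labels, not DVD Store driver options.
    defaults = {
        "vm_memory_gb":      32,
        "think_time_s":      1.0,
        "threads":           24,
        "searches":          3,
        "batch_search_size": 3,
        "database_gb":       100,
    }

    candidates = {
        "vm_memory_gb":      [128, 250, 385],
        "think_time_s":      [0.5, 0.9, 1.25, 1.5],
        "threads":           [36, 48],
        "searches":          [5, 7, 9],
        "batch_search_size": [5, 7, 9],
        "database_gb":       [300, 500],
    }

    def run_vmmark_tile(config):
        """Placeholder: run one tile and return (cpu_util, memory_gb)."""
        raise NotImplementedError

    # Vary one parameter at a time, keeping everything else at the default.
    for name, values in candidates.items():
        for value in values:
            config = dict(defaults, **{name: value})
            # cpu, mem = run_vmmark_tile(config)  # record CPU and memory use

    # Final configuration chosen in the text above:
    chosen = dict(defaults, vm_memory_gb=250, think_time_s=1.5, database_gb=300)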

The following table lists the CPU and memory utilization of the default configuration and the “increased memory” configuration.

Configuration        CPU Utilization     Memory Utilization
Default              26.3%               126 GB
Increased Memory     24.1%               350 GB

We were able to almost triple the memory consumption of a single VMmark tile without increasing its CPU usage. Using this “increased memory” configuration for VMmark, we can now see the effect of the additional memory provided by Intel Optane DC Persistent Memory in Memory Mode.

More detailed information about this configuration and the methodology used to refine it can be found in the Intel Optane DC Persistent Memory whitepaper.  Detailed instructions to configure VMmark 3 to increase the memory footprint can be obtained by emailing the VMmark team at vmmark-info@vmware.com.  We encourage you to experiment with VMmark under academic rules for your own studies and to let us know if you have any questions.

 

Virtual SAN 6.0 Performance with VMware VMmark

Virtual SAN is a storage solution that is fully integrated with VMware vSphere. Virtual SAN leverages flash technology to cache data and improve access times to and from the disks. We used VMware’s VMmark 2.5 benchmark to evaluate the performance of running a variety of tier-1 application workloads together on Virtual SAN 6.0.

VMmark is a multi-host virtualization benchmark that uses varied application workloads and common datacenter operations to model the demands of the datacenter. Each VMmark tile contains a set of virtual machines running diverse application workloads as a unit of load. For more details, see the VMmark 2.5 overview.

 

Testing Methodology

VMmark 2.5 requires two datastores for its Storage vMotion workload, but Virtual SAN creates only a single datastore. We therefore created a Red Hat Enterprise Linux 7 virtual machine on a separate host to act as an iSCSI target, using Linux-IO Target (LIO), and used it as the secondary datastore.

 

Configuration

  • Systems Under Test: 8x Supermicro SuperStorage SSG-2027R-AR24 servers
  • CPUs (per server): 2x Intel Xeon E5-2670 v2 @ 2.50 GHz
  • Memory (per server): 256 GiB
  • Hypervisor: VMware vSphere 5.5 U2 and vSphere 6.0
  • Local Storage (per server): 3x 400GB Intel SSDSC2BA40 SSDs, 12x 900GB 10,000 RPM WD Xe SAS drives
  • Benchmarking Software: VMware VMmark 2.5.2

 

Workload Characteristics

Storage performance is often measured in IOPS, or I/Os per second. Virtual SAN is a storage technology, so it is worthwhile to look at how many IOPS VMmark is generating.  The most disk-intensive workloads within VMmark are DVD Store 2 (also known as DS2), an E-Commerce workload, and the Microsoft Exchange 2007 mail server workload. The graphs below show the I/O profiles for these workloads, which would be identical regardless of storage type.

Figure 1. I/O profile of the DVD Store 2 database VM

The DS2 database virtual machine shows a fairly balanced I/O profile of approximately 55% reads and 45% writes.

Microsoft Exchange, on the other hand, has a very write-intensive load, as shown below.

Figure 2. I/O profile of the Microsoft Exchange mail server VM

Exchange sees nearly 95% writes, so the main benefit the SSDs provide is to serve as a write buffer.

The remaining application workloads have minimal disk I/Os, but do exert CPU and networking loads on the system.

 

Results

VMmark measures both the total throughput of each workload and its response time. The application workloads consist of Exchange, Olio (a Java workload that simulates Web 2.0 applications and measures their performance), and DVD Store 2. All workloads are driven at a fixed throughput level. A set of workloads is considered a tile, and the load is increased by running multiple tiles. With Virtual SAN 6.0, we could run up to 40 tiles with acceptable quality of service (QoS). Let’s look at how each workload performed as the number of tiles increased.

DVD Store

There are 3 webserver frontends per DVD Store tile in VMmark.  Each webserver is loaded with a different profile.  One is a steady-state workload, which runs at a set request rate throughout the test, while the other two are bursty in nature and run a 3-minute and 4-minute load profile every 5 minutes.  DVD Store throughput, measured in orders per minute, varies depending on the load of the server. The throughput will decrease once the server becomes saturated.

Figure 3. DVD Store throughput at increasing tile counts

For this configuration, maximum throughput was achieved at 34 tiles, as shown by the graph above.  As the hosts become saturated, the throughput of each DVD Store tile falls, resulting in a total throughput decrease of 4% at 36 tiles. However, the benchmark still passes QoS at 40 tiles.

Olio and Exchange

Unlike DVD Store, the Olio and Exchange workloads operate at a constant throughput regardless of server load, as shown in the table below:

Workload    Simulated Users    Load per Tile
Exchange    1000               320-330 Sendmail actions per minute
Olio        400                4500-4600 operations per minute

 

At 40 tiles, the VMmark clients send over 12,000 mail messages per minute and the Olio webservers serve roughly 180,000 requests per minute.
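Those aggregate numbers follow directly from the per-tile loads in the table above; a quick back-of-the-envelope check in Python:

    # Back-of-the-envelope check of the aggregate load at 40 tiles,
    # using the per-tile figures from the table above.
    tiles = 40
    exchange_per_tile = (320, 330)    # Sendmail actions per minute
    olio_per_tile     = (4500, 4600)  # Olio operations per minute

    exchange_total = tuple(tiles * x for x in exchange_per_tile)
    olio_total     = tuple(tiles * x for x in olio_per_tile)

    print(exchange_total)  # (12800, 13200)   -> over 12,000 messages/min
    print(olio_total)      # (180000, 184000) -> ~180,000 requests/min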

As the load increases, the response time of Exchange and Olio increases, which makes them a good demonstration of the end-user experience at various load levels. A response time of over 500 milliseconds is considered to be an unacceptable user experience.

Figure 4. Exchange and Olio response times at increasing tile counts

As we saw with DVD Store, performance begins to dramatically change after 34 tiles as the cluster becomes saturated.  This is mostly seen in the Exchange response time.  At 40 tiles, the response time is over 300 milliseconds for the mailserver workload, which is still within the 500 millisecond threshold for a good user experience. Olio has a smaller increase in response time, since it is more processor intensive.  Exchange has a dependence on both CPU and disk performance.

Looking at Virtual SAN performance, we can get a picture of how much I/O is served by the storage at these load levels.  We can see that reads average around 2000 read I/Os per second:

Figure 5. Virtual SAN read I/Os per second

The Read Cache hit rate is 98-99% on all the hosts, so most of these reads are being serviced by the SSDs. Write performance is a bit more varied.

Figure 6. Virtual SAN write I/Os per second

We see a range of 5,000-10,000 write IOPS per node due to the write-intensive Exchange workload. Storage is nowhere close to saturation at these load levels. The magnetic disks are not seeing much more than 100 I/Os per second, while the SSDs are seeing about 3,000 – 6,000 I/Os per second. These disks should be able to handle at least 10x this load level. The real bottleneck is in CPU usage.

Looking at the CPU usage of the cluster, we can see that usage levels out at about 84% at 36 tiles. There is still some headroom, which explains why the Olio response times are still very acceptable.

Figure 7. Cluster CPU utilization at increasing tile counts

As mentioned above, Exchange performance is dependent on both CPU and storage. The additional CPU requirements that Virtual SAN imposes on disk I/O cause Exchange to be more sensitive to server load.

 

Performance Improvements in Virtual SAN 6.0 (vs. Virtual SAN 5.5)

The Virtual SAN 6.0 release incorporates many CPU efficiency improvements, along with other enhancements. This translates into increased performance for VMmark.

VMmark performance increased substantially when we ran the tests with Virtual SAN 6.0 as opposed to Virtual SAN 5.5. The Virtual SAN 5.5 tests failed to pass QoS beyond 30 tiles, meaning that at least one workload failed to meet the application latency requirement. During the Virtual SAN 5.5 32-tile tests, one or more Exchange clients would report a Sendmail latency of over 500 ms, which counts as a QoS failure. Version 6.0 was able to pass QoS at up to 40 tiles.

Figure 8. Supported tile counts for Virtual SAN 5.5 and Virtual SAN 6.0

Not only did Virtual SAN 6.0 support more virtual machines, but the throughput of the workloads increased as well. By comparing the VMmark scores (normalized to the 20-tile Virtual SAN 5.5 result), we can see the performance improvement of Virtual SAN 6.0.

Figure 9. VMmark scores normalized to the 20-tile Virtual SAN 5.5 result

Virtual SAN 6.0 achieved a performance improvement of 24% while supporting 33% more virtual machines.

 

Conclusion

Using VMmark, we are able to run a variety of workloads to simulate applications in a production environment. We were able to demonstrate that Virtual SAN is capable of achieving good performance running heterogeneous, real-world applications. The cluster of 8 hosts presented here shows good performance in VMmark through 40 tiles. That is ~12,000 mail messages per minute sent through Exchange, ~180,000 requests per minute served by the Olio webservers, and over 200,000 orders per minute processed on the DVD Store database. Additionally, we measured substantial performance improvements over Virtual SAN 5.5 when using Virtual SAN 6.0.

 

Reducing Power Consumption in the vSphere 5.5 Datacenter

Today’s virtualized datacenters consist of several servers connected to shared storage, a configuration that has been necessary to enable the flexibility virtualization provides while still allowing for high performance. However, the power consumption of this setup is a major concern, because shared storage can consume as much as 2-3x the power of a single mid-range server. In this blog, we look at the performance impact of replacing shared storage with local disks and PCIe flash storage in a vSphere 5.5 datacenter to save power.

We leverage two innovative vSphere features in this performance test:

  • Unified live migration, first introduced with vSphere 5.1, removes the shared storage requirement for vMotion and allows combining traditional vMotion and Storage vMotion into one operation. This combined live migration copies both the virtual machine’s memory and storage over the network to the destination vSphere host (a simplified API-level sketch of such a migration follows this list). This feature offers administrators significantly more simplicity and flexibility in managing and moving virtual machines across their virtual infrastructures compared to the traditional vMotion and Storage vMotion migration solutions. More information about vMotion can be found in the VMware vSphere 5.1 vMotion Architecture, Performance, and Best Practices white paper.
  • vSphere 5.5 improves server power management by enabling processor C-states, in addition to the previously-used P-states, to improve power savings in the Balanced policy setting. More information about these improvements can be found in the Host Power Management in vSphere 5.5 white paper.
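For readers who script these operations, the fragment below sketches what a combined compute-plus-storage relocation looks like through the vSphere API using pyVmomi. It is a simplified illustration under assumed names (the hostnames, credentials, and inventory object names are placeholders, and error handling is omitted); it is not the mechanism VMmark itself uses.

    # Simplified pyVmomi sketch of a combined compute + storage relocation
    # (shared-nothing migration). All names below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    def find_by_name(content, vimtype, name):
        """Return the first managed object of the given type with this name."""
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vimtype], True)
        try:
            return next(obj for obj in view.view if obj.name == name)
        finally:
            view.Destroy()

    ctx = ssl._create_unverified_context()   # lab use only
    si = SmartConnect(host="vcenter.example.com", user="administrator",
                      pwd="password", sslContext=ctx)
    content = si.RetrieveContent()

    vm        = find_by_name(content, vim.VirtualMachine, "ds2db-1")
    dest_host = find_by_name(content, vim.HostSystem, "esx02.example.com")
    dest_ds   = find_by_name(content, vim.Datastore, "local-flash-02")

    # Naming both a destination host and a destination datastore in one
    # RelocateSpec moves the VM's compute and its storage in a single step.
    spec = vim.vm.RelocateSpec(host=dest_host, datastore=dest_ds)
    task = vm.RelocateVM_Task(spec=spec)
    # ...wait for the task to complete as appropriate...
    Disconnect(si)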

We measure the performance and power savings of these features when replacing shared storage with local disks and PCIe flash storage using a modified version of VMware VMmark 2.5. VMmark is a multi-host virtualization benchmark that uses varied application workloads, as well as common datacenter operations to model the demands of the datacenter. Each VMmark tile contains a set of VMs running diverse application workloads as a unit of load. For more details, see the VMmark 2.5 overview. The benchmark was modified to replace the traditional vMotion workload component with the new shared-nothing, unified live migration.

Testing Methodology

VMmark 2.5 was modified to convert the vMotion workload into a migration without shared storage. All other workloads were unchanged. This allowed a comparison of local, direct attached storage to a traditional Fibre Channel SAN. We measured the power consumption of each configuration using a pair of Yokogawa WT210 power meters, one attached to the servers and the other attached to the external storage.

Configuration

  • Systems Under Test: 2x Dell PowerEdge R710 servers
  • CPUs (per server): 2x Intel Xeon X5670 @ 2.93 GHz
  • Memory (per server): 96 GiB
  • Hypervisor: VMware vSphere 5.5
  • Local Storage (per server): 1x 785GB Fusion-io ioDrive2, 2x 300GB 10K RPM SAS drives in RAID 0
  • SAN: 8Gb Fibre Channel, 30x 200GB SATA Flash drives, 30x 600GB 15K RPM SAS drives
  • Benchmarking software: VMware VMmark 2.5

All I/O-intensive virtual disks were stored on the Fusion-io devices for local storage tests or the SATA flash drives for the SAN tests.  This included the DVD Store database files, the mail server database, and the Olio database.  All remaining virtual machine data was stored on the local SAS drives for the local storage tests and the SAN SAS drives for the SAN tests.

Results
 
VMmark performance using shared-nothing, unified live migration backed by fast local storage showed only minor differences compared to the results with shared storage.  The largest variance was seen in the infrastructure operations, which was expected as the vMotion workload was modified to include a storage migration.  The chart below shows the scores normalized to the 3-tile SAN test results.

Figure: VMmark scores normalized to the 3-tile SAN configuration

When we add the power data to these results and compare performance per kilowatt (PPKW), we see a much different picture. The PPKW score with local storage is much higher than with shared storage due to higher power efficiency.

Figure: Performance per kilowatt (PPKW) for the local storage and SAN configurations
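As a simple illustration of the metric, performance per kilowatt is just the benchmark score divided by the measured power draw in kilowatts. The numbers below are placeholders, not our measurements:

    # Illustrative performance-per-kilowatt (PPKW) calculation.
    # The scores and wattages are placeholders, not measured results.
    def ppkw(score, server_watts, storage_watts=0.0):
        total_kw = (server_watts + storage_watts) / 1000.0
        return score / total_kw

    # Shared-storage configuration: servers plus a SAN drawing >1000 W.
    san_ppkw   = ppkw(score=1.00, server_watts=600, storage_watts=1100)
    # Local-storage configuration: the SAN's power draw disappears.
    local_ppkw = ppkw(score=0.98, server_watts=650)

    print(f"relative PPKW: {local_ppkw / san_ppkw:.2f}x the SAN configuration")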

The reason for this difference is the power consumption of each configuration. The SAN consumes over 1000 watts, which is typical of this storage solution. Replacing that power-hungry component with local storage greatly reduces vSphere datacenter power consumption while maintaining good performance.

power

This SAN should be able to support approximately 25 VMmark tiles (based on the storage capacity of the SSDs), roughly five times the load being supported by the two servers we had available for testing in our lab. However, it should be noted that these servers are two generations old. Current-generation two-socket servers with a comparable power usage can support 2-3x the number of tiles based on published VMmark results. This would imply that the SAN could support at most four current-generation servers. While an additional two servers will further amortize the power cost of the SAN, significant power savings would still be achieved with an all-local storage architecture.
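The back-of-the-envelope reasoning behind that estimate looks roughly like this (the tile counts are the approximations given above):

    # Back-of-the-envelope estimate from the text above.
    san_tile_capacity = 25          # tiles the SAN's SSD capacity can support
    tiles_per_old_server = 5 / 2    # our two lab servers carried ~5 tiles total
    tiles_per_new_server = tiles_per_old_server * 2.5   # current gen: 2-3x

    servers_supported = san_tile_capacity / tiles_per_new_server
    print(round(servers_supported))  # ~4 current-generation servers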

This is not without a cost. Removing shared storage reduces the functionality of the datacenter because a number of vSphere features, such as DRS and traditional vMotion, will no longer function. The reduction in infrastructure performance without shared storage limits this approach to workloads made up of virtual machines with smaller disks, which can be moved between hosts fairly quickly. Virtual machines with large disks would take much longer to move and would be better suited to a shared storage environment.

We have shown that it is possible to significantly reduce datacenter power consumption without significantly reducing performance by replacing shared storage with local storage solutions.  Unified live migration enables the use of local storage without a significant infrastructure performance penalty while maintaining application performance comparable to traditional environments using shared storage for the server workloads represented in VMmark.  The resulting elimination of shared storage creates significant power savings and lower operations costs.