VMware vCenter 6.0 Performance Improvements and Best Practices

VMware vCenter Server 6.0 brings many performance improvements over previous vCenter versions, including:

  • Extensive improvements in throughput and latency.
  • vCenter Server Appliance (VCSA) parity with vCenter Server on Windows, for both inventory size and performance.

A new white paper, “VMware vCenter Server Performance and Best Practices,” illustrates these performance improvements and discusses important best practices for getting the best performance out of your vCenter Server environment.

This chart, taken from the white paper, shows the improvement in throughput over vCenter Server 5.5 at various inventory sizes:

[Figure: Throughput improvement of vCenter Server 6.0 over vCenter Server 5.5 at various inventory sizes]

Improvements in vCenter Server 6.0 Cluster Performance

VMware vCenter Server 6.0 brings significant improvements over previous vCenter Server versions with respect to cluster size and performance. vCenter Server 6.0 supports up to 64 ESXi hosts and 8,000 VMs in a single cluster. A new white paper, "VMware vCenter Server 6.0 Cluster Performance," describes the improvements along several dimensions:

  • VMware vCenter 6.0 can support more hosts and more VMs in a cluster.
  • VMware vCenter 6.0 can support higher operational throughput in a cluster.
  • VMware vCenter 6.0 can support higher operational throughput with ESXi 6.0 hosts.
  • VMware vCenter 6.0 VCSA can support higher operational throughput in a cluster, compared to vCenter Server on Windows.

Here is a chart from the white paper summarizing one of the key improvements, operational throughput in a cluster:

[Figure: vCenter Server 6.0 cluster performance improvement]

Project Capstone Shows Monster VM Performance

Project Capstone was put together a few weeks before VMworld 2015 with the goal of showing what is possible with monster VMs today. VMware worked with HP and IBM to put together an impressive setup using vSphere 6.0, an HP Superdome X, and an IBM FlashSystem array that was able to support running four 120 vCPU VMs simultaneously. Putting these massive virtual machines under load, we found that performance was excellent, with great scalability and a high amount of throughput.

vSphere 6 was launched earlier this year and includes support for virtual machines with up to 128 virtual CPUs, a big increase from the 64 vCPUs supported in vSphere 5.5. "Monster" virtual machines have a new upper limit, which allows customers to virtualize even the largest, most CPU-hungry systems.

The HP Superdome X used for the testing is an impressive system. It has 16 Intel Xeon E7-2890 v2 2.8 GHz processors. Each processor has 15 cores and 30 logical threads when Hyper-Threading is enabled, for a total of 240 cores / 480 threads.

An IBM FlashSystem array with 20TB of superfast, low-latency storage was used for the Project Capstone configuration. It provided extremely low latency throughout all testing and performed so well that storage was never a concern or an issue. The FlashSystem was also extremely easy to set up and use: within 24 hours of its arrival in the lab, we were actively running four 120 vCPU VMs with sub-millisecond latency.

Large Oracle 12c database virtual machines running on Red Hat Enterprise Linux 6.5 were created and configured with 256GB of RAM, pvSCSI virtual disk adapters, and vmxnet3 virtual NICs. The number of VMs and the number of vCPUs per VM were varied across the tests.
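The exact Capstone VM configuration files were not published, but the settings above map to a handful of standard .vmx entries. A minimal, illustrative excerpt for the 120 vCPU case might look like this (key names are the standard vSphere options; values reflect the configuration described above):

  numvcpus = "120"
  memSize = "262144"
  scsi0.virtualDev = "pvscsi"
  ethernet0.virtualDev = "vmxnet3"

Here memSize is given in megabytes, so 262144 corresponds to the 256GB of RAM used in the tests, and the pvSCSI and vmxnet3 entries select the paravirtualized storage and network adapters.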

The workload used for the testing was DVD Store 3 (github.com/dvdstore/ds3).  DVD Store simulates a real online store with customers logging onto the site, browsing products and product reviews, rating products, and ultimately purchasing those products.  The benchmark is measured in Orders Per Minute, with each order representing a complete login, browsing, and purchasing process that includes many individual SQL operations against the database.

This large system with 240 cores / 480 threads, an extremely fast and large storage system, and vSphere 6 showed that excellent performance and scalability are possible even with many monster VMs. Each configuration was first stressed by increasing the DVD Store workload until maximum throughput was achieved for a single virtual machine; in all cases this was found to be at near CPU saturation. The number of VMs was then increased so that the entire system was fully committed. A white paper to be published soon will have the full set of test results, but here we show the results for four 120 vCPU VMs and sixteen 30 vCPU VMs.

[Figure: Performance of four 120 vCPU VMs]

[Figure: Performance of sixteen 30 vCPU VMs]

In both cases, the performance of the system when fully loaded with either 4 or 16 virtual machines reaches about 90% of perfect linear scalability compared to the performance of a single virtual machine.

In order to drive CPU usage to such high levels, all disk I/O must be very fast so that the system is not left waiting for a response. The IBM FlashSystem provided 0.3 ms average disk latency across all tests. Disk I/O was minimized for these tests, to maximize CPU usage and throughput, by configuring the database cache size to be equal to the database size. Total disk I/O per second (IOPS) peaked at about 50,000 and averaged 20,000 while maintaining extremely low latency during the tests.

These test results show that it is possible to use vSphere 6 to successfully virtualize even the largest systems with excellent performance.

 

Performance Best Practices for vSphere 6.0 is Available

We are pleased to announce the availability of Performance Best Practices for VMware vSphere 6.0. This is a book designed to help system administrators obtain the best performance from vSphere 6.0 deployments.

The book addresses many of the new features in vSphere 6.0 from a performance perspective. These include:

  • A new version of vSphere Network I/O Control
  • A new host-wide performance tuning feature
  • A new version of VMware Fault Tolerance (now supporting multi-vCPU virtual machines)
  • The new vSphere Content Library feature

We’ve also updated and expanded on many of the topics in the book. These include:

  • VMware vStorage APIs for Array Integration (VAAI) features
  • Network hardware considerations
  • Changes in ESXi host power management
  • Changes in ESXi transparent memory sharing
  • Using Receive Side Scaling (RSS) in virtual machines
  • Virtual NUMA (vNUMA) configuration
  • Network performance in guest operating systems
  • vSphere Web Client performance
  • VMware vMotion and Storage vMotion performance
  • VMware Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM) performance

The book can be found at http://www.vmware.com/files/pdf/techpaper/VMware-PerfBest-Practices-vSphere6-0.pdf.

 

 

 

Network Improvements in vSphere 6 Boost Performance for 40G NICs

In vSphere 5.5, a Linux-based driver was added to support 40GbE Mellanox adapters on ESXi. vSphere 6.0 adds a native driver and Dynamic NetQueue for Mellanox, and these features significantly improve network performance. In addition to the device driver changes, vSphere 6.0 includes improvements to the vmxnet3 virtual NIC (vNIC) that allow a single vNIC to achieve line-rate performance with 40GbE physical NICs. Another performance feature introduced in 6.0 for high-bandwidth NICs is NUMA Aware I/O, which improves performance by co-locating highly network-intensive workloads with the NUMA node of the device. In this blog, we highlight these features and the benefits they deliver.

Test Configuration

We used two identical Dell PowerEdge R720 servers for our tests, each with Intel Xeon E5-2667 processors @ 2.90GHz and 64GB of memory, equipped with both a Mellanox Technologies MT27500 Family (ConnectX-3) 40GbE NIC and an Intel 82599EB 10-Gigabit SFI/SFP+ NIC.

In the single-VM test, we used one RHEL 6 VM with 4 vCPUs on each ESXi host, running 4 netperf TCP streams, and measured the cumulative throughput for the test.

For the multi-VM test, we configured multiple RHEL VMs with 1 vCPU each and used an identical number of VMs on the receiver side. Each VM used 4 sessions of netperf for driving traffic, and we measured the cumulative throughput across the VMs.
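Each netperf session was a plain TCP stream test against a netserver instance running in the paired receiver VM. A representative invocation is sketched below; the receiver IP address and the 60-second duration are placeholders for illustration, not the exact values used in our runs:

  # On each receiver VM: start the netperf server daemon
  netserver

  # On each sender VM: run 4 parallel TCP streams to the paired receiver
  for i in 1 2 3 4; do
      netperf -H 192.168.1.10 -t TCP_STREAM -l 60 &
  done
  wait

The cumulative throughput is the sum of the per-stream results reported by the netperf instances.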

Single vNIC Performance Improvements

In order to achieve line-rate performance with vmxnet3, changes were made to the virtual NIC adapter in vSphere 6.0 so that multiple hardware queues can push data to a vNIC simultaneously. This allows vmxnet3 to use multiple hardware queues from the physical NIC more effectively, which not only increases the throughput a single vNIC can achieve but also improves overall CPU efficiency.

As we can see from Figure 1 below, 1 VM with 1 vNIC on vSphere 6.0 can achieve more than 35Gbps of throughput, compared to the 20Gbps achieved in vSphere 5.5 (blue bars). The CPU used to receive 1Gbps of traffic, on the other hand, is reduced by 50% (red line).

Figure 1: 1 VM vmxnet3 receive throughput

By default, a single vNIC receives packets from a single hardware queue. To achieve higher throughput, the vNIC has to request more queues. This can be done by setting ethernetX.pnicFeatures = "4" in the .vmx file. This option also requires RSS mode to be turned on for the physical NIC. For Mellanox adapters, the RSS feature can be turned on by reloading the driver with num_rings_per_rss_queue=4.
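Putting these two settings together, the configuration looks roughly as follows. This is a sketch: the module name nmlx4_en assumes the native ConnectX-3 driver in vSphere 6.0, so check which driver your host is actually using, and note that module parameter changes take effect only after the driver is reloaded or the host is rebooted.

  # In the VM's .vmx file (ethernet0 assumed to be the vmxnet3 vNIC):
  ethernet0.pnicFeatures = "4"

  # On the ESXi host: enable RSS in the Mellanox driver (assumed module name)
  esxcli system module parameters set -m nmlx4_en -p "num_rings_per_rss_queue=4"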

CPU Cost Improvements for Mellanox 40GbE NIC

In addition to the scalability improvements for the vmxnet3 adapter, vSphere 6.0 features an improved version of the Mellanox 40GbE NIC driver. The updated driver uses native vSphere 6.0 APIs and, as a result, performs better than the earlier Linux-based driver. The native APIs remove the extra CPU overhead of data structure conversion that was present in the Linux-based driver. The driver also supports new features like Dynamic NetQueue, which improves CPU utilization even further. Dynamic NetQueue in vSphere 6.0 intelligently chooses the optimal number of active hardware queues according to the network workload and per-NUMA-node CPU utilization.

Figure 2: Multi-VM CPU usage for 40G traffic

As seen in Figure 2 above, the new driver can improve CPU efficiency by up to 22%. For all of these test cases, the Mellanox NIC achieved line-rate throughput on both vSphere 6.0 and vSphere 5.5. Please note that for the multi-VM tests, we were using 1-vCPU VMs and vmxnet3 was using a single queue. The RSS feature on the Mellanox adapter was also turned off.

NUMA Aware I/O

In order to get the best performance out of 40GbE NICs, it is advisable to place throughput-intensive workloads on the NUMA node to which the adapter is attached. vSphere 6.0 features a new system-wide configuration option that tries to do this automatically. When enabled, it packs all kernel networking threads on the NUMA node to which the device is connected, and the scheduler then tries to place the VMs that use these networking threads the most on the same NUMA node. By default, the option is turned off, because it may cause uneven workload distribution across NUMA nodes, especially when all NICs are connected to the same NUMA node.

Figure 3: NUMA I/O benefit

As seen in Figure 3 above, NUMA I/O can result in about 20% lower CPU consumption and about 20% higher throughput with a 1-vCPU VM on 40GbE NICs. There is no throughput improvement for Intel NICs because we achieve line rate irrespective of where the workloads are placed; we do, however, see an increase in CPU efficiency of about 7%.

To enable this option, set the value of Net.NetNetqNumaIOCpuPinThreshold in the Advanced System Settings tab for the host. The value is configurable and can vary between 0 and 200. For example, if you set the value to 100, NUMA I/O is used as long as the networking load is less than 100% (that is, the networking threads do not use more than one core). Once the load increases beyond 100%, vSphere 6.0 follows its default scheduling behavior and schedules VMs and networking threads across different NUMA nodes.
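The same setting can also be applied from the ESXi command line. The following sketch is the equivalent of setting the threshold to 100 in the Advanced System Settings tab (the /Net/NetNetqNumaIOCpuPinThreshold path simply mirrors the Net.NetNetqNumaIOCpuPinThreshold option named above; verify it on your host before changing it):

  # Set the NUMA I/O pinning threshold to 100 (networking load of up to 1 core)
  esxcli system settings advanced set -o /Net/NetNetqNumaIOCpuPinThreshold -i 100

  # Confirm the current value
  esxcli system settings advanced list -o /Net/NetNetqNumaIOCpuPinThreshold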

Conclusion

vSphere 6.0 includes some great new improvements in network performance. In this blog, we show:

  • The vmxnet3 vNIC can now achieve near line-rate performance with a 40GbE NIC.
  • Significant performance improvements were made to the Mellanox driver, which is now up to 25% more efficient.
  • vSphere also features a new option to turn on NUMA I/O that could improve application performance by up to 15%.

 

VMware pushes the envelope with vSphere 6.0 vMotion

vMotion in VMware vSphere 6.0 delivers breakthrough new capabilities that offer customers a new level of flexibility and performance when moving virtual machines across their virtual infrastructures. vSphere 6.0 vMotion includes features (long-distance migration, cross-vCenter migration, and routed vMotion networks) that enable seamless migrations across current management and distance boundaries. For the first time ever, VMs can be migrated across vCenter Servers separated by cross-continental distances with minimal performance impact. vMotion is fully integrated with all the latest vSphere 6 software-defined data center technologies, including Virtual SAN (VSAN) and Virtual Volumes (VVOL). Additionally, the newly re-architected vMotion in vSphere 6.0 enables extremely fast migrations at speeds exceeding 60 gigabits per second.

In this blog, we present the latest vSphere 6.0 vMotion features as well as performance results. We first evaluate vMotion performance across two geographically dispersed data centers connected by a network with 100ms round-trip time (RTT) latency. Following that, we demonstrate vMotion performance when migrating an extremely memory-hungry "monster" VM.

Long Distance vMotion

vSphere 6.0 introduces a Long-distance vMotion feature that increases the round-trip latency limit for vMotion networks from 10 milliseconds to 150 milliseconds. Long-distance mobility offers a variety of compelling new use cases, including whole data center upgrades, disaster avoidance, government-mandated disaster preparedness testing, and large-scale distributed resource management, to name a few. Below, we examine vMotion performance under varying network latencies up to 100ms.

Test Configuration

We set up a vSphere 6.0 test environment with the following specifications:

Hardware

  • Two HP ProLiant DL580 G7 servers (32-core Intel Xeon E7-8837 @ 2.67 GHz, 256 GB memory)
  • Storage: Two EMC VNX 5500 arrays, FC connectivity, VMFS 5 volume on a 15-disk RAID-5 LUN
  • Networking: Intel 10GbE 82599 NICs
  • Latency Injector: Maxwell-10GbE appliance to inject latency in vMotion network

Software

  • VM config: 4 VCPUs, 8GB mem, 2 vmdks (30GB system disk, 20GB database disk)
  • Guest OS/Application: Windows Server 2012 / MS SQL Server 2012
  • Benchmark: DVDStore (DS2) using a database size of 12GB with 12,000,000 customers, 3 drivers without think-time

[Figures 1 and 2: Logical deployment of the long-distance vMotion test bed, including the Maxwell-10GbE latency injection appliance]

Figure 1 illustrates the logical deployment of the test bed used for long-distance vMotion testing. Long-distance vMotion is supported both with no shared storage infrastructure and with shared storage solutions such as EMC VPLEX Geo, which enables shared data access across long distances. Our test bed did not use shared storage, so the migration moved the entire state of the VM, including its memory, storage, and CPU/device state. As shown in Figure 2, our test configuration deployed a Maxwell-10GbE network appliance to inject latency into the vMotion network.

Measuring vMotion Performance

The following metrics were used to understand the performance implications of vMotion:

  • Migration Time: Total time taken for migration to complete
  • Switch-over Time: Time during which the VM is quiesced to enable switchover from source to the destination host
  • Guest Penalty: Performance impact on the applications running inside the VM during and after the migration

Test Results

We investigated the impact of long-distance vMotion on Microsoft SQL Server online transaction processing (OLTP) performance using the open-source DVD Store workload. The test scenario used a Windows Server 2012 VM configured with 4 VCPUs, 8GB of memory, and a SQL Server database size of 12GB. Figure 3 shows the migration time and VM switch-over time when migrating an active SQL Server VM at different network round-trip latencies. In all the test scenarios, we used a load of 3 DS2 users with no think time, which generated substantial load on the VM. The migration was initiated during the steady-state period of the benchmark, when the CPU utilization (esxtop %USED counter) of the VM was close to 120% and the average read and write IOPS were about 200 and 150, respectively.
[Figure 3: Migration time and switch-over time at different network round-trip latencies]

Figure 3 shows that the impact of round-trip latency was minimal on both the duration of the migration and the switch-over time, thanks to the latency-aware optimizations in vSphere 6.0 vMotion. The difference in migration time among the test scenarios was in the noise range (<5%). The switch-over time increased marginally, from about 0.5 seconds in the 5ms scenario to 0.78 seconds in the 100ms scenario.

[Figure 4: SQL Server orders per second before, during, and after vMotion on a 100ms round-trip latency network]

Figure 4 plots the performance of the SQL Server virtual machine, in orders processed per second over time, before, during, and after vMotion on a 100ms round-trip latency network. In our tests, the DVD Store benchmark driver was configured to report performance data at a fine granularity of 1 second (default: 10 seconds). As shown in the figure, the impact on SQL Server throughput was minimal during vMotion. The only noticeable dip in performance was during the switch-over phase (0.78 seconds) from the source to the destination host. It took less than 5 seconds for SQL Server to return to its normal level of performance.

Faster migration

Why are we interested in extreme performance? Today's datacenters feature modern servers with many processing cores (up to 80), terabytes of memory, and high network bandwidth (10 and 40GbE NICs). VMware supports larger "monster" virtual machines that can scale up to 128 virtual CPUs and 4TB of RAM. Using higher network bandwidth to complete migrations of these monster VMs faster can enable you to implement high levels of mobility in private cloud deployments. Reducing the time to move a virtual machine also reduces the total network and CPU overhead of the migration.

Test Config

  • Two Dell PowerEdge R920 servers (60-core Intel Xeon E7-4890 v2 @ 2.80GHz, 1TB memory)
  • Networking: Intel 10GbE 82599 NICs, Mellanox 40GbE MT27520 NIC
  • VM config: 12 VCPUs, 500GB mem
  • Guest OS: Red Hat Enterprise Linux Server 6.3

We configured each vSphere host with four Intel 10GbE ports and a single Mellanox 40GbE port, for a total of 80Gb/s of network connectivity between the two vSphere hosts. Each vSphere host was configured with five vSwitches: four vSwitches each had one 10GbE uplink port, and the fifth vSwitch had the 40GbE uplink port. The MTU of the NICs was set to the default of 1500 bytes. We created one VMkernel adapter on each of the four vSwitches with a 10GbE uplink and four VMkernel adapters on the vSwitch with the 40GbE uplink. All eight VMkernel adapters were configured on the same subnet. We also enabled each VMkernel adapter for vMotion, which allowed vMotion traffic to use the full 80Gb/s of network connectivity.
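For reference, a VMkernel adapter can be created and enabled for vMotion from the ESXi command line as sketched below. The interface name, port group name, and IP address are placeholders for illustration; the same steps can be performed in the vSphere Web Client, and the commands are repeated for each adapter.

  # Create a VMkernel adapter on an existing port group
  esxcli network ip interface add -i vmk1 -p vMotion-PG-1

  # Give it a static IPv4 address on the vMotion subnet
  esxcli network ip interface ipv4 set -i vmk1 -t static -I 192.168.10.11 -N 255.255.255.0

  # Tag the adapter for vMotion traffic
  esxcli network ip interface tag add -i vmk1 -t VMotion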

Methodology

To demonstrate extreme vMotion throughput performance, we simulated a very heavy memory usage footprint in the virtual machine. The memory-intensive program allocated 300GB of memory inside the guest and touched a random byte in each memory page in an infinite loop. We migrated this virtual machine between the two vSphere hosts under different test scenarios: vMotion over a 10Gb/s network, a 20Gb/s network, a 40Gb/s network, and an 80Gb/s network. We used esxtop to monitor network throughput and CPU utilization on the source and destination hosts.

Test Results

[Figure 5: Peak vMotion network bandwidth in vSphere 5.5 and vSphere 6.0 under different network deployment scenarios]

Figure 5 compares the peak network bandwidth observed in vSphere 5.5 and vSphere 6.0 under different network deployment scenarios. Let us first consider vSphere 5.5 vMotion throughput. Figure 5 shows that vSphere 5.5 vMotion reaches line rate in both the 10Gb/s and 20Gb/s test scenarios. When we increased the available vMotion network bandwidth beyond 20Gb/s, peak vMotion usage was limited to 18Gb/s in vSphere 5.5. This is because in vSphere 5.5, each vMotion is assigned two helper threads by default, and these threads do the bulk of the vMotion processing. Since the vMotion helper threads were CPU saturated, there was no performance gain from adding network bandwidth. When we increased the number of vMotion helper threads from 2 to 4 in the 40Gb/s test scenario, thereby removing the CPU bottleneck, the peak network bandwidth usage of vMotion in vSphere 5.5 increased to 32Gb/s. Tuning the helper threads beyond four hurt vMotion performance in the 80Gb/s test scenario, because vSphere 5.5 vMotion has locking issues that limit the gains from adding more helper threads. These are per-VM locks that protect the VM's memory.

The newly re-architected vMotion in vSphere 6.0 not only removes these lock contention issues but also obviates the need for any tuning. During the initial setup phase, vMotion dynamically creates the appropriate number of TCP/IP stream channels between the source and destination hosts based on the configured network ports and their bandwidth. It then instantiates one vMotion helper thread per stream channel, removing the need for manual tuning. Figure 5 shows that vSphere 6.0 vMotion reaches line rate in the 10Gb/s, 20Gb/s, and 40Gb/s scenarios, while utilizing a little over 64Gb/s of network throughput in the 80Gb/s scenario. This is more than a 3.5x improvement in performance compared to vSphere 5.5.

[Figure 6: vMotion network throughput and CPU utilization in the vSphere 6.0 80Gb/s test scenario]
Figure 6 shows the network throughput and CPU utilization data for the vSphere 6.0 80Gb/s test scenario. During vMotion, the memory of the VM is copied from the source host to the destination host in an iterative fashion. In the first iteration, vMotion bandwidth usage is throttled by memory allocation on the destination host; the peak vMotion network bandwidth usage is about 28Gb/s during this phase. Subsequent iterations copy only the memory pages that were modified during the previous iteration. The number of pages transferred in these iterations is determined by how actively the guest accesses and modifies its memory pages. The more modified pages there are, the longer it takes to transfer all pages to the destination server, but on the flip side, this allows vMotion's advanced performance optimizations to kick in and fully leverage the additional network and compute resources. That is evident in the third pre-copy iteration, when the peak measured bandwidth was about 64Gb/s and the peak CPU utilization (esxtop 'PCPU Util%' counter) on the destination host was about 40%.

Conclusions

The main results of this performance study are the following:

  • The dramatic increase in supported round-trip time (from 10ms to 150ms) offered by long-distance vMotion now makes it possible to migrate workloads non-disruptively over long distances, such as from New York to London.
  • Remarkable performance enhancements in vSphere 6.0 improve monster VM migration performance (up to 3.5x over vSphere 5.5) in large-scale private cloud deployments.

 

 

 

VMware Horizon 6 and Hardware Accelerated 3D Graphics

A recently published paper presents best practices and performance data for Horizon 6's support for hardware-accelerated 3D graphics. A while back, we published a paper on the same subject for VMware Horizon View 5.2. The new paper updates all the graph data for Horizon 6 and shows the improvements in performance. The View Planner 3.5 benchmark is used to simulate four workloads:

  • A light 3D workload, which simulates an office worker using such applications as Office, Acrobat, and Internet Explorer.
  • A light CAD workload, which adds the SOLIDWORKS CAD viewer to the light 3D workload. In this test, the CAD viewer is used to run two models: a sea scooter and a cross-section of a shaft.
  • The sea scooter and shaft models are run in the SOLIDWORKS CAD viewer without any other applications running on the test system.
  • A Solid Edge CAD viewer is run on its own using a different model of a 3-to-1 reducer.

Read the results for VMware Horizon 6 and Hardware Accelerated 3D Graphics.

Virtual SAN and SAP IQ – a Perfect Match

A performance study shows that VMware vSphere 5.5 with Virtual SAN as the storage backend provides an excellent platform for virtualized deployments of SAP IQ Multiplex Servers.

We created four virtual machines with the RHEL 6.3 operating system, and these virtual machines made up the SAP IQ Multiplex Server, which used Virtual SAN as its storage backend. In order to measure performance, we looked at the distributed query processing (DQP) modes of SAP IQ. In DQP, work is performed by threads running on both leader and worker nodes, and intermediate results are transmitted between these nodes through a shared disk space, or over an inter-node network. In the paper, we refer to these modes as storage-transfer and network-transfer.

In a test consisting of concurrent streams of queries designed to emulate a multi-user scenario, we found that the read-heavy I/O profile of this workload takes full advantage of Virtual SAN's flash acceleration layer. Data read from the magnetic disks in each disk group is cached in the SSD in that disk group. Since 70% of SSD capacity is reserved for the read cache, a significant amount of data is quickly placed in very low latency storage. Once the cache is warmed up, I/O requests are served from the read cache, leading to fast query response times. Add to this SAP IQ's ability to use the network to transfer intermediate results, and we get an additional bump in throughput, since we no longer have the overhead of writing intermediate shared results to disk.

Read more about Distributed Query Processing in SAP IQ on VMware vSphere and Virtual SAN.

Microsoft Exchange Server Shows Great Performance on VMware Virtual SAN

Email servers are a business-critical component of IT systems, and Exchange Server is one of the most ubiquitous of them. As such, we wanted to see how we could leverage Virtual SAN to serve the storage needs of this application. We ran some tests to see how Exchange Server would perform on Virtual SAN. We used five Virtual SAN servers, and each server hosted two virtual machines with the Exchange Server Mailbox and Hub roles. The first host had an additional virtual machine for the Active Directory server role. A client virtual machine on a separate host ran the load generator.

Benchmarks are an important part of performance testing. We used Exchange Load Generator to simulate Exchange Server users sending and receiving email, and we measured the average and 95th-percentile Sendmail latency for three separate loads: 12,000 users, 16,000 users, and 20,000 users. This shows how Virtual SAN can accommodate the storage needs of additional users and scale out flexibly.

The results are shown in the following figure. The industry-standard measure of good latency is anything below 500ms. As shown here, the Sendmail latency is well below 500ms for both the average and 95th-percentile.

[Figure: Exchange Server Sendmail latency (average and 95th percentile) on Virtual SAN at 12,000, 16,000, and 20,000 users]

 

For more information, read the paper here.

First Certified SAP BW-EML Benchmark on Virtual HANA

The first certified SAP Business Warehouse-Enhanced Mixed Workload (BW-EML) standard application benchmark based on a virtual HANA database was recently published by HP.  We worked with HP to configure and run this benchmark using a virtual HANA database running on vSphere 5.5 in a monster VM of 64 vCPUs and almost 1TB of RAM.  The test was run with a total of 2 billion records and achieved a throughput of 111,850 ad-hoc navigation steps per hour.

The same hardware configuration was used by HP to publish a native-only benchmark with the same number of records. In that test, the result was 126,980 ad-hoc navigation steps per hour, which means the virtual HANA result came within about 12% of the native result.

[Figure: Certified SAP BW-EML benchmark results, virtual HANA vs. native HANA]

Although the hardware setup was the same, this comparison between native and virtual performance has one wrinkle that gave the native system a slight advantage, estimated to be about 5%.

The estimated 5% advantage for the native system comes from the difference between cores and threads and from the maximum number of vCPUs. In the native test, the BW-EML workload was able to exercise all 120 hardware threads of the physical 60-core server. The number of threads is twice the number of physical cores because these processors use Intel Hyper-Threading technology.

In vSphere 5.5 (the current version), the maximum number of vCPUs that can be used in a single VM is 64. Each vCPU is mapped to a hardware thread when scheduled to run. This limits the number of hardware threads that a single VM can use to 64, which means that for this test only slightly more than half of the 120 hardware threads could be used for the HANA virtual machine. As a result, the virtual machine could not directly benefit from Hyper-Threading, but it was able to use all 60 cores.

The benefit of Hyper-Threading can be as much as 20% to 30% for some applications, but in the case of the BW-EML benchmark, it is estimated to be about 5%.  This estimate was found by running the native BW-EML benchmark system with and without Hyper-Threading enabled.  Because the virtual machine was not able to use the Hyper-Threads, it is estimated that the native system had a 5% advantage due to its ability to use all 120 threads of the physical server.

In theory, the advantage for the native system could be reduced either by creating a bigger virtual machine or by running the native system without Hyper-Threading. If this were done, the gap between native and virtual should shrink by about 5%, to single digits (approximately 7%).

Additional details about the certified SAP BW-EML benchmark configurations used in the tests: SAP HANA 1.0 ran on an HP DL580 Gen8 with 4 Intel Xeon E7-4880 v2 processors at 2.5 GHz (15 cores / 30 threads each, for 60 cores / 120 threads total) and 1TB of main memory. The application servers ran SAP NetWeaver 7.30 on an HP BL680 G7 with 4 Intel Xeon E7-4870 processors at 2.4 GHz (10 cores / 20 threads each, for 40 cores / 80 threads total) and 1TB of main memory. The OS used for all servers was SUSE Linux Enterprise Server 11 SP2. The certification number for the native test is 2014009, and the certification number for the virtual test is 2014021.