
New White Paper: Best Practices for Optimizing Big Data Performance on vSphere 6

A new white paper is available showing how to best deploy and configure vSphere for Big Data applications such as Hadoop and Spark. Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark.

The best practices were tested on a Dell 12-server cluster, with Hadoop installed on vSphere as well as on bare metal. Workloads for both Hadoop (TeraSort and TestDFSIO) and Spark (Support Vector Machines and Logistic Regression) were run on the cluster. The virtualized cluster outperformed the bare metal cluster by 5-10% for all MapReduce and Spark workloads with the exception of one Spark workload, which ran at parity. All workloads showed excellent scaling from 5 to 10 worker servers and from smaller to larger dataset sizes.

Here are the results for the TeraSort suite:

TeraSort Suite Performance

And for Spark Support Vector Machines:

Spark Support Vector Machine Performance

Here are the best practices cited in this paper:

  • Reserve about 5-6% of total server memory for ESXi; use the remainder for the virtual machines.
  • Create 1 or more virtual machines per NUMA node.
  • Limit the number of disks per DataNode to maximize the utilization of each disk – 4 to 6 is a good starting point.
  • Use eager-zeroed thick VMDKs along with the ext4 filesystem inside the guest.
  • Use the VMware Paravirtual SCSI (pvscsi) adapter for disk controllers; use all 4 virtual SCSI controllers available in vSphere 6.0.
  • Use the vmxnet3 network driver; configure virtual switches with MTU=9000 for jumbo frames.
  • Configure the guest operating system for Hadoop performance including enabling jumbo IP frames, reducing swappiness, and disabling transparent hugepage compaction.
  • Place Hadoop master roles, ZooKeeper, and journal nodes on 3 virtual machines for optimum performance and to enable high availability.
  • Dedicate the worker nodes to run only the HDFS DataNode, YARN NodeManager, and Spark Executor roles.
  • Use the Hadoop rack awareness feature to place virtual machines belonging to the same physical host in the same rack for optimized HDFS block placement.
  • Run the Hive Metastore in a separate database.
  • Set the YARN cluster container memory and vcores to slightly overcommit both resources.
  • Adjust the task memory and vcore requirement to optimize the number of maps and reduces for each application.
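The first two practices in the list above amount to a small sizing calculation. The sketch below is illustrative only: the function name, the 256 GiB two-node host, and the 6% reservation are assumptions chosen for the example, not values taken from the paper.

```python
def vm_memory_gib(host_memory_gib, numa_nodes, vms_per_numa_node=1,
                  esxi_reserve_fraction=0.06):
    """Memory available to each VM after reserving a share of host memory for ESXi.

    esxi_reserve_fraction=0.06 reflects the "about 5-6%" guideline; the exact
    value for a given host is an assumption and should be validated.
    """
    if numa_nodes < 1 or vms_per_numa_node < 1:
        raise ValueError("need at least one NUMA node and one VM per node")
    usable = host_memory_gib * (1.0 - esxi_reserve_fraction)
    per_node = usable / numa_nodes          # keep each VM inside one NUMA node
    return per_node / vms_per_numa_node


if __name__ == "__main__":
    # Hypothetical host: 256 GiB of RAM, 2 NUMA nodes.
    for vms in (1, 2):
        size = vm_memory_gib(256, numa_nodes=2, vms_per_numa_node=vms)
        print(f"{vms} VM(s) per NUMA node: {size:.2f} GiB per VM")
```

Sizing each virtual machine so that it fits inside a single NUMA node is the reason the calculation divides by the node count before dividing by the number of virtual machines.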

All details are in the paper, Big Data Performance on vSphere 6: Best Practices for Optimizing Virtualized Big Data Applications.
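The rack awareness practice listed above is usually implemented with a topology script, which Hadoop calls with one or more node addresses and which prints one rack path per address. The sketch below is a minimal illustration, assuming the script is registered through the net.topology.script.file.name property in core-site.xml. The VM addresses and ESXi host names in the mapping are hypothetical.

```python
import sys

# Hypothetical mapping from worker VM address to the physical ESXi host it runs on.
# VMs that share a physical host are reported as sharing a "rack".
VM_TO_HOST = {
    "10.0.0.11": "esx01",
    "10.0.0.12": "esx01",
    "10.0.0.21": "esx02",
    "10.0.0.22": "esx02",
}
DEFAULT_RACK = "/default-rack"


def rack_for(address):
    """Return the rack path for one VM address, or the default rack if unknown."""
    host = VM_TO_HOST.get(address)
    return f"/{host}" if host else DEFAULT_RACK


def main(argv):
    # Hadoop passes one or more addresses; print one rack path per argument.
    print(" ".join(rack_for(a) for a in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With a mapping like this, HDFS avoids placing all replicas of a block on virtual machines that share one physical server. Depending on the cluster configuration, Hadoop may pass host names rather than IP addresses, so the mapping keys would need to match.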


vSphere 6.0 U2 Storage Performance with 32Gb Fibre Channel

We compared the I/O performance of vSphere 6.0 U2 over 16Gb and 32Gb Emulex FC HBAs connected via a Brocade G620 FC switch to an EMC VNX7500 storage array.

Iometer, a common microbenchmark, was used to generate the workload for various block sizes. For single-VM experiments, we measured sequential read and sequential write throughput. For multi-VM experiments, we measured random read IOPS and throughput.

Our experiments showed that vSphere 6 can achieve near line rate with 32Gb FC.

For details, please see the whitepaper Storage I/O Performance on VMware vSphere 6.0 U2 over 32 Gigabit Fibre Channel.

How to correctly test the performance of the Virtual SAN 6.2 deduplication feature

In VMware Virtual SAN 6.2, we introduced several features highly requested by customers, such as deduplication and compression. An overview of this feature can be found in the blog: Virtual SAN 6.2 – Deduplication And Compression Deep Dive.

The deduplication feature adds the most benefit to an all-flash Virtual SAN environment because, while SSDs are more expensive than spinning disks, the cost is amortized because more workloads can fit on the smaller SSDs. Therefore, our performance testing is performed on an all-flash Virtual SAN cluster with deduplication enabled.

When testing the performance of the deduplication feature for Virtual SAN, we observed the following:

  • Unexpected deduplication ratio
  • High device read latency in the capacity tier, even though the SSD is perfectly fine

In this blog, we discuss the reason behind these two issues and share our testing experience.

When we tested the performance of Virtual SAN, we decided to use two common tools, IOBlazer and Iometer:

  1. We used IOBlazer to populate the disks. We configured IOBlazer to run 100% large sequential writes. This was to make sure all the blocks were allocated before testing any read-related workload. Some people prefer to zero out all the blocks using the dd command, which has a similar effect.
  2. We then ran an Iometer workload. We set the read percentage, randomness, I/O size, number of outstanding I/Os, and so on.

We found, however, that there were two issues with the above procedure when testing the deduplication feature:

  • Iometer did not support configuring I/O content. In other words, we could not use Iometer to generate I/Os with various deduplication ratios in step 2.
  • We should not have populated the disks using IOBlazer or dd in step 1 because each utility pollutes the disks with random data or zeros, both of which yielded the wrong deduplication ratio for later tests.

To address these issues, we decided to use the Flexible I/O (FIO) benchmark to both populate the disks and run the tests. FIO allowed us to specify the deduplication ratio. By following these steps, we were able to successfully test the deduplication feature in Virtual SAN 6.2:

  1. Run FIO with 100% 4KB sequential write with the given deduplication and compression ratio. This will populate the disks with the desired deduplication and compression ratio.
  2. Run FIO with the specified read/write percentage, I/O size, randomness, number of outstanding I/Os, and deduplication and compression ratio.

Below is a sample configuration file for FIO. We modified the parameters for different tests.

ioengine=libaio ; async I/O engine for Linux
thread ; use thread rather than process
; Test name: 4K_rd70_rand100_dedup0_compr0

[job 1]
[job 2]
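The two-step procedure can be scripted by generating one FIO job file per phase. The Python sketch below is not the configuration used in these tests. It only shows how the parameters named in this post (I/O size, alignment, read percentage, outstanding I/Os, dedupe_percentage, randrepeat) could fit together in a job file. The function name, device paths, queue depth, and runtime are assumptions.

```python
def fio_job(test_name, filenames, rw="randrw", read_pct=70, bs="4k",
            iodepth=16, dedupe_pct=0, compress_pct=0, runtime_s=None):
    """Build the text of an FIO job file with one job section per target disk."""
    lines = [
        f"; Test name: {test_name}",
        "[global]",
        "ioengine=libaio",          # async I/O engine for Linux
        "direct=1",                 # bypass the guest page cache
        "thread",                   # use threads rather than processes
        f"rw={rw}",
        f"bs={bs}",
        f"blockalign={bs}",         # keep I/O aligned to the block size
        f"iodepth={iodepth}",       # outstanding I/Os per job
        "randrepeat=0",             # do not reuse the same random seed
        f"dedupe_percentage={dedupe_pct}",
        f"buffer_compress_percentage={compress_pct}",
    ]
    if rw in ("rw", "readwrite", "randrw"):
        lines.append(f"rwmixread={read_pct}")
    if runtime_s is not None:
        lines += ["time_based", f"runtime={runtime_s}"]
    lines.append("")
    for i, path in enumerate(filenames, start=1):
        lines += [f"[job{i}]", f"filename={path}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    disks = ["/dev/sdb", "/dev/sdc"]   # hypothetical test disks inside the guest
    # Step 1: populate the disks with 4KB sequential writes.
    print(fio_job("prep_4K_seqwrite_dedup0_compr0", disks, rw="write"))
    # Step 2: run the measured workload (70% read, 100% random).
    print(fio_job("4K_rd70_rand100_dedup0_compr0", disks, runtime_s=600))
```

Generating both files from one function keeps the deduplication and compression settings identical between the preparation and measurement phases, which is the point of the procedure above.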

If steps 1 and 2 were not performed properly, the results could be unexpected. To further illustrate that, we take two issues we encountered as examples.

Issue #1: The SSD showed high read latency, but the SSD hardware had no issues

We observed a high device read latency issue with FIO micro-benchmarks. The high read latency occurred because we were issuing a large number of concurrent I/Os (outstanding I/Os, also known as OIOs) to the same Logical Block Address (LBA), or to a small range of LBAs, on the SSD. This is more likely to happen with any type of deduplication solution, regardless of the storage vendor.

To resolve this issue, we first performed a test to learn the behavior of the SSD device. The results below show the read latency to one address with an increasing number of outstanding I/Os.

4KB read from the same LBA:
1 OIO Latency: 0.12 ms
16 OIOs Latency: 1.51 ms
32 OIOs Latency: 3.06 ms
64 OIOs Latency: 6.07 ms
128 OIOs Latency: 12.08 ms
256 OIOs Latency: 12.68 ms

When we issued multiple OIOs to a single 4KB block, those I/Os were serialized to one single channel inside the SSD device that was connected to that offset. In other words, we lost the benefits of the SSD’s internal parallelism (from multiple channels). The device latency rose as we increased the number of OIOs. High OIO to the same LBA (or a small range of LBAs) caused high device read latency.
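The serialization effect can be checked with simple arithmetic on the latencies listed above. By Little's law, the sustained rate is roughly the number of outstanding I/Os divided by the latency. The Python sketch below applies that relation to the measured values. The helper name is illustrative, and the calculation is a back-of-the-envelope estimate, not part of the original test.

```python
# Measured 4KB read latency (ms) to a single LBA, keyed by outstanding I/Os.
latency_ms = {1: 0.12, 16: 1.51, 32: 3.06, 64: 6.07, 128: 12.08, 256: 12.68}


def implied_iops(oio, lat_ms):
    """Little's law: throughput = concurrency / latency."""
    return oio / (lat_ms / 1000.0)


if __name__ == "__main__":
    for oio, lat in latency_ms.items():
        print(f"{oio:>3} OIO  {lat:>5.2f} ms  ~{implied_iops(oio, lat):,.0f} IOPS")
```

From 16 to 128 OIOs the implied rate stays near 10,500 IOPS while latency doubles with each step, which is what a single serialized channel would produce: adding more outstanding I/Os only lengthens the queue.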

In the extreme case where we prepared the disk by zeroing out all the blocks, all the data was deduplicated to one block. As a result, the upcoming read I/O was issued to the same device address, which caused high device read latency as discussed above.

Figures 1 and 2 (stats from the Virtual SAN Observer tool) show sample results from our test. (Even though the screenshots show HDDs, our test was on an all-flash Virtual SAN cluster. The HDDs in the graph actually mean SSDs used as capacity tier devices.) As can be seen, inside one disk group, one capacity tier SSD (naa.55cd2e404ba2ce71 in Figure 1 or naa.55cd2e404ba535b7 in Figure 2) always shows higher read latency than the other capacity tier SSDs. This is because we zeroed out the data blocks before running the test. Later in the test, a large number of outstanding read I/Os were issued to a single address on that SSD.

Note: In Figures 1 and 2, where there are no units specified, the unit is milliseconds. Where an “m” is specified, the unit is microseconds. Where “k” is specified, the unit is thousands.


Figure 1. Sample test 1 showed the first capacity-tier SSD (“HDD”) to have up to 3 milliseconds latency, which is much higher than the next two capacity-tier SSDs (also labelled “HDD”), which show just below 150 microseconds of latency.


Figure 2. Sample test 2 is similar to sample test 1. The bottom capacity-tier SSD (“HDD”) shows up to 6 milliseconds of latency, whereas the first two show slightly over 100 microseconds.

Issue #2: Deduplication ratio was not what we set

Because Virtual SAN distributes multiple virtual disks across its datastore, it is hard to determine the exact deduplication ratio of the data that the workload generated. In the FIO configuration file, we changed the dedupe_percentage to a desired value. However, in the testing system, there were a couple of factors that affected the actual deduplication ratio reported by Virtual SAN.

  • I/Os from other virtual disks (vmdk files) can contain duplicate data. In the FIO configuration file, if the randrepeat parameter is set to 1, FIO will use the same random seed for all the disks. Although the data pattern obeys the dedupe_percentage set by the user for each vmdk, a large share of the data will be duplicated across vmdks. Note that those vmdks are placed on the same Virtual SAN datastore, which means that datastore will see more duplicated data than specified.
  • I/O size when preparing disks will affect the deduplication ratio. Currently, Virtual SAN uses a 4KB chunk size as the unit to calculate the deduplication ratio. If the user prepares the disk with an I/O size other than 4KB, Virtual SAN could see a different deduplication ratio. Likewise, if the I/O is not aligned to 4KB (blockalign parameter), Virtual SAN could also observe a different deduplication ratio.
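The effect of a shared random seed can be reproduced with a toy model. The Python sketch below does not use FIO's actual data generator. It fills several simulated disks with random 4KB blocks and measures the share of blocks that a datastore-wide, 4KB-granular deduplication pass could eliminate. The disk count, block count, and function names are assumptions made for illustration.

```python
import hashlib
import random

CHUNK = 4096  # Virtual SAN computes the deduplication ratio on 4KB chunks


def write_disk(seed, blocks):
    """Simulate preparing one virtual disk with `blocks` random 4KB chunks."""
    rng = random.Random(seed)
    return [rng.getrandbits(CHUNK * 8).to_bytes(CHUNK, "little")
            for _ in range(blocks)]


def dedup_ratio(disks):
    """Fraction of chunks that are duplicates across all disks on the datastore."""
    total = 0
    unique = set()
    for disk in disks:
        for block in disk:
            total += 1
            unique.add(hashlib.sha1(block).digest())
    return 1.0 - len(unique) / total


if __name__ == "__main__":
    same_seed = [write_disk(seed=1, blocks=64) for _ in range(4)]
    own_seeds = [write_disk(seed=s, blocks=64) for s in range(4)]
    print(f"same seed on 4 disks:      {dedup_ratio(same_seed):.0%}")
    print(f"different seed per disk:   {dedup_ratio(own_seeds):.0%}")
```

Each simulated disk contains no internal duplicates, matching a 0% setting per disk, yet the datastore-wide ratio rises sharply when every disk is written from the same seed. The more disks that share a seed, the larger the gap between the configured and the observed ratio.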

Figure 3 (below) shows a sample test in which we ran FIO with a 0% deduplication ratio. Due to the issues described above, Virtual SAN erroneously reports about an 80% deduplication ratio (shown in blue).


Figure 3. The blue line shows a deduplication percentage of about 80%, even though we set deduplication to be 0%.

To avoid these problems, we suggest performance testers use a 4KB I/O size (and aligned) and set randrepeat to 0 to prepare the disk in order to get the desired deduplication ratio. Note that Virtual SAN can properly handle any type of I/O configuration. The purpose of this blog is to explain the possible discrepancy between the FIO-specified dedupe_percentage and the Virtual SAN reported deduplication ratio if performance testers use different I/O configurations to evaluate the Virtual SAN datastore.


Figure 4. No blue line is shown, indicating the correct deduplication of 0%.



DRS Doctor is here to diagnose your DRS clusters

Mystery revealed: DRS for VMware vSphere is no longer a black box! DRS Doctor will tell you all you need to know about your DRS clusters.

Our latest fling, DRS Doctor, will monitor your DRS clusters for virtual machine and host resource usage data, DRS-recommended migrations, and the reason behind each migration. It also monitors all the cluster-related events, tasks, and cluster balance, and logs all this information into a plain text log file that anyone can read.

Read this blog for more information on how DRS Doctor can monitor and diagnose your clusters.

Download DRS Doctor from our flings site.

Virtual SAN 6.2 Performance with OLTP and VDI Workloads

Virtual SAN is a VMware storage solution that is tightly integrated with vSphere—making storage setup and maintenance in a vSphere virtualized environment fast and flexible. Virtual SAN 6.2 adds several features and improvements, including additional data integrity with software checksum, space efficiency features of RAID-5 and RAID-6, deduplication and compression, and an in-memory client read cache.

We ran several tests to compare the performance of Virtual SAN 6.1 and 6.2 to make sure they were on par with each other. In addition, we wanted to know how new feature performance compared to a 6.2 baseline with no new features enabled. The tests used benchmark workloads that simulate real-world activities in online stores and brokerage firms (online transaction processing, or OLTP) and in a virtual desktop infrastructure (VDI) environment. We published the test results in the following papers:

One such test used a workload that simulated typical user actions in an online brokerage application. The following graphs show the virtual machine IOPS per host and the disk space savings for the Brokerage workload.


6.2 R5 (RAID-5 configured) maintains almost the same IOPS per host as 6.2 but brings down the space usage in the cluster from about 3200GiB to 2280GiB, a 29% space saving. 6.2 D, which has deduplication and compression enabled, brings IOPS down by 25% to about 30,000, but meanwhile saves 88% disk space, taking only 373GiB on the disks. In the 6.2 R5+D case, where RAID-5 is used together with deduplication and compression, IOPS are down further by 11% to 25,000. However, the disk space saving is 92%, using only 261GiB space in the cluster.
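The space-saving percentages follow directly from the reported usage figures. The short Python check below recomputes them against the roughly 3200GiB baseline. The variable and function names are illustrative, and because the baseline is quoted as approximate, the results are rounded to whole percentages.

```python
BASELINE_GIB = 3200  # approximate space used by plain 6.2, as quoted above

usage_gib = {"6.2 R5": 2280, "6.2 D": 373, "6.2 R5+D": 261}


def saving_pct(used_gib, baseline_gib=BASELINE_GIB):
    """Percentage of disk space saved relative to the baseline configuration."""
    return 100.0 * (1.0 - used_gib / baseline_gib)


if __name__ == "__main__":
    for config, used in usage_gib.items():
        print(f"{config:<9} {used:>5} GiB  {saving_pct(used):.0f}% saved")
```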

Please note that the substantial space saving is observed because a single Brokerage workload virtual machine contains duplicable and compressible data that can be reduced significantly by the deduplication and compression feature in Virtual SAN 6.2. (The actual space saving in your production environment will depend on the workload.)

Read more test results and find more information about test configuration in the published papers for OLTP and VDI workloads.

Peeking At The Future with Giant Monster Virtual Machines

Remember that cool project with VMware, HP Enterprise, and IBM where four super huge monster virtual machines (VMs) of 120 vCPUs each were all running at the same time on a single server with great performance? 

That was Project Capstone, and it was presented at VMworld San Francisco and VMworld Barcelona last fall as a spotlight session. The follow-up whitepaper is now complete and published, so plenty of technical detail is available, along with the testing results and analysis.

In addition to the four 120 vCPU VMs test, additional configurations were also run with eight 60 vCPU VMs and sixteen 30 vCPU VMs.  This shows that plenty of large VMs can be run on a single host with excellent performance when using a solution that supports tons of CPU capacity and cutting edge flash storage.

The whitepaper not only contains all of the test results from the original presentation, but also includes additional details around the performance of CPU Affinity vs PreferHT and under-provisioning. There is also a best practices section that is focused on running monster VMs.


Tutorial Session on Performance Debugging on VMware vSphere

Ever wondered what it takes to debug performance issues on a VMware stack? How do you figure out if the performance issue is in your virtual machine, or the network layer, or the storage layer, or the hypervisor layer?

Here’s a handy tutorial that showcases a systematic approach for troubleshooting performance using tools like Esxtop, vSCSI stats and Net stats on a VMware stack. The tutorial also talks about some very useful optimizations and performance best practices.

Thanks to Ramprasad K. S. for putting together the slides based on his vast experience dealing with customer issues. Thanks also to Ramprasad and Sai Inabattini for presenting this at the CMG India 2nd Annual Conference in Bangalore in November 2015, where it was very well received.

Fault Tolerance Performance in vSphere 6

VMware has published a technical white paper about vSphere 6 Fault Tolerance architecture and performance. The paper describes which types of applications work best in virtual machines with vSphere FT enabled.

VMware vSphere Fault Tolerance (FT) provides continuous availability to virtual machines that require a high amount of uptime. If the virtual machine fails, another virtual machine is ready to take over the job.  vSphere achieves FT by maintaining primary and secondary virtual machines using a new technology named Fast Checkpointing. This technology is similar to Storage vMotion, which copies the virtual machine state (storage, memory, and networking) to the secondary ESXi host. Fast Checkpointing keeps the primary and secondary virtual machines in sync.

vSphere FT works with (and requires) vSphere HA—when an administrator enables FT, vSphere HA selects the secondary VM (admins can vMotion the VM to another server if needed). vSphere HA also creates a new secondary if the primary fails—the original secondary becomes the new primary, and vSphere HA selects an available virtual machine to use as the new secondary.

vSphere 6 FT supports applications with up to 4 vCPUs and 64GB memory on the ESXi host. The performance study shows results for various workloads run on virtual machines with 1, 2, and 4 vCPUs.

The workloads—which tax the virtual machine’s CPU, disk, and network—include:

  • Kernel compile – loads the CPU at 100%
  • Netperf – measures network throughput and latency
  • Iometer – characterizes the storage I/O of a Microsoft Windows virtual machine
  • Swingbench – drives an OLTP load on a virtual machine running Oracle 11g
  • DVD Store – drives an OLTP load on a virtual machine running Microsoft SQL Server 2012
  • A brokerage workload – simulates an OLTP load of a brokerage firm
  • vCenterServer workload – simulates actions performed in vCenter Server

Testing shows that vSphere FT can successfully protect a number of workloads like CPU-bound workloads, I/O-bound workloads, servers, and complex database workloads; however, admins should not use vSphere FT to protect highly latency-sensitive applications like voice-over-IP (VOIP) or high-frequency trading (HFT).

For the results of these tests, read the paper. Also useful is the VMware Fault Tolerance FAQ.

Virtualizing Performance-Critical Database Applications in VMware vSphere 6.0

by Priti Mishra

Previous performance studies have shown that virtualized servers can run a variety of applications at performance levels near, or in some cases even above, those of software running natively (on bare metal). In a new white paper, we raise the bar higher and test “monster” vSphere virtual machines loaded with CPU and running the most taxing databases and transaction processing applications.

The benchmark workload, which we call Order-Entry, is based on an industry-standard online transaction processing (OLTP) benchmark called TPC-C. Both rigorous and demanding, the Order-Entry workload pushes virtual machine performance.

Note: The Order-Entry benchmark is derived from the TPC-C workload, but is not compliant with the TPC-C specification, and its results are not comparable to TPC-C results.

The white paper quantifies the:

  • Performance differential between ESXi 6.0 and native
  • Performance differential between ESXi 6.0 and ESXi 5.1
  • Performance gains due to enhancements built into ESXi 6.0

Results from these experiments show that even the most demanding applications can be run, with excellent performance, in a virtualized environment with ESXi 6.0.  For example, our test results show that ESXi 6.0 virtual machines run out of the box at 90% of the performance of native systems. In addition, a 64-vCPU, 475GB VM processes 59.5K DBMS transactions per second while issuing 155K IOPS, capabilities well above even the high-end Oracle database installations. Even for applications that may require 64 or 128 vCPUs, the high-end performance boost of ESXi 6.0 over ESXi 5.1 makes ESXi 6.0 the best platform for virtualizing databases such as Oracle.

ESXi 6.0 Performance Relative to Native

With a 64-vCPU VM running on a 72-pCPU ESXi host, throughput was 90% of native throughput on the same hardware platform. Statistics which give an indication of the load placed on the system in the native and virtual machine configurations are summarized in Table 1.

Metric                                       Native                        VM
Throughput in transactions per second        66.5K                         59.5K
Average CPU utilization of 72 logical CPUs   84.7%                         85.1%
Disk IOPS                                    173K                          155K
Disk Megabytes/second                        929MB/s                       831MB/s
Network packets/second                       71K/s receive, 71K/s send     63K/s receive, 64K/s send
Network Megabytes/second                     15MB/s receive, 36MB/s send   13MB/s receive, 32MB/s send

Table 1. Comparison of Native and Virtual Machine Benchmark Load Profiles
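The 90% figure can be cross-checked against the other rows of Table 1. The Python sketch below computes the VM-to-native ratio for throughput and disk activity, plus the number of disk I/Os issued per transaction in each configuration. The dictionary layout and names are illustrative. The numbers are those from the table.

```python
# (native, VM) pairs from Table 1; throughput and IOPS are in thousands.
table1 = {
    "throughput (K tps)": (66.5, 59.5),
    "disk IOPS (K)": (173.0, 155.0),
    "disk MB/s": (929.0, 831.0),
}


def vm_to_native(native, vm):
    """Ratio of the virtualized value to the native value."""
    return vm / native


if __name__ == "__main__":
    for metric, (native, vm) in table1.items():
        print(f"{metric:<20} VM/native = {vm_to_native(native, vm):.1%}")

    native_tps, vm_tps = table1["throughput (K tps)"]
    native_iops, vm_iops = table1["disk IOPS (K)"]
    print(f"I/Os per transaction: native {native_iops / native_tps:.2f}, "
          f"VM {vm_iops / vm_tps:.2f}")
```

Disk IOPS and disk bandwidth scale by nearly the same ratio as throughput, and the I/Os per transaction are nearly identical in both configurations, which suggests that the virtual machine does the same work per transaction at a slightly lower rate.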


The corresponding guest statistics in Table 2 provide another perspective on the resource-intensive nature of the workload. These common Linux performance metrics show that while the benchmark workload was heavy in terms of raw CPU demands, it also placed a heavy load on the operating system, interrupt handling, and the storage subsystem, areas that have traditionally been associated with high virtualization overheads.


Metric Amount
Interrupts per second 327K
Disk IOPS 155K
Context switches per second 287K
Load average 231

Table 2. Guest OS Statistics

ESXi 6.0 Performance Relative to ESXi 5.1

Experimental data comparing ESXi 6.0 with ESXi 5.1 (see Figures 1 and 2) show that high-end scale-up with ESXi 6.0 mirrors that of native systems.


Figure 1. Absolute throughput values

With ESXi 5.1, the Order-Entry benchmark throughput of a 64-vCPU VM on a 4-socket, 32 core/64 thread E7-4870 (Westmere) server was 70% of the throughput of the same server in native mode when both servers were running at 77% CPU utilization (the native server reached a maximum CPU utilization of 88% and throughput of 54.8 transactions per second).


Figure 2. Relative throughput ratios

vSphere has the capability to handle loads far larger than those demanded by most Oracle database applications in production. Support for monster VMs with up to 128 vCPUs, throughput that is 90% of native, and a significant performance boost over ESXi 5.1 make ESXi 6.0 an excellent platform for virtualizing very high-end Oracle databases.

For details regarding experiments and the performance enhancements in vSphere, please read the paper.

VMware vCloud Air Database Performance Scalability with SQL Server

Previous posts have shown vSphere can easily handle running Microsoft SQL Server on four-socket servers with large numbers of cores—with vSphere 5.5 on Westmere-EX and more recently with vSphere 6 on Ivy Bridge-EX.  We recently ran similar tests on vCloud Air to measure how these enterprise databases with mission critical performance requirements perform in a cloud environment. The tests show that SQL Server databases scale very well on vCloud Air with a variety of virtual machine (VM) counts and virtual CPU (vCPU) sizes.

The benchmark tests were run with vCloud Air using their Virtual Private Cloud (VPC) subscription-based service.  This is a very compelling hybrid cloud service that allows for an on-premises vSphere infrastructure to be expanded into the public cloud in a secure and scalable way. The underlying host hardware consisted of two 8-core CPUs for a total of 16 physical cores, which meant that the maximum number of vCPUs was 16 (although additional processors were available via Hyper-Threading, they were not utilized).

Windows Server 2012 R2 was the guest OS, and SQL Server 2012 Standard edition was the database engine used for all the VMs.  All databases were placed on an SSD Accelerated storage tier for maximum disk I/O performance.  The test configurations are summarized below:

# VMs, # vCPUs, Memory configurations tested

DVD Store 2.1 (an open-source OLTP database stress tool) was the workload used to stress the VMs. The first experiment was to scale up the number of 4 vCPU VMs. The graph below shows that as the number of VMs is increased from 1 to 4, the aggregate performance (measured in orders per minute, or OPM) increases correspondingly:

When the size of each VM was doubled from 4 to 8 virtual CPUs, the OPM also approximately doubled for the same number of VMs, as shown in the chart below.

This final chart includes a test run with one large 16 vCPU VM. As expected, the 16 vCPU performance was similar to the four 4 vCPU VM and two 8 vCPU VM test cases. The slight drop can be attributed to spanning multiple physical processors and thus multiple NUMA nodes within a single VM.


In summary, SQL Server was found to perform and scale extremely well running on vCloud Air with 4, 8, and 16 vCPU VMs.  In the future, look for more benchmarks in the cloud as it continues to evolve!

For more information on vCloud Air, check out these third-party studies from Principled Technologies that compare it to competitive offerings, namely Microsoft Azure and Amazon Web Services (AWS):