We have published an ESX IP Storage Troubleshooting Best Practice white paper in which we recommend vSphere customers deploying ESX IP storage over 10G networks to include 10G packet capture systems as a best practice to ensure network visibility.
The white paper explores the challenges and alternatives for packet capture in a vSphere environment with IP storage (NFS, iSCSI) datastores over a 10G network, and explains why traditional techniques for capturing packet traces on 1G networks will suffer from severe limitations (capture drops and inaccurate timestamps) when used for 10G networks. Although commercial 10G packet capture systems are commonly available, they may be beyond the budget of some vSphere customers. We present the design of a self-assembled 10G packet capture solution that can be built using commercial components relatively inexpensively. The self-assembled solution is optimized for common troubleshooting scenarios where short duration packet captures can satisfy most analysis requirements.
Our experience troubleshooting a large number of IP storage issues has shown that the ability to capture and analyze packet traces in an ESX IP storage environment can significantly reduce the mean time to resolution for serious functional and performance issues. When reporting an IP storage problem to VMware or to a storage array vendor, an accompanying packet trace file is a great piece of evidence that can significantly reduce the time required by the responsible engineering teams to identify the problem.
We were able to get one of the new four-socket Intel Skylake based servers and run some more tests. Specifically we used the Xeon Platinum 8180 processors with 28 cores each. The new data has been added to the Oracle Monster Virtual Machine Performance on VMware vSphere 6.5 whitepaper. Please check out the paper for the full details and context of these updates.
The generational testing in the paper now includes a fifth generation with a 112 vCPU virtual machine running on the Skylake based server. Performance gain from the initial 40 vCPU VM on Westmere-EX to the Skylake based 112 vCPU VM is almost 4x.
The performance gained from Hyper-Threading was also updated and shows a 27% performance gain from the use of Hyper-Threads. The test was conducted by running two 112 vCPU VMs at the same time so that all 224 logical threads are active. The total throughput from the two VMs is then compared with the throughput from a single VM.
My colleague David Morse has also updated his SQL Server monster virtual machine whitepaper with Skylake data as well.
The hugely popular Performance Troubleshooting for VMware vSphere 4 guide is now updated for vSphere 4.1 . This document provides step-by-step approaches for troubleshooting most common performance problems in vSphere-based virtual environments. The steps discussed in the document use performance data and charts readily available in the vSphere Client and esxtop to aid the troubleshooting flows. Each performance troubleshooting flow has two parts:
How to identify the problem using specific performance counters.
Possible causes of the problem and solutions to solve it.
New sections that were added to the document include troubleshooting performance problems in resource pools on standalone hosts and DRS clusters, additional troubleshooting steps for environments experiencing memory pressure (hosts with compressed and swapped memory), high CPU ready time in hosts that are not CPU saturated, environments sharing resources such as storage and network, and environments using snapshots.
The Troubleshooting guide can be found here. Readers are encouraged to provide their feedback and comments in the performance community site at this link.
Oracle Real Application Clusters (RAC) is used to run critical databases with stringent performance requirements. A series of tests recently were run in the VMware performance lab to determine how an Oracle RAC database performs when running on vSphere. The test results showed that the application performed within 11 to 13 percent of physical when running in a virtualized environment.
Two servers were used for both physical and virtual tests. Two Dell PowerEdge R710s with 2x Intel Xeon x5680 six-core processors and 96GB of RAM were connected via Fibre Channel to a NetApp FAS6030 array. The servers were dual booted between Red Hat Enterprise Linux 5.5 and vSphere ESXi 4.1. Each server was connected via three gigabit Ethernet NICs to a shared switch. One NIC was used for the public network and the other two were used for interconnect and cluster traffic.
The NetApp storage array had a total of 112 10K RPM 274GB Fibre Channel disks. Two 200GB LUNs, backed by a total of 80 disks, were used to create a data volume in Oracle ASM. Each data LUN was backed by a 40 disk RAID DP aggregate on the storage array. A 100GB log LUN was created on another volume that was backed by a 26 disk RAID DP aggregate. An additional small 2GB LUN was created to be used as the voting disk for the RAC cluster.
Each VM was configured with 32GB of RAM, three VMXNET3 virtual NICs, and a PVSCSI adapter for all the LUNs used except the OS disk. In order for the VMs to be able to share disks with physical hosts, it was necessary to mount the disks as RDMs and put the virtual SCSI adapter into physical compatibility mode. Additionally, to achieve the best performance for the Oracle RAC interconnect, the VMXNET3 NICs were configured with ethernetX.intrmode =1 in the vmx file. This option is a work around for an ESX performance bug that is specific to RHEL 5.5 VMs and to extremely latency sensitive workloads. The additional configuration option is no longer needed starting with ESX 4.1u1 because the bug is fixed starting with that version.
A four node Oracle RAC cluster was created with two virtual nodes and two physical nodes. The virtual nodes were hosted on a third server when the two servers used for testing were booted to the native RHEL environment. RHEL 5.5 x64 and Oracle 11gR2 were installed on all nodes. During tests the two servers used for testing were booted either to native RHEL or ESX for the physical or virtual tests respectively. This meant that only the two virtual nodes or the two native nodes were powered on during a physical or virtual test. The diagrams below show the same test environment when setup for the two node physical or virtual test.
Physical Test Diagram:
Virtual Test Diagram:
The servers used in testing have a total of 12 physical cores and 24 logical threads if hyperthreading is enabled. The maximum number of vCPUs per VM supported by ESXi 4.1 is eight. This made it necessary to limit the physical server to a smaller number of cores to enable a performance comparison. Using the BIOS settings of the server, hyperthreading was disabled and the number of cores limited to two and four per socket. This resulted in four and eight core physical server configurations that were compared with VM configurations of four and eight vCPUs. Limiting the physical server configurations was only done to enable a direct performance comparison and is clearly not a good way to configure a system for performance normally.
Open source DVD Store 2.1 was used as the workload for the test. DVD Store is an OLTP database workload that simulates customers logging on, browsing, and purchasing DVDs from an online store. It includes database build scripts, load files, and driver programs. For these tests, the database driver was used to directly load the database without a need to have the Web tier installed. Using the new DVD Store 2.1 functionality, two custom-size databases of 50GB each with a 12GB SGA were created as two different instances named DS2 and DS2B. Both instances were running on both nodes of the cluster and were accessed equally on each node.
Running an equal amount of load against each instance on each node was done with both the four CPU and eight CPU test cases. DS2 and DS2B instances spanned all nodes and were actively used on all nodes. An equal amount of threads were connected for each instance on each node. The amount of work was scaled up with the number of processors: twice as many DVD Store driver threads were used in the eight CPU case as compared with the four CPU case. For example, a total of 40 threads were running against node one in the four CPU test with 20 accessing DS2 and 20 accessing DS2B. Another 40 threads were accessing DS2 and DS2B on node two at the same time during that test. CPU utilization of the physical hosts and VMs were above 95% in all tests. Results are reported in terms of Orders Per Minute (OPM) and Average Response Time (RT) in milliseconds.
In both the OPM and RT measurements, the virtual RAC performance was within 11 to 13 percent of the physical RAC performance. In an intensive test running on Oracle RAC, the CPU, disk, and network were heavily utilized, but virtual performance was close to native performance. This result removes a barrier from considering virtualizing one of the more performance-intensive tier-one applications in the datacenter.