
Monthly Archives: September 2013

Performance Best Practices for vSphere 5.5 is Available

We are pleased to announce the availability of Performance Best Practices for vSphere 5.5. This is a book designed to help system administrators obtain the best performance from vSphere 5.5 deployments.

The book addresses many of the new features in vSphere 5.5 from a performance perspective. These include:

  • vSphere Flash Read Cache, a new feature in vSphere 5.5 allowing flash storage resources on the ESXi host to be used for read caching of virtual machine I/O requests.
  • VMware Virtual SAN (VSAN), a new feature (in beta for vSphere 5.5) allowing storage resources attached directly to ESXi hosts to be used for distributed storage and accessed by multiple ESXi hosts.
  • The VMware vFabric Postgres database (vPostgres).

We’ve also updated and expanded on many of the topics in the book. These include:

  • Running storage latency and network latency sensitive applications
  • NUMA and Virtual NUMA (vNUMA)
  • Memory overcommit techniques
  • Large memory pages
  • Receive-side scaling (RSS), both in guests and on 10 Gigabit Ethernet cards
  • VMware vMotion, Storage vMotion, and Cross-host Storage vMotion
  • VMware Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM)
  • VMware Single Sign-On Server

The book can be found here.

Line-Rate Performance with 80GbE and vSphere 5.5

With the increasing number of physical cores in a system, the networking bandwidth requirement per server has also increased. We often find that many networking-intensive applications are now placed on a single server, which results in a single vSphere server requiring more than one 10 Gigabit Ethernet (GbE) adapter. Additional network interface cards (NICs) are also deployed to separate management traffic from the actual virtual machine traffic. It is important for these servers to service the connected NICs well and to drive line rate on all the physical adapters simultaneously.

vSphere 5.5 supports eight 10GbE NICs on a single host, and we demonstrate that a host running with vSphere 5.5 can not only drive line rate on all the physical NICs connected to the system, but can do it with a modest increase in overall CPU cost as we add more NICs.

We configured a single host with four dual-port Intel 10GbE adapters for the experiment and connected them back-to-back with an IXIA Application Network Processor Server with eight 10GbE ports to generate traffic. We then measured the send/receive throughput and the corresponding CPU usage of the vSphere host as we increased the number of NICs under test on the system.

Environment Configuration

  • System Under Test: Dell PowerEdge R820
  • CPUs: 4 x Intel Xeon Processor E5-4650 @ 2.70GHz
  • Memory: 128GB
  • NICs: 8 x Intel 82599EB 10GbE SFP+ Network Connection
  • Client: Ixia Xcellon-Ultra XT80-V2, 2U Application Network Processor Server

Challenges in Getting 80Gbps Throughput

To drive near 80 gigabits of data per second from a single vSphere host, we used a server that has not only the required CPU and memory resources, but also enough PCI bandwidth to perform the necessary I/O operations. We used a Dell PowerEdge server with Intel E5-4650 processors because they belong to the first generation of Intel processors that supports PCI Gen 3.0, which doubles the PCI bandwidth available compared to PCI Gen 2.0. Each dual-port Intel 10GbE adapter needs at least a PCI Gen 2.0 x8 slot to reach line rate. The processor also has Intel Data Direct I/O Technology, which places incoming packets directly in the processor cache rather than in memory. This reduces memory bandwidth consumption and also helps reduce latency.
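As a rough sanity check on those slot requirements, the short calculation below compares the nominal bandwidth of PCI Gen 2.0 and Gen 3.0 x8 slots against the needs of a dual-port 10GbE adapter. The per-lane rates used here are the usual published effective figures, not numbers from the original test.

    # Back-of-the-envelope PCIe bandwidth check for a dual-port 10GbE adapter.
    # Per-lane rates are nominal effective figures after encoding overhead:
    # PCIe Gen 2.0 ~500 MB/s per lane (8b/10b), Gen 3.0 ~985 MB/s per lane (128b/130b).
    GEN2_LANE_MBPS = 500
    GEN3_LANE_MBPS = 985
    LANES = 8  # x8 slot

    dual_port_need_gbps = 2 * 10                        # two 10GbE ports at line rate
    gen2_x8_gbps = GEN2_LANE_MBPS * LANES * 8 / 1000    # ~32 Gb/s
    gen3_x8_gbps = GEN3_LANE_MBPS * LANES * 8 / 1000    # ~63 Gb/s

    print(f"Dual-port 10GbE needs ~{dual_port_need_gbps} Gb/s per direction")
    print(f"PCIe Gen 2.0 x8: ~{gen2_x8_gbps:.0f} Gb/s, Gen 3.0 x8: ~{gen3_x8_gbps:.0f} Gb/s")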

Experiment Overview

Each 10GbE port of the vSphere 5.5 server was configured with a separate vSwitch, and each vSwitch had two Red Hat Enterprise Linux 6.0 virtual machines running an instance of the Apache web server. The web server virtual machines were configured with 1 vCPU and 2GB of memory with VMXNET3 as the virtual NIC adapter. The 10GbE ports were then connected to the Ixia Application Server ports. Since the server had two x16 slots and five x8 slots, we used the x8 slots for the four 10GbE NICs so that each physical NIC had identical resources. For each physical connection, we then configured 200 web/HTTP connections on the Ixia server, 100 for each web server, that requested or posted the file. We used a high number of connections so that there was enough networking traffic to keep the physical NICs at 100% utilization.

Figure 1. System design of NICs, switches, and VMs

The Ixia Xcellon application server used HTTP GET requests to generate a send workload for the vSphere host. Each connection requested a 1MB file from the HTTP web server.
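The Ixia appliance is purpose-built load-generation hardware, but the shape of this send workload can be illustrated with a simple concurrent HTTP client. The following is a minimal sketch only; the URL and connection count are placeholders, not the actual test harness.

    # Minimal sketch of the send-side workload: many concurrent HTTP GET requests,
    # each repeatedly fetching a 1MB file from a web server VM. Illustration only --
    # the actual test used an Ixia Xcellon-Ultra appliance. Runs until interrupted.
    import concurrent.futures
    import urllib.request

    URL = "http://10.0.0.10/1mb.bin"    # placeholder address of a web server VM
    CONNECTIONS = 100                   # connections per web server in the test

    def fetch_forever(url: str) -> None:
        while True:
            with urllib.request.urlopen(url) as resp:
                resp.read()             # pull the 1MB payload, then request again

    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as pool:
        for _ in range(CONNECTIONS):
            pool.submit(fetch_forever, URL)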

Figure 2 shows that we could consistently reach the available[1] line rate for each physical NIC as we added more NICs to the test. Each physical NIC was transmitting 120K packets per second, and the average TSO packet size was close to 10KB. Each NIC was also receiving about 400K acknowledgement packets per second, so the total number of packets processed per second was close to 500K for each physical connection.

Figure 2. vSphere 5.5 drives throughput at available line rates. TSO on the NIC resulted in lower packets per second for send.

Similar to the send case, we configured the application server to post a 1MB file using an HTTP POST request to generate receive traffic for the vSphere host. We used the same number of connections and observed similar throughput behavior. Since the NICs do not support hardware LRO, each NIC was receiving 800K packets per second; with eight 10GbE NICs, the packet rate reached close to 6.4 million packets per second. vSphere performs software LRO for Linux guests, so the guests see large packets, and the guest packet rate is around 240K packets per second. There was also significant TCP acknowledgement traffic: the host was transmitting close to 120K acknowledgement packets per second for each physical NIC, bringing the total packets processed to close to 7.5 million per second for eight 10GbE ports.
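For reference, this short calculation tallies the per-NIC packet rates quoted above into the aggregate figure; the inputs are the approximate values from the text, so the result comes out slightly under the quoted ~7.5 million because of rounding.

    # Rough bookkeeping of the receive-side packet rates described above.
    NICS = 8
    RX_PPS_PER_NIC = 800_000       # received data packets per NIC (no hardware LRO)
    ACK_TX_PPS_PER_NIC = 120_000   # acknowledgement packets transmitted per NIC

    total_rx = NICS * RX_PPS_PER_NIC               # ~6.4 million packets/s
    total_ack = NICS * ACK_TX_PPS_PER_NIC          # ~1 million packets/s
    print(f"Total packets processed: ~{(total_rx + total_ack) / 1e6:.1f} M packets/s")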

Figure 3. Average vSphere 5.5 host CPU utilization for send and receive

We also measured the average CPU utilization for each of the tests. Figure 3 shows that the vSphere host's CPU usage increased linearly as we added more physical NICs to the test, for both send and receive. This indicates that the CPU cost scales proportionally with the added network throughput, at an expected and acceptable rate.

Test results show that vSphere 5.5 is an excellent platform on which to deploy networking-intensive workloads. vSphere 5.5 makes use of all the physical bandwidth capacity available and does so with only a modest, proportional increase in CPU cost.

 


[1] A 10GbE NIC can achieve only about 9.4Gbps of throughput with the standard MTU. For a 1500-byte packet, 40 bytes go to the TCP/IP headers, and each frame carries an additional 38 bytes of Ethernet framing overhead.
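The footnote's figure can be reproduced with a short calculation using those per-packet overheads:

    # Effective TCP throughput of a 10GbE link with a standard 1500-byte MTU,
    # using the per-packet overheads quoted in the footnote.
    MTU = 1500                # bytes of IP payload per frame
    TCP_IP_HEADERS = 40       # bytes of TCP/IP headers
    ETHERNET_OVERHEAD = 38    # bytes of Ethernet framing overhead per frame

    payload = MTU - TCP_IP_HEADERS              # 1460 bytes of application data
    wire_bytes = MTU + ETHERNET_OVERHEAD        # 1538 bytes on the wire
    goodput_gbps = 10 * payload / wire_bytes    # ~9.5 Gb/s, i.e. roughly 9.4 Gb/s
    print(f"Achievable throughput: ~{goodput_gbps:.2f} Gb/s")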

Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5

VMware vSphere ensures that virtualization overhead is minimized so that it is not noticeable for a wide range of applications, including most business-critical applications such as database systems, web applications, and messaging systems. vSphere also handles applications with millisecond-level latency constraints well, including VoIP services. However, the performance demands of latency-sensitive applications with very low latency requirements, such as distributed in-memory data management, stock trading, and high-performance computing, have long been thought to be incompatible with virtualization.

vSphere 5.5 includes a new feature for setting latency sensitivity in order to support virtual machines with strict latency requirements. This per-VM feature allows virtual machines to exclusively own physical cores, thus avoiding overhead related to CPU scheduling and contention. A recent performance study shows that using this feature combined with pass-through mechanisms such as SR-IOV and DirectPath I/O helps to achieve near-native performance in terms of both response time and jitter.
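The latency-sensitivity setting is exposed per VM through the vSphere Web Client and the vSphere API. As an illustration only, the sketch below sets it to "high" using pyVmomi; the vCenter address, credentials, and VM name are placeholders, and this is not part of the original study.

    # Hedged sketch: set a VM's latency sensitivity to "high" via the vSphere API.
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com",           # placeholder vCenter
                      user="administrator@vsphere.local",
                      pwd="password")
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "latency-sensitive-vm")  # placeholder VM

    spec = vim.vm.ConfigSpec()
    spec.latencySensitivity = vim.LatencySensitivity(level="high")
    vm.ReconfigVM_Task(spec)    # apply the per-VM latency-sensitivity setting
    Disconnect(si)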

The paper explains the major sources of latency increase due to virtualization in vSphere, presents details of how the latency-sensitivity feature improves performance, and includes evaluation results for the feature. It also presents some best practices derived from the performance evaluation.

For more information, please read the full paper: Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5.

 

Simulating different VDI users with View Planner 3.0

VDI benchmarking is hard. What makes it challenging is getting a good representation or simulation of VDI users. If we look closely at typical office users, we see a spectrum: at one end, a user may work with simple Microsoft Office applications at a relatively moderate pace, while at the other end, a user may run CPU-heavy multimedia applications and switch between many applications much faster. We classify the fast user as the power user, or "heavy" user, and the user at the other end of the spectrum as the task worker, or "light" user. In between these two, we define one more category: the "medium" user.

To simulate these different categories of users and to make the job of VDI benchmarking much easier, we have built VMware View Planner 3.0, which simulates a workload representative of the user-initiated operations that take place in a typical VDI environment. The tool simulates typical office applications such as Microsoft PowerPoint, Outlook, and Word, as well as Adobe Reader, the Internet Explorer web browser, multimedia applications, and so on. The tool can be downloaded from: http://www.vmware.com/products/desktop_virtualization/view-planner/overview.html.

If we look at the three categories of VDI users outlined above, one of the main differentiating factors across this gamut of users is how fast they act, which is simulated in View Planner using the concept of "think time". The tool uses the think time parameter to sleep for a randomized interval before starting the next application operation. For the heavy user, the think time is kept very low at 2 seconds, meaning operations happen very quickly: the user switches between applications or performs an operation within an application every 2 seconds on average. The View Planner 3.0 benchmark defines a score, called "VDImark", which is based on this heavy user workload profile. For a medium user, the think time is set to 5 seconds, and for a light user, to 10 seconds. The heavy VDI user also uses a larger screen resolution than the medium or light user. The View Planner settings for each category of user are summarized in the table below; a sketch of the think-time mechanism follows.
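To make the think-time mechanism concrete, here is a minimal sketch of the idea, assuming a simple uniform jitter around the mean think time; View Planner's actual operation mix and randomization are more involved.

    # Minimal sketch of "think time": sleep a randomized interval before each
    # simulated application operation. Mean think times come from the profiles
    # above; the jitter model is an illustrative choice, not View Planner's.
    import random
    import time

    THINK_TIME = {"heavy": 2.0, "medium": 5.0, "light": 10.0}   # seconds (mean)

    def run_profile(profile: str, operations, jitter: float = 0.25) -> None:
        mean = THINK_TIME[profile]
        for op in operations:
            # Sleep around the mean think time, then perform the next operation.
            time.sleep(random.uniform(mean * (1 - jitter), mean * (1 + jitter)))
            op()

    # Example: three placeholder "operations" for a medium user.
    run_profile("medium", [lambda: print("open document"),
                           lambda: print("browse web page"),
                           lambda: print("send email")])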

To show the capability of View Planner 3.0 to determine the sizing of VDI user VMs per host, we ran the flexible mode of View Planner 3.0, which allowed us to create medium and light user workloads (the heavy workload profile already exists) and to understand the user density for different types of VDI users on a given system. The flexible mode will be available soon through the Professional Services Organization (PSO) and to selected partners.

The experimental setup we used to compare these different user profiles is shown below:

In this test, we want to determine how many VMs can be run on the system while each VM performs its heavy, medium, or light user profile. To do this, we need to set a baseline of acceptable performance, which is given by the quality of service (QoS) criteria defined in the View Planner user guide. The number of VMs that passed the QoS score is shown in the chart below.

The chart shows that we can run about 53 VMs for the heavy user (VDImark), 67 VMs for the medium user, and 91 VMs for the light user. So, we could consolidate about 25% more desktops if we used this system to host users with medium workloads instead of heavy workloads, and about 35% more desktops if we hosted users with light workloads instead of medium workloads. It is therefore crucial to fully specify the user profile whenever we talk about user density.
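Those consolidation ratios follow directly from the measured densities; this small calculation reproduces them (the exact results round to roughly the figures quoted above).

    # Consolidation gains implied by the measured VM densities (VMs passing QoS).
    density = {"heavy": 53, "medium": 67, "light": 91}

    medium_vs_heavy = (density["medium"] / density["heavy"] - 1) * 100   # ~26%
    light_vs_medium = (density["light"] / density["medium"] - 1) * 100   # ~36%
    print(f"Medium vs heavy: +{medium_vs_heavy:.0f}%  Light vs medium: +{light_vs_medium:.0f}%")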

In this blog, we demonstrated how we used the View Planner 3.0 flexible mode to run different user profiles and to understand the user density for a system under test. If you have any questions and want to know more about View Planner, you can reach out to the team at viewplanner-info@vmware.com.

IPv6 performance improvements in vSphere 5.5

Many of our customers use IPv6 networks in their datacenters for a variety of reasons. We expect that many more will transition from IPv4 to IPv6 to take advantage of the larger address range and other benefits that IPv6 provides. Keeping this in mind, we have worked on a number of performance enhancements for the way that vSphere 5.5 handles IPv6 network traffic. Some new features that we have implemented include:

• TCP Checksum Offload: For Network Interface Cards (NICs) that support this feature, the computation of the TCP checksum of the IPv6 packet is offloaded to the NIC.

• Software Large Receive Offload (LRO): LRO is a technique of aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed and saving CPU (see the conceptual sketch after this list). Many NICs do not support LRO for IPv6 packets in hardware; for such NICs, we implement LRO in the vSphere network stack.

• Zero-Copy Receive: This feature prevents an unnecessary copy from the packet frame to a memory space in the vSphere network stack. Instead, the frame is processed directly.
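As a purely conceptual illustration of the LRO idea referenced above, the toy function below coalesces consecutive in-order segments of the same flow into one larger buffer. It is not vSphere's implementation.

    # Toy illustration of LRO: merge consecutive, in-order TCP segments from the
    # same flow into one larger buffer before handing it up the stack.
    from collections import namedtuple

    Segment = namedtuple("Segment", "flow seq payload")

    def lro_aggregate(segments, max_bytes=65535):
        """Merge consecutive, in-order segments of the same flow."""
        merged = []
        for seg in segments:
            if (merged
                    and merged[-1].flow == seg.flow
                    and merged[-1].seq + len(merged[-1].payload) == seg.seq
                    and len(merged[-1].payload) + len(seg.payload) <= max_bytes):
                prev = merged.pop()
                merged.append(Segment(prev.flow, prev.seq, prev.payload + seg.payload))
            else:
                merged.append(seg)
        return merged

    # Four 1460-byte segments of one flow collapse into a single 5840-byte buffer.
    segs = [Segment("flowA", i * 1460, b"x" * 1460) for i in range(4)]
    print([len(s.payload) for s in lro_aggregate(segs)])   # -> [5840]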

vSphere 5.1 offers the same features, but only for IPv4. So, in vSphere 5.1, services such as vMotion, NFS, and Fault Tolerance had lower bandwidth in IPv6 networks when compared to IPv4 networks. vSphere 5.5 solves that problem—it delivers similar performance over both IPv4 and IPv6 networks. A seamless transition from IPv4 to IPv6 is now possible.

Next, we demonstrate the performance of vMotion over a 40Gb/s network connecting two vSphere hosts. We also demonstrate the performance of networking traffic between two virtual machines created on the vSphere hosts.

System Configuration
We set up a test environment with the following specifications:

• Servers: 2 Dell PowerEdge R720 servers running vSphere 5.5.
• CPUs: 2 x Intel Xeon E5-2667 @ 2.90 GHz (12 cores total).
• Memory: 64GB, spread across two NUMA nodes (32GB each).
• Networking: 1 dual-port Intel 10GbE and 1 dual-port Broadcom 10GbE adapter, placed on separate PCI Gen-2 x8 lanes in both machines. We thus had 40Gb/s of network connectivity between the two vSphere hosts.
• Virtual Machine for vMotion: 1 VM running Red Hat Enterprise Linux Server 6.3 assigned 2 virtual CPUs (vCPUs) and 48GB memory. We migrate this VM between the two vSphere hosts.
• Virtual Machines for networking tests: A pair of VMs running Red Hat Enterprise Linux server 6.3, assigned 4 vCPUs and 16GB memory, on each host. We use these VMs to test the performance of networking traffic between two VMs.

We configured each vSphere host with four vSwitches, each vSwitch having one 10GbE uplink port. We created one VMkernel adapter on each vSwitch. Each VMkernel adapter was configured on the same subnet. The MTU of the NICs was set to the default of 1500 bytes. We enabled each VMkernel adapter for vMotion, which allowed vMotion traffic to use the 40Gb/s network connectivity. We created four VMXNET3 virtual adapters on the pair of virtual machines used for networking tests.

Methodology
In order to demonstrate the performance for vMotion, we simulated a heavy memory usage footprint in the virtual machine. The memory-intensive program allocated 48GB memory in the virtual machine and touched one byte in each page in an infinite loop. We migrated this virtual machine between the two vSphere hosts over the 40Gb/s network. We used net-stats to monitor network throughput and CPU utilization on the sending and receiving systems. We also noted the bandwidth achieved in each pre-copy iteration of vMotion from VMkernel logs.
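The memory-touching workload is straightforward to reproduce; the sketch below is an illustrative Python version, assuming a 4KB page size (the original test presumably used a small native program, and the 48GB size must fit within the VM's memory).

    # Sketch of the memory-intensive workload: allocate a large buffer and touch
    # one byte in every page in an infinite loop, keeping the guest memory hot
    # during the migration. SIZE_GB is a placeholder sized to the VM.
    import mmap

    SIZE_GB = 48
    PAGE = 4096                               # assumed guest page size in bytes
    buf = mmap.mmap(-1, SIZE_GB * 1024**3)    # anonymous memory mapping

    while True:
        for offset in range(0, len(buf), PAGE):
            buf[offset] = (buf[offset] + 1) % 256   # dirty one byte per page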

To demonstrate the performance of virtual machine networking traffic, we used Netperf 2.60 to generate traffic from one virtual machine to the other. We created two connections for each virtual adapter. Each connection generated traffic using the TCP_STREAM workload, with a 16KB message size and a 256KB socket buffer size. As in the previous experiment, we used net-stats to monitor network throughput and CPU utilization.
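For illustration, a single netperf TCP_STREAM connection with those settings behaves roughly like the following sketch; the peer address and port are placeholders, and this is not the actual netperf tool.

    # Rough analogue of one TCP_STREAM connection as configured above:
    # 16KB messages over a socket with a 256KB send buffer.
    import socket

    PEER = ("192.168.1.20", 5001)      # placeholder address of the receiving VM
    MSG = b"\0" * (16 * 1024)          # 16KB message size

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 256 * 1024)   # 256KB buffer
    s.connect(PEER)
    while True:
        s.sendall(MSG)                 # stream 16KB messages as fast as possible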

Results
Figures 1 and 2 show, for IPv4 and IPv6 traffic, the network throughput and CPU utilization data that we collected over the 40-second duration of the migration. After the guest memory is staged for migration, vMotion begins iterations of pre-copying the memory contents from the source vSphere host to the destination vSphere host.

In the first iteration, the destination vSphere host needs to allocate pages for the virtual machine. Network throughput is below the available bandwidth in this stage as vMotion bandwidth usage is throttled by the memory allocation on the destination host. The average network bandwidth during this phase was 1897 megabytes per second (MB/s) for IPv4 and 1866MB/s for IPv6.

After the first iteration, the source vSphere host sends the delta of changed pages. During this phase, the average network bandwidth was 4301MB/s with IPv4 and 4091MB/s with IPv6.

The peak bandwidth measured with net-stats was 34.5Gb/s for IPv4 and 32.9Gb/s for IPv6. The CPU utilization of both systems followed a similar trend for both IPv4 and IPv6. Note also that vMotion is very CPU intensive on the receiving vSphere host, and a high CPU clock speed is necessary to achieve high bandwidths. The results are summarized in Table 1. In all, migration of the virtual machine completed in about 40 seconds, regardless of whether IPv4 or IPv6 connectivity was used.
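As a rough cross-check of those numbers against the 40Gb/s of available capacity, the reported pre-copy rates can be converted from MB/s to Gb/s (treating MB as 10^6 bytes):

    # Convert the reported vMotion pre-copy bandwidths (MB/s) to Gb/s.
    rates_mbps = {"IPv4 first iteration": 1897, "IPv6 first iteration": 1866,
                  "IPv4 delta phase": 4301, "IPv6 delta phase": 4091}

    for phase, mbps in rates_mbps.items():
        print(f"{phase}: {mbps * 8 / 1000:.1f} Gb/s")   # e.g. 4301 MB/s -> ~34.4 Gb/s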

Figure 1. vMotion over an IPv4 network
Figure 2. vMotion over an IPv6 network

Table 1. vMotion results—IPv4 versus IPv6

The results for virtual machine networking traffic are in Table 2. While the throughput with IPv6 is about 2.5% lower, the CPU utilization is the same on both the sending and receiving sides.

Table 2. Virtual machine networking results—IPv4 versus IPv6

Thanks to a number of IPv6 enhancements added to vSphere 5.5, migrations with vMotion over IPv6 networks occur at speeds within 5% of those over IPv4 networks. For virtual machine networking performance, the throughput of IPv6 is within 2.5% of IPv4. In addition, testing shows that we can drive bandwidth close to the 40Gb/s link speed with both protocols. Combined, this allows for a seamless transition from IPv4 to IPv6 with little performance impact.

VMware vSphere 5.5 Host Power Management (HPM) saves more power and improves performance

VMware recently released a white paper on the power and performance improvements in the Host Power Management (HPM) feature in vSphere 5.5. With the new improvements in HPM, you can save significant power and still get good performance in many common scenarios. The paper shows that power savings of up to 20% can be achieved with vSphere 5.5, as measured using industry-standard SPEC benchmarks. The paper also describes some best practices to follow when using HPM.

One experiment indicates that you can get around a 10% increase in performance with vSphere 5.5 when deep C-states (deeper than C1/halt, for example C3 and C6) are enabled along with turbo mode.

For more interesting results and data, please read the full paper.

Note: HPM works at the level of a single host, as opposed to DPM, which works across a cluster of hosts.