
Category Archives: Web/Tech

vCenter Server 6.5 High Availability Performance and Best Practices

High availability (HA) services are important in any platform, and VMware vCenter Server® is no exception. As the main administrative and management tool of vSphere, it is a critical element that requires HA. vCenter Server HA (VCHA) delivers protection against software and hardware failures with excellent performance for common customer scenarios, as shown in this paper.

Much work has gone into the high availability feature of VMware vCenter Server® 6.5 to ensure that this service and its operations minimally affect the performance of your vCenter Server and vSphere hosts. We thoroughly tested VCHA with a benchmark that simulates common vCenter Server activities in both regular and worst-case scenarios. The result is solid data and a comprehensive performance characterization in terms of:

  • Performance of VCHA failover/recovery time objective (RTO): In case of a failure, vCenter Server HA (VCHA) provides failover/RTO such that users can continue with their work in less than 2 minutes through API clients and less than 4 minutes through UI clients. While failover/RTO depends on the vCenter Server configuration and the inventory size, in our tests it is within the target limit, which is 5 minutes.
  • Performance of enabling VCHA: We observed that enabling VCHA takes around 4 to 9 minutes, depending on the vCenter Server configuration and the inventory size.
  • VCHA overhead: When VCHA is enabled, there is no significant performance impact on vCenter Server under typical load conditions. We observed a noticeable but small impact of VCHA when the vCenter Server was under extreme load; however, it is unlikely that customers would generate that much load on the vCenter Server for extended periods.
  • Performance impact of vCenter Server statistics level: With an increasing statistics level, vCenter Server produces less throughput, as expected. When VCHA is enabled for various statistics levels, we observe a noticeable but small impact of 3% to 9% on throughput.
  • Performance impact of a private network: VCHA is designed to support LAN networks with up to 10 ms latency between VCHA nodes. However, this comes with a performance penalty. We study the performance impact of the private network in detail and provide further guidelines about how to configure VCHA for the best performance.
  • External Platform Services Controller (PSC) vs Embedded PSC: We study VCHA performance comparing these two deployment modes and observe a minimal difference between them.

Throughout the paper, our findings show that vCenter Server HA performs well under a variety of circumstances. In addition to the performance study results, the paper describes the VCHA architecture and includes some useful performance best practices for getting the most from VCHA.

For the full paper, see VMware vCenter Server High Availability Performance and Best Practices.

vSphere 6.5 Update Manager Performance and Best Practices

vSphere Update Manager (VUM) is the patch management tool for VMware vSphere 6.5. IT administrators can use VUM to patch and upgrade ESXi hosts, VMware Tools, virtual hardware, and virtual appliances.

In the vSphere 6.5 release, VUM has been integrated into the vCenter Server appliance (VCSA) for the Linux platform. The integration eliminates remote data transfers between VUM and VCSA, and greatly simplifies the VUM deployment process. As a result, certain data-driven tasks achieve a considerable performance improvement over VUM for the Windows platform, as illustrated in the following figure:

[Figure: performance comparison of VUM on the VCSA vs. VUM for Windows]

To present the new performance characteristics of VUM in vSphere 6.5, we have published a paper. In particular, the paper describes the following topics:

  • VUM server deployment
  • VUM operations including scan host, scan VM, stage host, remediate host, and remediate VM
  • Remediation concurrency
  • Resource consumption
  • Running VUM operations concurrently with vCenter Server provisioning operations

The paper also offers a number of performance tips and best practices for using VUM during patch maintenance. For the full details, read vSphere Update Manager Performance and Best Practices.

Whitepaper on vSphere Virtual Machine Encryption Performance

vSphere 6.5 introduces a feature called vSphere VM encryption.  When this feature is enabled for a VM, vSphere protects the VM data by encrypting all its contents.  Encryption is done both for already existing data and for newly written data. Whenever the VM data is read, it is decrypted within ESXi before being served to the VM.  Because of this, vSphere VM encryption can have a performance impact on application I/O and the ESXi host CPU usage.

We have published a whitepaper, VMware vSphere Virtual Machine Encryption Performance, to quantify this performance impact.  We focus on synthetic I/O performance on VMs, as well as VM provisioning operations like clone, snapshot creation, and power on.  From analysis of our experiment results, we see that while VM encryption consumes more CPU resources for encryption and decryption, its impact on I/O performance is minimal when using enterprise-class SSD or VMware vSAN storage.  However, when using ultra-high performance storage like locally attached NVMe drives capable of handling up to 750,000 IOPS, the minor increase in per-I/O latency due to encryption or decryption adds up quickly to have an impact on IOPS.

For more detailed information and data, please refer to the whitepaper.

vSphere 6.5 DRS Performance – A New White Paper

VMware recently announced the general availability of vSphere 6.5. Among the many new features in this release are some DRS specific ones like predictive DRS, and network-aware DRS. In vSphere 6.5, DRS also comes with a host of performance improvements like the all-new VM initial placement and the faster and more effective maintenance mode operation.

To learn more, see our new white paper on the new features and performance improvements of DRS in vSphere 6.5. Here are some highlights from the paper:

[Figures: highlights from the vSphere 6.5 DRS white paper]

Expandable Reservation for Resource Pools

One of the questions I am often asked about resource pools (RPs) concerns ‘expandable reservation’. What is expandable reservation, and why should you care about it? Although it sounds intuitive, it is easily misunderstood.

To put it simply, a resource pool with ‘expandable reservation’ can expand its reservation by asking its parent for more resources.

The need to expand the reservation comes from an increase in the reservation demand of the pool's child objects (VMs or resource pools). If the parent resource pool is itself short of resources, the parent expands its reservation by asking the grandparent for resources.

Let us try to understand this with a simple example. Consider the RP hierarchy shown in the figure below. If RP-4 has to expand its reservation, it requests resources from its parent, RP-3; if RP-3 in turn has to expand its reservation, it eventually requests resources from the Root-RP.
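To make these mechanics concrete, here is a minimal sketch, in Python, of the admission-control logic that expandable reservation implies. This is an illustrative model, not vSphere code; the pool names and sizes mirror the example in this post, and the real admission control accounts for more than this simplified version shows.

# Illustrative model of expandable-reservation admission control.
# Not vSphere code; pool sizes are examples, and the real logic is richer.

class ResourcePool:
    def __init__(self, name, reservation_mb, expandable, parent=None):
        self.name = name
        self.reservation_mb = reservation_mb  # configured reservation
        self.used_mb = 0                      # reservation handed out so far
        self.expandable = expandable
        self.parent = parent

    def reserve(self, amount_mb):
        """Try to reserve amount_mb in this pool for a child VM or pool."""
        available = self.reservation_mb - self.used_mb
        if amount_mb <= available:
            self.used_mb += amount_mb
            return True
        # Not enough local reservation: a fixed pool fails here, while an
        # expandable pool asks its parent for the shortfall.
        if self.expandable and self.parent is not None:
            shortfall = amount_mb - available
            if self.parent.reserve(shortfall):
                self.used_mb += amount_mb     # borrowed capacity counts as used
                return True
        return False

root = ResourcePool("Root-RP", 4096, expandable=False)
rp3 = ResourcePool("RP-3", 1024, expandable=True, parent=root)
rp4 = ResourcePool("RP-4", 300, expandable=True, parent=rp3)

# A power-on that needs 321MB of reservation (VM overhead memory included):
# RP-4 covers 300MB itself and borrows the remaining 21MB from RP-3.
print(rp4.reserve(321))  # True; with expandable=False this would fail

Note how the used reservation of RP-4 ends up above its configured 300MB, which is exactly what the screenshots later in this post show.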

[Figure: resource pool hierarchy from the Root-RP down to RP-4]

Resource pool with fixed reservation

A resource pool with a fixed reservation cannot expand its reservation, so any operation that needs reservation will fail in case of a resource shortage.

In the example above, all the RPs have expandable reservation set. I then changed the reservation of RP-4 from “Expandable” to a fixed value of 300MB, as shown below.

[Figure: changing the RP-4 reservation from “Expandable” to a fixed 300MB]

Now the “Resource settings” for RP-4 are as follows:

[Figure: RP-4 resource settings]

We can see that, although the reservation used by all the VMs is zero, the RP-level used reservation is shown as 215MB. This reservation comes from the VMs’ overhead memory (as computed by ESXi).

At this point, I added two more VMs to the resource pool and powered on one of them. The RP-level used reservation then changed as shown below.

[Figure: RP-level used reservation after adding and powering on VMs]

When I powered on another VM in the RP, the operation failed with the error shown below.

[Figure: VM power-on failure due to insufficient memory reservation]

This happened because there wasn’t enough reserved memory available to accommodate the new VM (its overhead memory must be reserved). I then changed the reservation type back to “Expandable”, and the power-on succeeded.

[Figure: RP-4 used reservation after switching back to “Expandable”]

As we can see from the above figure, although the configured reservation shows 300MB, the used reservation is 321MB; the extra 21MB is the result of expandable reservation.

This is just one example of how expandable reservation can be useful. If the VMs inside the RP have their own reservations, then those will also be accounted for in the used reservation of the RP.

Hence, it is always advisable to keep the RP reservation “Expandable”, since it can accommodate any increase in reservation demand by asking the parent resource pool for more resources.

In the next post, we will look at how these reservations for a resource pool are different from the reservations for a VM and when to use them.

Latency Sensitive VMs and vSphere DRS

Some applications are inherently highly latency sensitive and cannot afford long vMotion times. VMs running such applications are termed ‘latency sensitive’. Because these VMs consume resources very actively, vMotion of such VMs is often slow, and they require special care during cluster load balancing.

You can tag a VM as latency sensitive by setting the VM option through the vSphere web client, as shown below (VM → Edit Settings → VM Options → Advanced).

[Figure: VM Options → Advanced, Latency Sensitivity setting]

By default, the latency sensitivity value of a VM is set to ‘normal’. Changing it to ‘high’ makes the VM latency sensitive. (The other levels, ‘medium’ and ‘low’, are experimental right now.) Once the value is set to ‘high’, 100% of the VM’s configured memory must be reserved; it is also recommended to reserve 100% of its CPU. This white paper talks more about the VM latency sensitivity feature in vSphere. If you prefer to script the change, see the sketch below.
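The following sketch shows how this setting might be applied with pyVmomi (the vSphere API Python bindings). The hostname, credentials, VM name, and per-vCPU clock speed are placeholder assumptions; verify the property names against your pyVmomi version before relying on this.

# Hedged pyVmomi sketch: tag a VM as latency sensitive and reserve 100%
# of its memory and CPU. Connection details and the VM name are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",        # placeholder vCenter
                  user="administrator@vsphere.local",
                  pwd="password")                     # may need an sslContext
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "VMZero-Latency-Sensitive-1")

spec = vim.vm.ConfigSpec()
# Equivalent of VM -> Edit Settings -> VM Options -> Advanced -> 'high'.
spec.latencySensitivity = vim.LatencySensitivity(level="high")
# Reserve 100% of configured memory (required) and 100% of CPU (recommended).
spec.memoryAllocation = vim.ResourceAllocationInfo(
    reservation=vm.config.hardware.memoryMB)
spec.cpuAllocation = vim.ResourceAllocationInfo(
    reservation=vm.config.hardware.numCPU * 2500)    # assumes ~2.5GHz per vCPU

vm.ReconfigVM_Task(spec=spec)
Disconnect(si)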

DRS support

VMware vSphere DRS provides support for handling such special VMs. If a VM is part of a DRS cluster, tagging it as latency sensitive creates a VM-Host soft affinity rule, which ensures that DRS will not move the VM unless it is absolutely necessary. For example, in scenarios where the cluster is over-utilized, all the soft rules are dropped and such VMs can be moved.

To show how this option works, we ran a simple experiment with a four-host DRS cluster running a latency sensitive VM (10.156.231.165:VMZero-Latency-Sensitive-1) on one of its hosts (10.156.231.165).

[Figure: cluster CPU load before load balancing]

As we can see from the screenshot, the CPU usage of host ‘10.156.231.165’ is higher than that of the other hosts, and the cluster load is not balanced. So DRS migrates VMs away from the highly utilized host (10.156.231.165) to distribute the load.

Since the latency sensitive VM is a heavy consumer of resources, it is the best possible candidate for migration, as moving it distributes the load in one shot. So DRS migrated the latency sensitive VM to a different host in order to distribute the load.

[Figure: DRS migration list including the latency sensitive VM]

Then we put the cluster back in its original state and set the VM latency sensitivity value to ‘high’ using VM options (as mentioned earlier), along with 100% memory and CPU reservations. This time, due to the associated soft-affinity rule, DRS completely avoided the latency sensitive VM; it migrated other VMs from the same host to distribute the load.

[Figure: DRS migration list avoiding the latency sensitive VM]

Things to note:

  • 100% memory reservation for the latency sensitive VM is a must. Without the memory reservation, vMotion will fail; if the VM is powered off, it cannot be powered on until the reservation is set.
  • Since DRS uses a soft-affinity rule, the cluster might sometimes become imbalanced because of these VMs.
  • If multiple VMs are latency sensitive, spread them across hosts before tagging them as latency sensitive. This avoids over-utilization of hosts and results in better resource distribution.

Understanding vSphere DRS Performance – A White Paper

VMware vSphere Distributed Resource Scheduler (DRS) is responsible for the placement of virtual machines and the balancing of resources in a cluster. The key driver for DRS is VM/application happiness, which it achieves through effective VM placement and efficient load balancing. We have published a new white paper that explains how DRS works in basic scenarios and how it can be tuned for specific scenarios.

The white paper talks about the factors that influence DRS decisions and provides some useful insights into different parameters that can be tuned in specific scenarios to make DRS more effective. It also explains how to monitor DRS to better understand its behavior.

It covers DRS behavior in specific scenarios with case studies, including:

  •  VM Consumed vs. Active Memory – How it impacts DRS behavior.
  •  Impact of VM overrides on cluster balance.
  •  Prerequisite moves during initial placement.
  •  Using shares to prioritize cluster resources.

The paper explains the factors that affect DRS behavior and helps you understand how DRS does what it does. This knowledge, along with monitoring and troubleshooting tips, including real case studies, will help you tune DRS clusters for optimum performance.

Machine Learning on VMware vSphere 6 with NVIDIA GPUs

by Uday Kurkure, Lan Vu, and Hari Sivaraman

Machine learning is an exciting area of technology that allows computers to learn without being explicitly programmed, that is, in the way a person might learn. This technology is increasingly applied in many areas like health science, finance, and intelligent systems, among others.

In recent years, the emergence of deep learning and the improvement of accelerators like GPUs have brought tremendous adoption of machine learning applications into broader and deeper aspects of our lives. Some application areas include facial recognition in images, medical diagnosis from MRIs, robotics, automobile safety, and text and speech recognition.

Machine learning workloads have also become a critical part of cloud computing. In cloud environments based on vSphere, you can deploy a machine learning workload yourself using GPUs via VMware DirectPath I/O or vGPU technology.

GPUs reduce the time it takes for a machine learning or deep learning algorithm to learn (known as the training time) from hours to minutes. In a series of blogs, we will present the performance results of running machine learning benchmarks on VMware vSphere using NVIDIA GPUs.

Episode 1: Performance Results of Machine Learning with DirectPath I/O and NVIDIA GPUs

In this episode, we present the performance results of running machine learning benchmarks on VMware vSphere with NVIDIA GPUs in DirectPath I/O mode and in GRID virtual GPU (vGPU) mode.

Training Time Reduction from Hours to Minutes

Training time is the performance metric used in supervised machine learning: it is the amount of time a computer takes to learn how to solve the given problem. In supervised machine learning, the computer is given data in which the answer can be found, so supervised learning infers a model from the available, labelled training data.

Our first machine learning benchmark is a simple demo model in the TensorFlow library. The model classifies handwritten digits from the MNIST dataset. Each digit is a handwritten number that is centered within a consistently sized grayscale bitmap. The MNIST database of handwritten digits contains 60,000 training examples and has a test set of 10,000 examples.
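The benchmark code itself is not included in this post, but a minimal MNIST classifier of this kind can be sketched with TensorFlow’s Keras API. This is an illustrative stand-in for the simple demo model, not the exact benchmark; the layer sizes and hyperparameters are assumptions.

# Minimal MNIST classifier in TensorFlow/Keras; an illustrative stand-in
# for the demo model described above, not the exact benchmark code.
import tensorflow as tf

# 60,000 training and 10,000 test examples of 28x28 grayscale digits.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # bitmap -> 784 inputs
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"), # one class per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training time is the metric of interest; TensorFlow uses a GPU
# automatically if one is visible to the guest OS.
model.fit(x_train, y_train, epochs=5, batch_size=128)
print(model.evaluate(x_test, y_test))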

First, we compare training times for the model using two different virtual machine configurations:

  • NVIDIA GRID Configuration (vg1c12m60GB): 1 vGPU, 12 vCPUs, 60GB memory, 96GB of SSD storage, CentOS 7.2
  • No GPU configuration (g0c12m60GB): No GPU, 12 vCPUs, 60GB memory, 96GB of SSD storage, CentOS 7.2
MNIST                                      vg1c12m60GB (1 vGPU)   g0c12m60GB (No GPU)
Normalized Training Time (w.r.t. vg1c12)   1.0                    10.06
CPU Utilization                            8%                     43%

The above table shows that the vGPU reduces the training time by a factor of 10, while CPU utilization drops by a factor of about 5. See the graphs below.

[Figure: normalized MNIST training time, vGPU vs. no GPU]

[Figure: CPU utilization during MNIST training, vGPU vs. no GPU]

Scaling from One GPU to Four GPUs

This machine learning benchmark trains a model on the CIFAR-10 image dataset.

We use the metric of images per second (images/sec), the number of images processed per second while training the model, to compare the different configurations as we scale from a single GPU to 4 GPUs.

Our host has two NVIDIA M60 cards. Each card has 2 GPUs. We present the performance results for scaling up from 1 GPU to 4 GPUs.

You can configure the GPUs in two modes:

  • DirectPath I/O passthrough mode: The host can be configured with 1 to 4 GPUs in DirectPath I/O passthrough mode, and a virtual machine running on the host has access to those 1 to 4 GPUs in passthrough mode.
  • GRID vGPU mode: For machine learning workloads, each VM should be configured with the highest-profile vGPU. Since we have M60 GPUs, we configured the VMs with vGPU type M60-8q; the M60-8q profile implies one VM per GPU.

DirectPath I/O

First we focus on DirectPath I/O passthrough mode as we scale from 1 GPU to 4 GPUs.

CIFAR-10                                            g1c48m60GB (1 GPU)   g2c48m60GB (2 GPUs)   g4c48m60GB (4 GPUs)
Normalized Images/sec in Thousands (w.r.t. 1 GPU)   1.0                  2.04                  3.74
CPU Utilization                                     25%                  44%                   71%

As the above table shows, the number of images processed per second improves almost linearly with the number of GPUs on the host. With the 1-GPU configuration as the baseline (normalized to 1.0, about 1,000 images/sec), 2 GPUs handle about double that, and 4 GPUs handle nearly 4,000 images/sec.

[Figure: CIFAR-10 images/sec scaling from 1 to 4 GPUs]

Host CPU utilization also increases linearly, as shown in the following graph.

[Figure: CIFAR-10 host CPU utilization scaling from 1 to 4 GPUs]

Single GPU DirectPath I/O vs GRID vGPU mode

Now we present a comparison of performance results for DirectPath I/O and GRID vGPU modes.

Since each VM can have only one vGPU in GRID vGPU mode, we first compare the results for a 1-GPU configuration in DirectPath I/O mode against vGPU mode.

 

MNIST (Lower Is Better)      g1c48m60GB (DirectPath I/O)   vg1c48m60GB (GRID vGPU)
Normalized Training Times    1.0                           1.05

 

CIFAR-10 (Higher Is Better)  g1c48m60GB (DirectPath I/O)   vg1c48m60GB (GRID vGPU)
Normalized Images/sec        1.0                           0.87

 

The above tables show that a one-GPU configuration in DirectPath I/O mode and in GRID vGPU mode are very close in performance. We suggest you use GRID vGPU mode because it offers the benefits of virtualization.

Multi-GPU DirectPath I/O vs Multi-VM DirectPath I/O vs Multi-VMs in GRID vGPU mode

Now we move on to multi-GPU performance results for DirectPath I/O and GRID vGPU modes. In DirectPath I/O mode, a VM can be configured with all the GPUs on the host; in our case, we configured the VM with 4 GPUs. In GRID vGPU mode, each VM can have at most 1 GPU. Therefore, we compare the results of 4 VMs, each running the same job, against a single VM using 4 GPUs in DirectPath I/O mode.

CIFAR-10                                   g4c48m60GB (DirectPath I/O)   g1c12m16GB (DirectPath I/O, 4 VMs)   vg1c12m16GB (GRID vGPU, 4 VMs)
Normalized Images/sec (Higher Is Better)   1.0                           0.98                                 0.92
CPU Utilization                            71%                           68%                                  69%

[Figure: CIFAR-10 normalized images/sec, multi-GPU vs. multi-VM configurations]

[Figure: CIFAR-10 CPU utilization, multi-GPU vs. multi-VM configurations]

The multi-GPU DirectPath I/O configuration performs best. If your workload requires low latency or a short training time, you should use multi-GPU DirectPath I/O mode; however, other virtual machines will not be able to use the GPUs on the host at the same time. If you can tolerate longer latencies or training times, we recommend a 1-GPU configuration: GRID vGPU mode enables the benefits of virtualization, flexibility and elasticity.

Takeaways

  • GPUs bring the training times of machine learning algorithms down from hours to minutes.
  • You can use NVIDIA GPUs in two modes in the VMware vSphere environment for machine learning applications:
    • DirectPath I/O passthrough mode
    • GRID vGPU mode
  • You should use GRID vGPU mode with the highest vGPU profile. The highest vGPU profile implies 1 VM/GPU, thus giving the virtual machine full access to the entire GPU.
  • For a 1-GPU configuration, the performance of the machine learning applications in GRID vGPU mode is comparable to DirectPath I/O.
  • For the shortest training time, you should use a multi-GPU configuration in DirectPath I/O mode.
  • For running multiple machine learning jobs simultaneously, you should use GRID vGPU mode. This configuration offers a higher consolidation of virtual machines and leverages the flexibility and elasticity benefits of VMware virtualization.


Configuration Details

Host Configuration

Model: Dell PowerEdge R730
Processor Type: Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz
CPU Cores: 24 CPUs, each @ 2.499GHz
Processor Sockets: 2
Cores per Socket: 12
Logical Processors: 48
Hyperthreading: Active
Memory: 768GB
Storage: Local SSD (1.5TB), storage arrays, local hard disks
GPUs: 2x NVIDIA Tesla M60

Software Configuration

ESXi: 6.0.0, 3500742
Guest OS: CentOS Linux release 7.2.1511 (Core)
CUDA Driver: 7.5
CUDA Runtime: 7.5

VM Configurations

VM             vCPUs  Memory  Storage       GPUs  Guest OS    Mode
g0xc12m60GB    12     60GB    1x96GB (SSD)  0     CentOS 7.2  No GPU
g1xc12m60GB    12     60GB    1x96GB (SSD)  1     CentOS 7.2  DirectPath I/O
g2xc48m60GB    48     60GB    1x96GB (SSD)  2     CentOS 7.2  DirectPath I/O
g4xc48m60GB    48     60GB    1x96GB (SSD)  4     CentOS 7.2  DirectPath I/O
vg1xc12m60GB   12     60GB    1x96GB (SSD)  1     CentOS 7.2  GRID vGPU
g1c12m16GB     12     16GB    1x96GB (SSD)  1     CentOS 7.2  DirectPath I/O
vg1c12m16GB    12     16GB    1x96GB (SSD)  1     CentOS 7.2  GRID vGPU

New White Paper: Best Practices for Optimizing Big Data Performance on vSphere 6

A new white paper is available showing how to best deploy and configure vSphere for Big Data applications such as Hadoop and Spark. Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark.

The best practices were tested on a Dell 12-server cluster, with Hadoop installed on vSphere as well as on bare metal. Workloads for both Hadoop (TeraSort and TestDFSIO) and Spark (Support Vector Machines and Logistic Regression) were run on the cluster. The virtualized cluster outperformed the bare metal cluster by 5-10% for all MapReduce and Spark workloads with the exception of one Spark workload, which ran at parity. All workloads showed excellent scaling from 5 to 10 worker servers and from smaller to larger dataset sizes.

Here are the results for the TeraSort suite:

[Figure: TeraSort Suite Performance]

And for Spark Support Vector Machines:

[Figure: Spark Support Vector Machine Performance]

Here are the best practices cited in this paper:

  • Reserve about 5-6% of total server memory for ESXi; use the remainder for the virtual machines.
  • Create 1 or more virtual machines per NUMA node.
  • Limit the number of disks per DataNode to maximize the utilization of each disk – 4 to 6 is a good starting point.
  • Use eager-zeroed thick VMDKs along with the ext4 filesystem inside the guest.
  • Use the VMware Paravirtual SCSI (pvscsi) adapter for disk controllers; use all 4 virtual SCSI controllers available in vSphere 6.0.
  • Use the vmxnet3 network driver; configure virtual switches with MTU=9000 for jumbo frames.
  • Configure the guest operating system for Hadoop performance, including enabling jumbo IP frames, reducing swappiness, and disabling transparent hugepage compaction (see the sketch after this list).
  • Place Hadoop master roles, ZooKeeper, and journal nodes on 3 virtual machines for optimum performance and to enable high availability.
  • Dedicate the worker nodes to run only the HDFS DataNode, YARN NodeManager, and Spark Executor roles.
  • Use the Hadoop rack awareness feature to place virtual machines belonging to the same physical host in the same rack for optimized HDFS block placement.
  • Run the Hive Metastore in a separate database.
  • Set the YARN cluster container memory and vcores to slightly overcommit both resources.
  • Adjust the task memory and vcore requirement to optimize the number of maps and reduces for each application.
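As an illustration of the guest OS tuning bullet above, here is a hedged sketch that applies the swappiness, transparent hugepage, and jumbo frame settings on a Linux guest. The sysfs paths, the swappiness value, and the NIC name are assumptions that vary by distribution; verify them against the paper and your OS before use.

# Hedged sketch of the guest OS tuning described above. Paths, values,
# and the NIC name are assumptions; run as root and verify per distro.
import subprocess

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

write("/proc/sys/vm/swappiness", "0")  # reduce swappiness
# Disable transparent hugepage compaction (path differs on some distros).
write("/sys/kernel/mm/transparent_hugepage/defrag", "never")
# Enable jumbo IP frames on the Hadoop network interface (name assumed).
subprocess.run(["ip", "link", "set", "dev", "eth0", "mtu", "9000"],
               check=True)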

All details are in the paper, Big Data Performance on vSphere 6: Best Practices for Optimizing Virtualized Big Data Applications.

 

How to correctly test the performance of Virtual SAN 6.2 deduplication feature

In VMware Virtual SAN 6.2, we introduced several features highly requested by customers, such as deduplication and compression. An overview of these features can be found in the blog Virtual SAN 6.2 – Deduplication And Compression Deep Dive.

The deduplication feature adds the most benefit in an all-flash Virtual SAN environment: while SSDs are more expensive than spinning disks, the cost is amortized because more workloads can fit on the deduplicated SSDs. Therefore, our performance testing was performed on an all-flash Virtual SAN cluster with deduplication enabled.

When testing the performance of the deduplication feature for Virtual SAN, we observed the following:

  • Unexpected deduplication ratio
  • High device read latency in the capacity tier, even though the SSD is perfectly fine

In this blog, we discuss the reason behind these two issues and share our testing experience.

When we tested the performance of Virtual SAN, we decided to use two common tools, IOBlazer and Iometer:

  1. We used IOBlazer to populate the disks. We configured IOBlazer to run 100% large sequential writes. This was to make sure all the blocks were allocated before testing any read-related workload. Some people prefer to zero out all the blocks using the dd command, which has a similar effect.
  2. We then ran an Iometer workload. We set the read percentage, randomness, I/O size, number of outstanding I/Os, and so on.

We found, however, that there were two issues with the above procedure when testing the deduplication feature:

  • Iometer does not support configuring the I/O content; in other words, we could not use Iometer to generate I/Os with various deduplication ratios in step 2.
  • We should not have populated the disks using IOBlazer or dd in step 1, because each utility pollutes the disks with random data or zeros, both of which yield the wrong deduplication ratio for later tests.

To address these issues, we decided to use the Flexible I/O (FIO) benchmark to both populate the disks and run the tests. FIO allowed us to specify the deduplication ratio. By following these steps, we were able to successfully test the deduplication feature in Virtual SAN 6.2:

  1. Run FIO with 100% 4KB sequential writes at the given deduplication and compression ratio. This populates the disks with the desired deduplication and compression ratio.
  2. Run FIO with the specified read/write percentage, I/O size, randomness, number of outstanding I/Os, and deduplication and compression ratio.

Below is a sample configuration file for FIO. We modified the parameters for different tests.

[global]
ioengine=libaio ; async I/O engine for Linux
direct=1 ; bypass the guest page cache
thread ; use threads rather than processes
group_reporting ; aggregate statistics across jobs
; Test name: 4K_rd70_rand100_dedup0_compr0
runtime=3600
time_based
readwrite=randrw ; mixed random reads and writes
iodepth=8 ; outstanding I/Os per job
rwmixread=70 ; 70% reads, 30% writes
blocksize=4096 ; 4KB I/Os, matching the Virtual SAN dedup chunk size
randrepeat=0 ; per-job random seeds, so data differs across disks
blockalign=4096 ; keep I/Os 4KB-aligned
buffer_compress_percentage=0 ; compressibility of the written data
dedupe_percentage=0 ; fraction of identical (deduplicable) buffers

[job 1]
filename=/dev/sdb
filesize=25G

[job 2]
filename=/dev/sdc
filesize=25G

If steps 1 and 2 are not performed properly, the results can be unexpected. To illustrate this, we take two issues we encountered as examples.

Issue #1: The SSD showed high read latency, but the SSD hardware had no issues

We observed high device read latency with FIO micro-benchmarks. The high read latency occurred because we were issuing a large number of concurrent I/Os (outstanding I/Os, also known as OIOs) to the same Logical Block Address (LBA), or a small range of LBAs, on the SSD. This can happen with any type of deduplication solution, regardless of the storage vendor.

To resolve this issue, we first ran a test to learn the behavior of the SSD device. The table below shows the read latency to one address with an increasing number of outstanding I/Os.

4KB reads from the same LBA:

OIOs   Latency
1      0.12 ms
16     1.51 ms
32     3.06 ms
64     6.07 ms
128    12.08 ms
256    12.68 ms

When we issued multiple OIOs to a single 4KB block, those I/Os were serialized to the single channel inside the SSD device that was connected to that offset. In other words, we lost the benefit of the SSD’s internal parallelism across multiple channels. The device latency rose as we increased the number of OIOs: a high OIO count to the same LBA (or a small range of LBAs) caused high device read latency.
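To reproduce this measurement, a small driver script can sweep the outstanding I/O count while confining reads to a single 4KB region. The sketch below shells out to fio with standard command-line options; the device path is a placeholder, and the workload issues reads only.

# Sketch: measure read latency to a single 4KB region at increasing OIO
# counts using fio. The device path is a placeholder; run as root.
import subprocess

DEVICE = "/dev/sdb"  # placeholder capacity-tier SSD

for oio in [1, 16, 32, 64, 128, 256]:
    # size=4k confines all reads to the same 4KB region at offset 0, so
    # every outstanding I/O targets the same LBA range.
    subprocess.run(["fio", "--name=same-lba-read",
                    f"--filename={DEVICE}",
                    "--ioengine=libaio", "--direct=1",
                    "--rw=read", "--bs=4k", "--size=4k",
                    f"--iodepth={oio}",
                    "--time_based", "--runtime=30",
                    "--group_reporting"],
                   check=True)  # read latency appears in fio's output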

In the extreme case where we prepared the disk by zeroing out all the blocks, all the data was deduplicated to one block. As a result, subsequent read I/Os were issued to the same device address, which caused the high device read latency discussed above.

Figures 1 and 2 (statistics from the Virtual SAN Observer tool) show sample results from our test. (Even though the screenshots say HDD, our test was on an all-flash Virtual SAN cluster; the “HDDs” in the graphs are actually SSDs used as capacity-tier devices.) As can be seen, inside one disk group, one capacity-tier SSD (naa.55cd2e404ba2ce71 in Figure 1, naa.55cd2e404ba535b7 in Figure 2) consistently shows higher read latency than the other capacity-tier SSDs. This is because we zeroed out the data blocks before running the test, so later in the test a large number of outstanding read I/Os were issued to a single address on that SSD.

Note: In Figures 1 and 2, where there are no units specified, the unit is milliseconds. Where an “m” is specified, the unit is microseconds. Where “k” is specified, the unit is thousands.


Figure 1. Sample test 1 showed the first capacity-tier SSD (“HDD”)  to have up to 3 milliseconds latency, which is much higher than the next two capacity-tier SSDs (also labelled “HDD”), which show just below 150 microseconds of latency.


Figure 2. Sample test 2 is similar to sample test 1. The bottom capacity-tier SSD (“HDD”) shows up to 6 milliseconds of latency, whereas the first two show slightly over 100 microseconds.

Issue #2: Deduplication ratio was not what we set

Because Virtual SAN distributes multiple virtual disks across its datastore, it is hard to determine the exact deduplication ratio of the data that the workload generates. In the FIO configuration file, we set dedupe_percentage to the desired value; however, in the testing system, a couple of factors affect the actual deduplication ratio reported by Virtual SAN.

  • I/Os from different virtual disks (vmdk files) can contain duplicate data. In the FIO configuration file, if the randrepeat parameter is set to 1, FIO uses the same random seed for all the disks. Although the data pattern obeys the dedupe_percentage set by the user within each vmdk, the data is then heavily duplicated across vmdks. Because those vmdks are placed on the same Virtual SAN datastore, the datastore sees more duplicate data than specified.
  • The I/O size used when preparing the disks also affects the deduplication ratio. Virtual SAN currently uses a 4KB chunk size as the unit for calculating the deduplication ratio, so if the disks are prepared with a non-4KB I/O size, Virtual SAN can see a different deduplication ratio. Likewise, if the I/O is not aligned to 4KB (the blockalign parameter), Virtual SAN can observe a different deduplication ratio. The sketch below illustrates the 4KB-chunk calculation.
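To see why the chunk size and alignment matter, here is a small illustrative calculator (not Virtual SAN code) that computes a deduplication ratio the way a fixed 4KB-chunk scheme would, by hashing each 4KB chunk. Data written with other I/O sizes or at unaligned offsets falls into different 4KB chunks and therefore deduplicates differently.

# Illustrative only: compute a deduplication ratio over fixed 4KB chunks,
# mimicking the unit Virtual SAN uses. Not Virtual SAN's actual code.
import hashlib
import os

CHUNK = 4096

def dedup_ratio(data: bytes) -> float:
    """Fraction of 4KB chunks that duplicate an earlier chunk."""
    seen, total, dupes = set(), 0, 0
    for off in range(0, len(data), CHUNK):
        digest = hashlib.sha1(data[off:off + CHUNK]).digest()
        total += 1
        if digest in seen:
            dupes += 1
        else:
            seen.add(digest)
    return dupes / total if total else 0.0

# Zero-filled data (e.g., disks prepared with dd) dedupes almost entirely:
print(dedup_ratio(bytes(CHUNK * 100)))  # 0.99 (99 of 100 chunks duplicate)

# The same unique 8KB buffer written twice is 50% duplicate in 4KB chunks:
buf = os.urandom(2 * CHUNK)
print(dedup_ratio(buf + buf))           # 0.5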

Figure 3 (below) shows a sample test in which we ran FIO with a 0% deduplication ratio. Due to the issues described above, Virtual SAN reports a deduplication ratio of about 80% (shown in blue).


Figure 3. The blue line shows a deduplication percentage of about 80%, even though we set deduplication to be 0%.

To avoid these problems, we suggest that performance testers use a 4KB (and aligned) I/O size and set randrepeat to 0 when preparing the disks, in order to get the desired deduplication ratio. Note that Virtual SAN properly handles any type of I/O configuration; the purpose of this blog is to explain the possible discrepancy between the FIO-specified dedupe_percentage and the Virtual SAN-reported deduplication ratio when performance testers use different I/O configurations to evaluate the Virtual SAN datastore.


Figure 4. No blue line is shown, indicating the expected deduplication ratio of 0%.