Exploring FAST Cache Performance Using VMmark 2.1.1

A system’s performance is often limited by the access time of its hard disk drive (HDD). Solid-state drives (SSDs), also known as Enterprise Flash Drives (EFDs), tout a superior performance profile to HDDs. In our previous comparison of EFD and HDD technologies using VMmark 2.1, we showed that EFD reads were on average four times faster than HDD reads, while EFD and HDD write speeds were comparable. However, EFDs are more costly per gigabyte.

Many vendors have attempted to address this issue using tiered storage technologies. Here, we tested the performance benefits of EMC’s FAST Cache storage array feature, which merges the strengths of both technologies. FAST Cache is an EFD-based read/write storage cache that supplements the array’s DRAM cache by giving frequently accessed data priority on the high-performing EFDs. We used VMmark 2, a multi-host virtualization benchmark, to quantify the performance benefits of FAST Cache. For more details, see the overview, the release notes for VMmark 2.1, and the release notes for VMmark 2.1.1. VMmark 2 is an ideal tool for testing FAST Cache performance in virtualized datacenters because its varied workloads and bursty I/O patterns model the demands of the datacenter. We found that FAST Cache produced remarkable improvements in datacenter capacity and storage access latencies: with the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements.

FAST Cache
FAST Cache is a feature of EMC’s storage systems that tracks frequently accessed data on disk, promotes that data into an array-wide EFD cache to take advantage of Flash I/O access speeds, and then writes it back to disk when the data is superseded in importance. FAST Cache optimizes the use of EFD storage. In most workloads, only a small percentage of data is frequently accessed; this is referred to as the ‘working set.’ An EFD-based cache allows the data in the working set to take advantage of the performance characteristics of EFDs while the rest of the data stays on lower-cost HDDs. Relevant data is rapidly promoted into the cache in 64 KB pages, and a least-recently-used algorithm decides which pages to evict and write back to disk.
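
To make the promote-and-evict behavior concrete, here is a minimal Python sketch of an EFD cache that tracks 64 KB pages and evicts the least-recently-used page. The promotion threshold and the in-memory bookkeeping are assumptions made for illustration only; the VNX’s actual promotion policy is internal to the array.

```python
from collections import OrderedDict

PAGE_SIZE = 64 * 1024      # FAST Cache tracks and promotes 64 KB pages
PROMOTE_THRESHOLD = 3      # assumed hit count before promotion (illustrative)

class FastCacheModel:
    """Toy model of an EFD read/write cache with LRU write-back eviction."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.cache = OrderedDict()   # page -> dirty flag, kept in LRU order
        self.hit_counts = {}         # access counts for not-yet-promoted pages

    def access(self, offset, write=False):
        page = offset // PAGE_SIZE
        if page in self.cache:
            # Cache hit: served from EFD; refresh the page's LRU position.
            self.cache.move_to_end(page)
            if write:
                self.cache[page] = True   # mark the page dirty
            return "efd"
        # Cache miss: served from HDD; count accesses toward promotion.
        self.hit_counts[page] = self.hit_counts.get(page, 0) + 1
        if self.hit_counts[page] >= PROMOTE_THRESHOLD:
            self._promote(page, dirty=write)
        return "hdd"

    def _promote(self, page, dirty):
        if len(self.cache) >= self.capacity:
            # Evict the least-recently-used page, writing it back if dirty.
            old_page, was_dirty = self.cache.popitem(last=False)
            if was_dirty:
                pass  # write old_page's data back to HDD here
        del self.hit_counts[page]
        self.cache[page] = dirty
```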

The benefit achieved with FAST Cache depends on the workload’s I/O profile. As with most caches, FAST Cache shows the most benefit for I/O with a high locality of reference, such as database indices and reference tables. It is least beneficial to workloads with sequential I/O patterns, such as log files, or with large I/O sizes, because these may never access the same 64 KB page multiple times, so the FAST Cache never becomes populated with their data.

Configuration

  • Systems Under Test: Four Dell PowerEdge R310 Servers
  • CPUs (per server): One Quad-Core Intel® Xeon® X3460 @ 2.8 GHz, Hyper-Threading enabled
  • Memory (per server): 32 GB DDR3 ECC @ 800 MHz
  • Storage Array: EMC VNX5500
  • FAST Cache configurations:
    366 GB FAST Cache, 8 EFDs, RAID 1
    92 GB FAST Cache, 2 EFDs, RAID 1
    FAST Cache turned off
  • LUN configurations:
    20 HDDs, 10K RPM, grouped into 3 LUNs of 8, 8, and 4 HDDs each
    11 HDDs, 10K RPM, grouped into 3 LUNs of 4, 4, and 3 HDDs each
  • Hypervisor: ESXi 5.0.0
  • Virtualization Management: VMware vCenter Server 5.0
  • VMmark version: 2.1.1

Methodology
We used VMmark 2 to investigate several different factors relating to FAST Cache. We wanted to measure the performance benefit of adding FAST Cache to a VMmark 2 environment and to observe how that benefit scaled with the size of the cache. We tested with FAST Cache turned off and with two different FAST Cache sizes, built from two EFDs and eight EFDs in RAID 1 to create caches with 92 GB and 366 GB of usable space, respectively. FAST Cache was configured according to best practices to ensure its performance was not limited by array bus bandwidth. After the FAST Cache was created, it was warmed up by repeating VMmark 2 runs until scores showed less than 3% variability between runs.
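
The warm-up procedure amounts to a simple convergence check. Here is a minimal sketch of such a stopping rule, assuming variability is measured between consecutive run scores; the text states only “less than 3% variability between runs,” so the exact comparison is an assumption.

```python
def cache_is_warm(scores, tolerance=0.03):
    """Stop warming up once consecutive VMmark scores differ by < 3%.

    `scores` holds the scores from repeated warm-up runs, oldest first.
    Comparing consecutive runs is an assumption; the source states only
    that variability between runs fell below 3%.
    """
    if len(scores) < 2:
        return False
    previous, latest = scores[-2], scores[-1]
    return abs(latest - previous) / previous < tolerance
```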

We also wanted to examine whether FAST Cache could reduce the hardware requirements of our tests. As processors and other system components have increased in capacity and speed, there has been mounting pressure for corresponding increases in storage performance. RAID groups of HDDs have been one answer to these demands, as RAID arrays provide performance and reliability benefits over individual disks. In typical RAID configurations, performance increases nearly linearly as disks are added to the RAID group. However, adding disks to increase storage access speed can leave HDD capacity far greater than required and therefore underutilized. FAST Cache should allow us to reduce the number of HDDs required for RAID performance benefits, which also reduces the cluster’s total power, cooling, and space requirements and thereby its cost. FAST Cache services the bulk of the workloads’ I/O operations at high speeds, so it is acceptable to service the remainder at lower speeds and use only as many HDDs as needed for storage capacity rather than performance.

To test whether an environment with FAST Cache and fewer disks could perform as well as an environment without FAST Cache but with more disks, we tested performance with two different disk configurations. Workloads were tested on a set of 20 HDDs and then on a set of 11 HDDs, in both cases grouped into three LUNs. Each LUN was in a distinct RAID 0 group. Given the performance characteristics of RAID 0, we expected the 20 HDD configuration to outperform the 11 HDD configuration. The placement of workloads onto LUNs was meant to model a naïve environment with a suboptimal storage setup. Two LUNs held workload tile data, and the third, smaller LUN served as the destination for the VM Deploy and Storage vMotion workloads. The first LUN held VMs from the first and third tiles, and the second LUN held VMs from the second and fourth tiles. Running VMmark 2 with more than one tile per LUN is atypical of our best practices for the benchmark; it created a severe bottleneck at the disk, which was meant to simulate the types of storage performance issues we sometimes see in customer environments.

All VMmark 2 tests were conducted on a cluster of four identically configured entry-level Dell PowerEdge R310 servers running ESXi 5.0. All components in the environment besides the FAST Cache configuration and the number of HDDs remained unchanged during testing.

Results
To characterize cluster performance at multiple load levels, we increased the number of tiles until the cluster reached saturation, defined as the point at which a run failed to meet Quality of Service (QoS) requirements. Scaling out the number of tiles until saturation allowed us to determine the maximum VMmark 2 load the cluster could support and to compare performance at each level of load for each cache and storage configuration. All data points are the mean of three tests in each configuration. Scaling data was generated by normalizing every score to the lowest passing score, which was the 1 tile run with FAST Cache turned off on 20 HDDs.
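
A minimal sketch of this normalization follows, assuming results are kept as a mapping from (cache configuration, HDD count, tile count) to the three per-run scores; that data layout is hypothetical, chosen only to illustrate the calculation.

```python
def normalize(results, baseline=("cache off", 20, 1)):
    """Normalize mean VMmark 2 scores to the lowest passing score.

    `results` maps (cache_config, hdd_count, tiles) to the three raw
    scores for that configuration; the key layout is an assumption.
    The baseline is the 1 tile, FAST Cache off, 20 HDD run from the text.
    """
    means = {cfg: sum(runs) / len(runs) for cfg, runs in results.items()}
    base = means[baseline]
    return {cfg: mean / base for cfg, mean in means.items()}
```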

VMmark 2.1.1 Scaling With and Without FAST Cache

With FAST Cache turned off, the 20 HDD LUNs reached saturation at 2 tiles, and the 11 HDD LUNs were unable to support even 1 tile of load. Because all VMs for each tile were placed on the same LUN, a 1 tile run used one LUN, consisting of only four of the 11 HDDs or eight of the 20 HDDs, and four HDDs were insufficient to provide the required QoS for even 1 tile. When FAST Cache was enabled, both the 11 HDD and 20 HDD configurations supported 4 tiles. This is a remarkable improvement; with the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements. Even at lower load levels, the same system with FAST Cache achieved greater throughput, increasing the VMmark score by 26% at 1 tile and 31% at 2 tiles. With FAST Cache enabled, the configuration with 11 HDDs performed equivalently to one with 20 HDDs until the system approached saturation.

With FAST Cache enabled, the system supported twice as much load on little more than half as many disks. The results show that an environment with a 92 GB FAST Cache greatly outperformed an HDD-only environment containing 82% more disks. At 4 tiles with FAST Cache enabled, the cluster’s CPU utilization was approaching saturation, reaching an average of 84%, but storage was not yet a bottleneck.

In our tests, performance improved only modestly as we increased the FAST Cache size from 92 GB to 366 GB and the number of HDDs from 11 to 20.

VMmark 2.1.1 Scaling with FAST Cache

All configurations scaled very similarly from 1 to 3 tiles, with only minor differences appearing, primarily between the 92 GB and 366 GB FAST Cache. Only at the highest load level did performance begin to diverge. Predictably, the largest cache configurations showed the best performance at 4 tiles, followed by the smaller cache configurations. To determine whether this performance falloff was directly attributable to the cache size and number of HDDs, we needed to know whether FAST Cache was performing to capacity.

Below are the FAST Cache and DRAM cache hit percentages for read and write operations at the 4 tile load. On average, the I/O mix in our VMmark testing was 24% reads and 76% writes.

Total Cache Hits at 4 Tiles
Read and Write Cache Hits at 4 Tiles

With the 366 GB FAST Cache, nearly all reads and writes hit either the DRAM cache or the FAST Cache. In these cases, the number of backing disks did not affect the score because the disks were rarely accessed; at this cache size, all frequently accessed data fit into the FAST Cache. With the 92 GB FAST Cache, however, the cache hit percentage decreased to 96.5% and 92.1% for the 11 HDD and 20 HDD configurations, respectively, indicating that the entire working set could no longer fit into the 92 GB FAST Cache. The 11 HDD configuration began to show decreased performance relative to 20 HDDs because, although only 3.5% of total I/O operations were going to disk, the increase in disk latency was large enough to reduce throughput and affect the VMmark score. Despite this, a 92 GB FAST Cache was still sufficient to provide VMmark performance that met QoS requirements. The higher read hit percentage in the 11 HDD configuration reflected this reduced throughput: lower throughput resulted in a smaller working set and an accordingly higher read hit percentage.
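
A small miss rate can still dominate average service time because HDD latency is an order of magnitude higher than EFD latency. The sketch below illustrates this with assumed latencies of 0.5 ms for a cache hit and 10 ms for an HDD access; these figures are illustrative, not measurements from the VNX5500 under test.

```python
def effective_latency_ms(hit_rate, t_cache_ms=0.5, t_hdd_ms=10.0):
    """Average service time as a hit-rate-weighted mix of cache and disk.

    The 0.5 ms cache and 10 ms HDD latencies are assumed values chosen
    only to illustrate the effect, not measured on the array under test.
    """
    return hit_rate * t_cache_ms + (1 - hit_rate) * t_hdd_ms

print(effective_latency_ms(1.000))  # 0.50 ms when everything hits the cache
print(effective_latency_ms(0.965))  # ~0.83 ms: a 3.5% miss rate raises
                                    # average latency by roughly two thirds
```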

Overall, FAST Cache did an excellent job of identifying the working set. Although only 8% of the 1.09 TB dataset could fit in the 92 GB cache at any one time, at least 92% of I/O requests were hitting the cache.

Scaling FAST Cache gave us a sense of the working set size of the VMmark benchmark. Because performance with the 92 GB FAST Cache showed a knee at 3 tiles, the working set at 3 tiles is likely less than 92 GB and the working set at 4 tiles slightly greater. Knowing the approximate working set size per tile would let us select the minimum FAST Cache size required to hold the entire working set, even if we scaled the benchmark to an arbitrary number of tiles on a different cluster.
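
Assuming the working set grows roughly linearly with tile count (a reasonable assumption, since tiles are identical sets of workloads, though not something the data proves), the knee between 3 and 4 tiles brackets the per-tile working set, and a rough cache-sizing rule follows:

```python
CACHE_GB = 92

# The knee between 3 and 4 tiles brackets the per-tile working set,
# assuming the working set grows linearly with the number of tiles.
lower_gb = CACHE_GB / 4   # ~23 GB: four tiles no longer fit in 92 GB
upper_gb = CACHE_GB / 3   # ~31 GB: three tiles still fit in 92 GB

def min_fast_cache_gb(tiles, per_tile_gb=upper_gb):
    """Conservative FAST Cache size so a given tile count fits entirely."""
    return tiles * per_tile_gb

print(f"per-tile working set: {lower_gb:.0f}-{upper_gb:.0f} GB")
print(f"suggested cache for 6 tiles: ~{min_fast_cache_gb(6):.0f} GB")
```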

The results below show how I/O operations per second and I/O latency were affected by the cache and disk configurations.

I/O Latency at 4 Tiles

Read latency was clearly affected by both FAST Cache size and the number of backing HDDs: latency was highest with only 11 HDDs and the smaller FAST Cache, decreased as HDDs were added, and decreased even further with the larger FAST Cache, where nearly all reads hit the cache. Write latency, by contrast, was relatively constant across configurations, as expected, because in each configuration nearly all writes were served by either the DRAM cache or the FAST Cache.

Summary
It’s clear that we can replace a large number of HDDs with a much smaller number of EFDs and get similar or improved performance: an array with 11 HDDs and FAST Cache outperformed an array with 20 HDDs and no FAST Cache. FAST Cache handles the workloads’ performance requirements, so we need only supply the HDDs necessary for storage capacity rather than performance. This allows us to reduce the number of HDDs and their associated power, space, cooling, and cost.

Tiered storage solutions like FAST Cache make excellent use of EFDs: 92% or more of our I/O operations benefited from Flash-level latencies even though the EFD storage held only 8% of our total data. The increased VMmark scores demonstrate FAST Cache’s ability to pinpoint the most active data remarkably well and, even in a bursty environment, to deliver striking improvements in I/O latency and in the load a cluster can support. Our testing showed that FAST Cache provides Flash-level storage access speeds to the data that needs them most, reduces storage bottlenecks, and increases supported load, making it a highly valuable addition to the datacenter.