By Yu-Chen Lin and Chien-Chia Chen
VMware’s Performance Engineering team develops and operates many critical performance engineering services across the VMware product portfolio. This blog shares how we improve engineering efficiency by leveraging VMware products, using our performance analytics data infrastructure as an example.
Our Story
Data is the foundation of all kinds of performance engineering work. Performance optimization, analysis, regression tracking, monitoring and alerting, anomaly detection, and sizing recommendations all depend on data, and thus the performance, scalability, and availability of the data infrastructure greatly determine the engineering efficiency of the Performance Engineering team. This critical data infrastructure not only stores terabytes of performance data generated by in-house testbeds every year, but also handles a large amount of telemetry data from production deployments.
The primary type of performance data we store is numerical time-series data; that is, numerical readings of tens of thousands of performance metrics measured periodically. The Performance Engineering team uses a popular open-source time-series database to store this data, and it shares a well-known limitation of virtually all time-series databases: they scale poorly as the cardinality of the data grows. Cardinality can be understood as the number of unique time series, that is, the number of unique combinations of metric name and tag values.
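To make the notion of cardinality concrete, here is an illustrative Python sketch (not from our infrastructure; the metric and tag names are hypothetical) that counts unique series as unique combinations of metric name and tag values:

```python
# Illustrative only: cardinality = number of unique time series,
# i.e. unique combinations of metric name and tag values.
from itertools import product

metrics = ["cpu.usage", "disk.read_latency"]        # hypothetical metric names
hosts = [f"host-{i}" for i in range(1000)]          # hypothetical "host" tag values
testbeds = [f"testbed-{i}" for i in range(10)]      # hypothetical "testbed" tag values

# Each unique (metric, host, testbed) combination is one time series.
series = set(product(metrics, hosts, testbeds))
print(len(series))  # 2 * 1000 * 10 = 20,000 unique series, so a cardinality of 20,000
```

Even this small example shows how quickly cardinality multiplies as new metrics and tag values are added, which is why long-lived performance data grows to hundreds of millions of series.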
There are two common solutions to address time-series data with high cardinality:
- First, delete old data by setting retention policies.
- Second, spread them across multiple instances by sharding.
Setting retention policies is not an option for the Performance Engineering team, because some performance work relies on data that is extremely difficult to reproduce. The team therefore keeps data for a very long period of time, which results in extremely high and rapidly growing cardinality.
Figure 1 below shows that from 2019 to 2021, the cardinality of our data infrastructure grew from 6 million to 160 million, which is beyond the scale that nearly all open-source time-series databases can handle. Sharding is our last resort, but it requires a nontrivial amount of engineering effort because our open-source time-series database does not natively support sharding.
Originally, the data infrastructure was hosted on a single machine with a single direct-attached, 2-terabyte NVMe SSD. This hardware configuration worked well initially, while the cardinality of our data infrastructure was well below 10 million. However, it soon began to hit the cardinality scaling issues described above, and it also presented several operational challenges. The biggest was the risk of disk failure: there was zero redundancy in the initial configuration, and that single NVMe SSD held the only copy of all the critical performance data for the entire team. The direct-attached configuration also prevented us from upgrading or reconfiguring the machine without downtime. Still, the team's greatest pain was the poor performance caused by high cardinality, which severely degraded the database at least once every month and resulted in hours of very poor ingestion and query performance. Fortunately, this is where VMware vSAN came to the rescue: it solved our operational challenges by providing a highly available and elastic distributed storage solution.
However, we still needed to size and tune vSAN properly so it could meet the I/O demand of our databases. Databases are generally optimized for throughput, and thus they issue very large IOs. During certain operations, such as backups, databases can also issue a large number of concurrent IOs, also known as outstanding IOs (OIOs). All of these require a certain level of application and guest operating system tuning specifically for vSAN.
Based on our vSAN performance expertise, IO sizes are best kept no larger than 64 kilobytes. For high OIO counts, the maximum OIOs should be limited to 128, but for IOs of 64 kilobytes or larger, 32 or fewer OIOs may yield lower IO latency, depending on the performance of the underlying physical disks. Some applications offer options to control their IO sizes, but in our case we needed to modify the application code to adjust them. For OIOs, most guest operating systems offer options at the guest block device level to limit the maximum queue depth, which is equivalent to the OIO count, and this is the approach we adopted. With these careful tunings, vSAN performs very well and meets all the performance requirements of our database workloads.
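As a concrete illustration of the guest-side queue depth tuning described above, here is a minimal Python sketch, assuming a Linux guest and a hypothetical block device name (vda); the exact device name and value depend on your setup and are not taken from our environment:

```python
# Minimal sketch: cap the block-layer queue depth of a guest device via sysfs.
# nr_requests bounds the outstanding IOs (OIOs) the device can queue.
from pathlib import Path

DEVICE = "vda"          # hypothetical device; use the device backing your database volume
MAX_QUEUE_DEPTH = 32    # lower OIO cap suited to large (>= 64 KB) writes, per the guidance above

sysfs_path = Path(f"/sys/block/{DEVICE}/queue/nr_requests")
print(f"current nr_requests: {sysfs_path.read_text().strip()}")
sysfs_path.write_text(str(MAX_QUEUE_DEPTH))   # requires root privileges
print(f"new nr_requests: {sysfs_path.read_text().strip()}")
```

Note that a sysfs setting like this does not persist across reboots; a udev rule or boot-time script is the usual way to make it permanent.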
After moving the performance analytics data infrastructure to vSAN, the first noticeable improvement is that the databases no longer degrade, even for our largest time-series database instance, which now has a cardinality of over 180 million. Figure 2 below shows the normalized daily ingestion throughput: there were at least four severe database degradations before migrating to vSAN (the red line and red dots). Since the databases moved to vSAN, they no longer degrade periodically, and the only two cases where ingestion throughput dropped were unrelated to the databases (the blue line and blue dots); they were caused by networking issues within VMware’s data center, which slowed down ingestion.
Table 1 below summarizes the reductions in ingestion latency after migrating to vSAN. The median latency is reduced by 30%, which means half of the ingestion requests now complete 30% faster. vSAN also gives us much lower high-percentile latency: P90, P99, and P99.99 latencies are reduced by 40%, 43%, and 79%, respectively. Beyond these significant performance improvements, vSAN also eases data infrastructure operations thanks to its redundancy and high-availability features, which allow maintenance work to be done on a rolling basis with zero downtime.
| | Median | P90 | P99 | P99.99 |
|---|---|---|---|---|
| %-Difference | -30% | -40% | -43% | -79% |
Table 1. Reductions in ingestion latency (pre-migration vs. post-migration)
Best Practices for Database Workloads on vSAN
This blog demonstrates how VMware vSAN helps the VMware Performance Engineering team by providing superior performance and features that ease infrastructure operations. We would also like to share our findings by highlighting the key best practices when running database workloads on vSAN:
- Sizing is important
Before migrating an existing workload to vSAN, it is important to first quantify the workload's peak resource requirements and size vSAN accordingly. For vSAN and the databases to perform well, they need enough compute (CPUs and memory) as well as storage.
- Insufficient CPU resources may result in CPU contention that significantly degrades vSAN performance and slows down any CPU-intensive database operations.
- Insufficient physical memory may cause swapping that will slow down IOs.
- Insufficient storage may result in high vSAN storage capacity utilization, which can trigger proactive rebalancing that will slow down guest IOs.
- Limit writes to be no larger than 64 kilobytes
vSAN has certain properties that work best with 64-kilobyte IOs; IOs larger than 64 kilobytes do not always yield higher throughput. Different applications offer different options to tune their IO sizes. If vSAN does not meet the throughput requirement of your application and the application offers such options, tuning its IO sizes may help (see the sketch after this list).
- Limit OIOs to be no larger than 128
vSAN also has certain properties that can incur additional latency when OIO is high. As a guideline, more than 128 OIOs may result in higher latency, and large writes may perform better with 32 or even fewer OIOs. Some applications offer options to control their OIOs. If your application does not have such options, most guest operating systems provide ways to do so; for example, Linux has an nr_requests option at the block device level to limit the queue depth of a block device.
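For the 64-kilobyte IO-size best practice above, the sketch below illustrates what application-level IO size control can look like. It is a hypothetical Python example, not the actual change we made to our database code: it simply splits each write into chunks of at most 64 kilobytes so that individual write IOs stay within the recommended size range.

```python
# Hypothetical sketch of application-level IO size control: issue writes of at
# most 64 KB each. The function name and file layout are illustrative only.
import os

CHUNK_SIZE = 64 * 1024  # 64 KB upper bound on individual write IOs

def write_in_chunks(path: str, payload: bytes) -> None:
    """Write payload to path, issuing writes of at most CHUNK_SIZE bytes each."""
    # buffering=0 opens the file unbuffered, so each write maps to its own syscall
    with open(path, "wb", buffering=0) as f:
        for offset in range(0, len(payload), CHUNK_SIZE):
            f.write(payload[offset:offset + CHUNK_SIZE])
        os.fsync(f.fileno())  # make sure the data actually reaches stable storage

if __name__ == "__main__":
    write_in_chunks("/tmp/example.bin", os.urandom(1024 * 1024))  # 1 MB split into 16 IOs
```

How much control an application really has over on-disk IO sizes also depends on the guest file system and page cache, so this pattern is a starting point rather than a guarantee.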
The right sizing and tuning depend heavily on the application. Our experience above offers a general guideline and demonstrates vSAN's capability for hosting a large-scale data infrastructure.