As some of you die-hard SAP HANA on Virtual SAN fans know, we have been working behind the scenes testing and validating SAP HANA on Virtual SAN. Well, here is a glimpse of that hard work. Let me insert the official phrases to make everyone happy: SAP does not support HANA on Virtual SAN or any HCI architecture or solution today; VMware is working with SAP to jointly determine the best approach to achieving SAP HANA support for Virtual SAN. Lastly, this is a preview of the upcoming reference architecture for SAP HANA on VMware Virtual SAN All-Flash.

From our testing so far with SAP HANA and other databases such as Oracle and MSSQL, we saw performance figures well suited to supporting the most demanding SAP workloads. Besides meeting the performance requirements of modern business applications and databases, Virtual SAN is now ready for the enterprise: it supports the storage features, such as Storage Policy Based Management, needed to run the most critical data on it.

With VMware SPBM (Storage Policy Based Management), the storage policy is set at the VMDK level on the Virtual SAN datastore. By changing the storage policy, different profiles can be applied to the SAP HANA database VM for different purposes.

[Figure: storage policy settings at the VMDK level on the Virtual SAN datastore]

In this blog, I will cover the performance impact of each of those policy settings under different workloads. Now let's review the configuration we used to test SAP HANA on VMware Virtual SAN:

All-Flash Virtual SAN Specifications

  • Server Specification (per host):
    • 2x Xeon E5-2670 v3 @2.3 GHz 12-core
    • 256GB RAM
    • SSD: 2 x 400GB Solid State Drive (Intel SSDSC2BA40) as Cache SSD
    • SSD: 8 x 400GB Solid State Drive (Intel SSDSC2BX40) as Capacity SSD
    • ESXi version: 6.0 U2
  • SAP HANA Database VM Configuration (the VM was sized as follows):
    • 24 vCPU
    • 230GB RAM
    • Disk0: 100GB
    • Data Disk: 690GB on VMware Paravirtual-1
    • Log Disk: 230GB on VMware Paravirtual-2
    • Backup Disk: 690GB on VMware Paravirtual-3 (if needed)
    • OS: SUSE Linux Enterprise 11 sp3 64bit
    • SAP HANA Database Server: 1.00.112.02
    • HWCCT File System Test Parameters:
      • async_write_submit_active = on
      • async_write_submit_blocks = all
      • async_read_submit = on
      • max_parallel_io_requests = 256
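
The VM disk sizes above follow simple ratios against the 230GB of RAM: data and backup at 3x RAM, log at 1x RAM. The helper below is my own generalization of those numbers for illustration, not an official SAP sizing formula:

```python
def size_hana_disks(ram_gb: int) -> dict:
    """Derive data/log/backup VMDK sizes from VM RAM.

    The ratios mirror the test VM above (230GB RAM -> 690/230/690GB);
    this is an illustrative sketch, not an official SAP sizing rule.
    """
    return {
        "data_gb": 3 * ram_gb,    # data disk: 3x RAM
        "log_gb": 1 * ram_gb,     # log disk: 1x RAM
        "backup_gb": 3 * ram_gb,  # optional backup disk: 3x RAM
    }

print(size_hana_disks(230))
# matches the 690GB data, 230GB log, and 690GB backup disks used here
```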

Preparing to Run SAP HANA on Virtual SAN

Virtual SAN 6.2 introduced new features such as checksum, space efficiency (deduplication and compression), and erasure coding (RAID 5/RAID 6). To validate whether these new features can support SAP HANA, and to evaluate their cost in terms of performance, we deployed one SAP HANA database VM into the Virtual SAN cluster and conducted HWCCT File System Tests using five different Virtual SAN configurations and storage policies.

As shown in the table below, all the storage policies set the stripe width to 8 because there are 8 disk groups in this cluster; increasing the stripe width potentially engages all the disk groups in processing the workload, improving performance.

Test 1a uses traditional thin VMDK provisioning, while Test 1b uses thick-lazy-zeroed VMDK provisioning; in Tests 1c to 1e, the new features of Virtual SAN 6.2 are enabled. For the first four test cases, the storage policies defined were applied to both the data and log disks. In Test 1e, RAID 5 is applied only to the data disk and the log disk remains RAID 1; in that test case, we ran HWCCT against the data disk only, since there is no change to the log disk.
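
The five configurations, as described in the prose, can be summarized in code. The field names below are my own shorthand reconstructed from the text (the table image remains the authoritative source), not actual SPBM rule identifiers:

```python
# Shorthand summary of Tests 1a-1e (field names are illustrative only,
# reconstructed from the prose description above).
POLICIES = {
    "1a": {"provisioning": "thin",  "checksum": False, "dedupe": False, "data_raid": "RAID1", "log_raid": "RAID1"},
    "1b": {"provisioning": "thick", "checksum": False, "dedupe": False, "data_raid": "RAID1", "log_raid": "RAID1"},
    "1c": {"provisioning": "thin",  "checksum": True,  "dedupe": False, "data_raid": "RAID1", "log_raid": "RAID1"},
    "1d": {"provisioning": "thin",  "checksum": False, "dedupe": True,  "data_raid": "RAID1", "log_raid": "RAID1"},
    "1e": {"provisioning": "thin",  "checksum": False, "dedupe": False, "data_raid": "RAID5", "log_raid": "RAID1"},
}

# All five policies use stripe width 8 to engage every disk group.
STRIPE_WIDTH = 8
```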

[Table: Virtual SAN configurations and storage policies for Tests 1a-1e]
For all of the above test cases, the KPIs (Key Performance Indicators) of the HWCCT File System tests were achieved. Now, let's review data from two of the HWCCT test cases for each of those scenarios as a demonstration: 1MB random I/O on the data disk and 4KB sequential I/O on the log disk.

To present the performance differences among different Virtual SAN configurations, the results of Test 1a were set to 100% as the baseline.
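
Normalizing to the Test 1a baseline is simply a percentage against its result. A small sketch of that math, using made-up throughput numbers purely for illustration:

```python
def normalize_to_baseline(results: dict, baseline_key: str = "1a") -> dict:
    """Express each test result as a percentage of the baseline case."""
    base = results[baseline_key]
    return {k: round(100.0 * v / base, 1) for k, v in results.items()}

# Hypothetical throughput figures (MB/s), purely to show the math:
sample = {"1a": 400, "1b": 480, "1c": 300}
print(normalize_to_baseline(sample))  # {'1a': 100.0, '1b': 120.0, '1c': 75.0}
```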

1MB Random I/O on Data disk

First, let’s discuss the write throughput.

From the diagram below, we can easily see that thick-lazy provisioning without any features enabled has the best throughput across these five cases. That is because a thick-provisioned disk reserves its space in storage when it is provisioned, which helps avoid object distribution imbalance, so the workload is evenly distributed across all the disk groups.

The throughput of Tests 1a and 1d is almost identical. Why? Because the same storage policy is used in both test cases; the only difference is that Test 1d had space efficiency (deduplication and compression) enabled. If you're scratching your head, keep in mind that the data set created by the HWCCT File System Test is very small, so the entire workload stays on the cache tier. Deduplication and compression, on the other hand, are executed only when data is destaged from the cache tier to the capacity tier. Therefore, in the case of HWCCT File System testing, turning on deduplication and compression does not impact the results.

The test cases with checksum (1c) and erasure coding (1e) enabled also met the KPI, even though their write throughput results were lower than with those features disabled. This is due in part to the cost of write amplification.
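
The write amplification mentioned above can be made concrete. With FTT=1, RAID 1 turns one guest write into two backend writes, while a small RAID 5 write typically needs a read-modify-write of data plus parity (two reads, two writes). The sketch below uses generic textbook RAID I/O counts, not measured Virtual SAN internals:

```python
# Approximate backend I/Os generated per small guest write (FTT=1).
# These are generic RAID textbook figures, not Virtual SAN measurements.
WRITE_IO_COST = {
    "RAID1": {"reads": 0, "writes": 2},  # mirror: write both replicas
    "RAID5": {"reads": 2, "writes": 2},  # read old data+parity, write new data+parity
}

def write_amplification(raid: str) -> int:
    """Total backend I/Os triggered by one small guest write."""
    cost = WRITE_IO_COST[raid]
    return cost["reads"] + cost["writes"]

for raid in WRITE_IO_COST:
    print(raid, "->", write_amplification(raid), "backend I/Os per write")
```

This is why the RAID 5 data disk in Test 1e pays a larger write penalty than the mirrored configurations, while reads are largely unaffected.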

[Figure: write throughput of 1MB random I/O on the data disk, normalized to Test 1a]

Now we come to the read performance. As shown in the following diagram, Test 1b had the best read performance. For all the other scenarios, the read performance is roughly the same, since they are all based on thin provisioning and the workload stayed on the cache tier because the HWCCT data set is small. Moreover, I/O amplification does not affect a read workload, so the read performance of Tests 1c and 1e is almost the same as, and even slightly better than, the baseline.

[Figure: read throughput of 1MB random I/O on the data disk, normalized to Test 1a]

4KB Sequential I/O on Log disk

From the overwrite latency perspective (the lower, the better), all the Virtual SAN configurations handle 4KB sequential I/O in under 400 microseconds, and the differences among the cases, only around 60 microseconds, are small enough to ignore.

[Figure: overwrite latency of 4KB sequential I/O on the log disk]

For overwrite throughput, since we applied erasure coding only to the data disk and the log disk used the same policy as Test 1a, scenario 1e is not included in this comparison. The difference in overwrite throughput among Tests 1a, 1b, and 1d is within 7%, which again is relatively small. Since smaller-block I/O is less affected by write amplification, the throughput of scenario 1c, which has checksum enabled, is only about 10% below the baseline.

[Figure: overwrite throughput of 4KB sequential I/O on the log disk]

We used the Virtual SAN Performance Service to observe backend performance; the service can be scoped to a specific time range. The graph below shows Virtual SAN backend latency during the 4KB sequential I/O testing; the write latency is consistently below 600 microseconds.

[Figure: Virtual SAN backend write latency during 4KB sequential I/O testing]

It's fairly obvious that Virtual SAN can support SAP HANA even with the new Virtual SAN 6.2 features enabled. However, if you want to turn on the new features and scale out SAP HANA databases in the cluster (for example, three independent SAP HANA VMs, each on a different host), make sure the VMs can meet the HWCCT File System KPIs first.

Performance of SAP HANA Backup and Recovery on Virtual SAN

There's no need to emphasize the importance of backup and recovery for enterprise databases. A routine backup job is expected to impact database performance, so optimizing the configuration to reduce that impact is particularly important.

Recovery doesn’t happen very frequently. However, when you need to recover, time is of the essence. Several factors can determine the proper recovery time objective. For our limited scope, we focused on the different configurations.

We considered the following aspects when comparing different configurations:

  • Impact on database performance
  • Backup time
  • Recovery time

From a datastore perspective, deduplication and compression are enabled on Virtual SAN. An NFS datastore is mounted to each of the ESXi hosts as external storage. Both datastores are considered candidates for the backup destination.

For the SAP HANA VM, one additional 690GB thin-provisioned backup VMDK was added on a dedicated PVSCSI controller. When this VMDK resides on the Virtual SAN datastore, its storage policy serves as another criterion differentiating the test scenarios.

The test scenarios below evaluate the impact on database performance by comparing data execution performance while backups run under different storage configurations; all of the scenario configurations met the HWCCT KPIs. The four test scenarios differ in storage policy as well as in backup VMDK placement.

[Table: storage policies and backup VMDK placement for Tests 2a-2d]

As the first test step, we used scripts to create 48 tables with 10 columns each and insert 8 million rows into each table. The following diagram shows the resource usage after data generation.
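
The table-generation step can be sketched as SQL emitted from a loop. The table and column names below (`LOADTAB_n`, `COL_n`) are placeholders of my own, not the names used in the actual test scripts:

```python
def build_schema_sql(num_tables: int = 48, num_columns: int = 10) -> list:
    """Emit CREATE TABLE statements like the data-generation scripts.

    Table/column names (LOADTAB_n, COL_n) are illustrative placeholders.
    """
    stmts = []
    for t in range(1, num_tables + 1):
        cols = ", ".join(f"COL_{c} INTEGER" for c in range(1, num_columns + 1))
        stmts.append(f"CREATE COLUMN TABLE LOADTAB_{t} ({cols});")
    return stmts

sql = build_schema_sql()
print(len(sql))     # 48 statements, one per table
print(sql[0][:40])  # first CREATE TABLE statement, truncated
```

An insert loop per table (8 million rows each) would follow the same pattern.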

[Figure: resource usage after data generation]

Then we used hdbsql to run a full data backup to the path mounted on the backup VMDK. This command was sent to each SAP HANA database VM as soon as the data execution (finding the maximum value among the 8 million rows of each column) was triggered.
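
Triggering a full backup via hdbsql looks roughly like the invocation below, composed here as a Python string so the pieces are explicit. `BACKUP DATA USING FILE` is standard SAP HANA backup SQL; the instance number, user, and backup path are placeholders, not the values used in the tests:

```python
def hdbsql_backup_cmd(instance: str, user: str, backup_path: str) -> str:
    """Compose an hdbsql call that runs a full data backup.

    All parameter values passed in are illustrative placeholders.
    """
    sql = f"BACKUP DATA USING FILE ('{backup_path}')"
    return f'hdbsql -i {instance} -u {user} "{sql}"'

cmd = hdbsql_backup_cmd("00", "SYSTEM", "/backup/full")
print(cmd)
```

In the tests, a command like this was dispatched to each VM at the moment the data execution started, so backup and query load overlapped.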

[Figure: hdbsql full data backup issued during data execution]

After the data was backed up and the data execution was done, we dropped the tables generated by the scripts in order to test data recovery from the backup. All the data execution, backup, and recovery jobs ran simultaneously in all the SAP HANA VMs.

Single SAP HANA Database VM Backup and Recovery Performance

First, we compare the single-VM scenarios 2a and 2b. Due to the write amplification caused by erasure coding, data execution performance is better, and the impact of the backup is smaller, when backing up to a RAID 1 VMDK than to a RAID 5 VMDK.

[Figure: data execution performance of Tests 2a and 2b during backup]

Also, backup to the VMDK with RAID 1 is about 2.5 times faster than to the VMDK with RAID 5, because data backup is a write-heavy workload. The backup speed of Test 2a was around 322MB/s, and from the Virtual SAN backend perspective, the throughput reached 710MB/s.
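
The gap between the ~322MB/s guest-visible backup speed and the ~710MB/s backend throughput is consistent with RAID 1 (FTT=1) writing two replicas of every block, plus some metadata overhead. A quick sanity check:

```python
frontend_mbps = 322  # backup speed seen by the VM (Test 2a)
backend_mbps = 710   # throughput observed on the Virtual SAN backend
replicas = 2         # RAID 1, FTT=1: every write lands on two replicas

ratio = backend_mbps / frontend_mbps
print(round(ratio, 2))  # ~2.2, close to the 2x expected from mirroring
```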

[Figure: backup speed of Tests 2a and 2b]

With regard to the speed of data recovery, there is almost no difference between 2a and 2b, since data recovery is a read-heavy workload and erasure coding has minimal impact on read performance. Both test cases recovered all the data within five minutes.

Backup and Recovery Performance of Four SAP HANA Database VMs

Secondly, let's compare the test case with the backup VMDK on Virtual SAN against the one with the backup VMDK on the external NFS datastore. We took the average value of the data across all four VMs.

The figure below illustrates these two configurations.

Looking at the average data execution times of 2c and 2d, we can easily see that keeping the backup VMDK on external storage has less impact on database performance, because backing up to the Virtual SAN datastore adds a write-heavy workload to Virtual SAN while it is processing the database workload. However, the backup and recovery speed is then entirely dependent on the performance of that external storage: the average backup speed of scenario 2c (backup on Virtual SAN, RAID 1) is more than 3 times that of scenario 2d, while the recovery time in 2c is about 25% of that in 2d.

[Figure: backup and recovery performance of Tests 2c and 2d]

In short, using Virtual SAN as the backup destination shortens backup and recovery time, but there is a performance impact on the production database. If database performance during backup and recovery windows is a concern, consider external storage as the backup destination.

However, if no suitable external storage for backup and recovery is available, placing the backup VMDK on Virtual SAN shortens the backup and recovery window, which is a valid alternative way of designing the data backup and recovery architecture.

Summary

In conclusion, VMware Virtual SAN is a great fit for SAP HANA. The performance results prove that Virtual SAN, even with the new 6.2 features enabled, can handle the workload. Moreover, Virtual SAN can also deliver a rapid backup and recovery platform for SAP HANA while still servicing the production database. Stay tuned for the upcoming comprehensive reference architecture.