Microsoft SQL Server Workloads on VMware vSAN
When you are thinking about designing, deploying, or migrating SQL server workloads into a vSAN cluster; it is important to understand not only the SQL workloads, but also what configurations are appropriate within a vSAN cluster in order to support such workloads. Not all SQL server workloads are equal, some call for high performance, some for large storage capacity, and others for both.
The following recommendations aim to help in selecting the right vSAN configuration for the different SQL Server workloads.
This common set of recommendations is relevant to any SQL Server workloads hosted on vSAN.
- Include an additional host (N+1) to the sizing results, where N = minimum number of hosts needed. For example, the minimum requirement for a vSAN cluster is to have at least three hosts, the recommendation, in this case, would be to have four or more hosts in this vSAN cluster. This additional capacity enables automatic remediation in the event of a drive, disk group, or host failure.
- Ensure that the network infrastructure used for vSAN network traffic is robust enough to sustain the planned workload: it is recommended a minimum of 10Gb of dedicated network bandwidth for vSAN network traffic. This is a requirement for All-Flash vSAN clusters. Redundant network connections are also highly recommended.
- While a single disk group per host is fully supported, it is recommended to have at least two disk groups per host, or more. Multiple disk groups can improve throughput in many cases and it provides better resilience to certain drive failure scenarios.
- vSAN Services on vSAN cluster level:
- vSAN Performance service provides performance metrics, and enables vROps or other monitoring tools to collect vSAN based performance data and efficiently troubleshoot the vSAN deployment. The performance service is enabled by default starting with the vSAN version 6.7. For lower vSAN versions consider enabling this service.
- Encryption. For high security demands, data at rest encryption (FIPS 140-2 compliant) might be enabled as a cluster-wide setting. Our storage stress validation demonstrated the performance of overall IOPS is very similar between encrypted and non-encrypted runs, but it will add CPU overhead to encrypt and decrypt data. We highly recommend using physical CPUs supporting the AES-NI offloading and verifying that the AES-NI feature is enabled in the server BIOS. For an end-to-end encryption solution, consider using IPSEC for channel encryption. If only selected SQL Server databases should be encrypted, SQL Server native encryption options can be used instead.
- vSAN 6.7 expands the functionality of the vSAN iSCSI Target service to provide the SCSI-3 persistent reservations support for shared disks for windows failover cluster if using the SQL Server FCI, high availability mode is a requirement. The vSAN iSCSI Target service at the vSAN cluster level should be enabled for this purpose.
- Configure and set proper SPBM for SQL Server data:
- Failures to tolerate: Ensure that an option with at least “1 (one) failure to tolerate” is selected. Do not use the option: “No data redundancy”.
- Number of disk stripes per object: Use default one (1), and consider spreading the data between multiple VMDKs attached to multiple vSCSI controllers.
- IOPS limit per object: vSAN 6.2 and later versions have a QoS feature that sets a policy to limit the number of IOPS that an object can consume. Do not use this setting for any performance demanding objects. This feature may be used to limit IOPS allocation for backup/restore operations to prevent saturating production workload. Use with caution as this will influence backup/restore time and the RTO/RPO as well.
Additional settings adjustments may help to gain more benefits to your workload. Use the section describing “Tier 1” workload for SQL Server workloads where performance is the priority, while the section “Tier 2” might fit better to the workloads where capacity and cost savings are more important.
Tier 1 – High performance Online transaction processing (OLTP) workload:
Typical SQL Server I/O activities for OLTP workloads include queries with many seek operations, checkpoint activity that flushes dirty pages to disk periodically, and transaction log writes. The in-flight IO of data is fairly small in size, and typically between 8K and 64K. TPC-E benchmarks are commonly used to reproduce the OLTP-like workloads.
For such workloads where performance, and availability are highly important, consider the following:
- All-flash vSAN deployments
- Consider using at least SAS SSD devices. A SAS SSD device will perform better than a SATA SSD device in most cases, and has a bigger queue depth. Newer devices such as NVMe, will yield better performance.
- Disable deduplication and compression (this is disabled by default in vSAN). You may implement compression on a per table basis at the database level (Enterprise edition of SQL Server is required) if needed.
- Set object space reservation – “Thick provisioning” for all VMDKs hosting SQL Server data and log files. Pre-allocating the space for SQL Server logs can reserve the space to avoid the out of space issue. The capacity will be reserved up front from the vSAN datastore. From the SQL Server side, we recommend granting Perform maintenance tasks privilege in the setup to initialize the data files instantaneously (available in the UI starting with SQL Server 2016). Pre-allocate the space for the database log, to avoid the auto growth happening, or to set the growth method to by GB instead of by percentage.
- Do not use the IOPS limit for object option
- RAID-1 mirroring and at least 1 failure to tolerate (FTT) for both data, and log VMDKs is the recommended failure tolerance method for this type of workload.
- Additional availability requirements might dictate the need to increase FTT and/or implement SQL Server high availability solutions like Always On Availability Groups. If FTT will be increased, make adjustments to the number of disks required to satisfy not only capacity, but performance requirements. If Availability Groups are enabled, you can have application level high availability as well as the underlying storage high availability, however more storage space, and a well-designed virtual machine network is required.
- If a multi-site availability is required, the “vSAN Stretched Cluster” configuration may be used to increase the data availability across data centres.
- It is also important to think about the network switches to be used during a vSAN deployment. While using configurations such as all-NVMe vSAN clusters, the physical disk controllers are no longer part of the configuration, and allows for such NVMe devices to process a lot of IO in a small amount of time., Enterprise grade switches (10Gbit or more) capable of having large buffer sizes (non-shared) is ideal in order to allow for more IO flow to the vSAN cluster.
Tier 2 – Online transaction processing (OLTP) workload:
Tier-2 workloads associate with general-purpose SQL Server deployments which do not have performance dominant requirements but need to be cost effective. All considerations posted below should not be applied to any infrastructure where performance is the primary goal.
- Hybrid vSAN might satisfy the requirements for such workloads if the following is taken into account:
- The use of multiple disk groups is strongly recommended to increase the system throughput
- It is important to have enough space in the caching tier. The general recommendation of the SSD as the caching tier for each host is to be at least 10 percent of the total storage capacity. However, the recommended SSD size should be at least two times that of the working data set.
- Select the appropriate SSD class to support planned IOPS. A SAS SSD device will perform better than a SATA SSD device in most cases, and has a bigger queue depth
- All-flash vSAN might yield satisfactory performance with better space savings that hybrid vSAN, at a lower costs with the following settings:
- Enable deduplication and compression if the workload is low write intensive and cost/space savings outweigh performance requirements.
- For the SQL Server data disks, using RAID 5/6 erasure coding to reduce space usage might be a choice, if space/cost savings are desired. The virtual disks for transaction logs should still be placed on the VMDK configured with RAID 1 policy. It is important to note that this recommendation applies to workloads that do not require high performance, and space savings are desired.
- Set object space reservation – “Thick provisioning” for all VMDKs hosting SQL Server log files. Pre-allocating the space for SQL Server logs can reserve the space to avoid the out of space issue. The capacity will be reserved up front from the vSAN datastore. From the SQL Server side, we recommend granting Perform maintenance tasks privilege in the setup to initialize the data files instantaneously (available in the UI starting with SQL Server 2016). Pre-allocate the space for the database log, to avoid the auto growth happening, or to set the growth method to by GB instead of by percentage.
Data Warehouse and/or Reporting workloads
Data warehouse applications issue scan-intensive operations that access large portions of the data at a time, as well as commonly performed bulk loading operations. These operations result in larger I/O sizes than OLTP workloads do and require a storage subsystem that can provide the required throughput. This makes the throughput, or megabytes per second (MB/s), the critical metric. Common data warehouse type applications include decision support applications. TPC-H benchmarks are commonly used to reproduce such workload.
For such workloads consider the following:
- In all-flash configurations, using NVMe SSD at the capacity tier may benefit the workload performance due to the read dominant query operations.
- In all-flash configurations where saving space is the priority, usage of RAID5/6 for data VMDKs may be a choice. Perform thoughtful testing as RAID1configuration will generally provide better performance.
- Set object space reservation – Pre-allocating the space for SQL Server logs can reserve the space to avoid the OUT OF SPACE issue. The capacity will be reserved up front from the vSAN datastore.
- Do not use the IOPS limit for object option. As throughput is the critical metric, enough bandwidth should be available for such workloads.
- Consider using high performance networking devices for the vSAN backend network. Use at least 10 Gbit switches with enough buffers to sustain high throughput.
Special thanks to Oleg Ulyanov (CPBU) and Tony Wu (SABU) for being co-authors for these recommendations.