Recently in partner workshops I have come across some interesting discussions about the impact of hyper-threading and NUMA in sizing business critical applications on VMware. So here is an SAP example based on SAP’s sizing metric “SAPS” (a hardware-independent unit of measurement that equates to SAP OLTP throughput of Sales and Distribution users). The examples here refer to vSphere scheduling concepts in this useful whitepaper The CPU Scheduler in VMware vSphere 5.1 .
SAP sizing requires the SAPS rating of the hardware which for estimation purposes can be obtained from certified SAP benchmarks published at http://www.sap.com/solutions/benchmark/sd2tier.epx . Let’s use certification 2011027 and assume that we plan to deploy on similar hardware as used in this benchmark. This is a virtual benchmark on vSphere 5 with the following result: 25120 SAPS (at ~100% CPU) for 24 vCPUs running on a server with 2 processors, 6 cores per processor and 24 logical CPUs as hyper-threading was enabled. This is a NUMA system where each processor is referred to as a NUMA node. (Note cert 2011027 is an older benchmark, the SAPS values for vSphere on newer servers with faster processors would be different/higher, hence work with the server vendors to utilize the most recent and accurate SAPS ratings).
In this example I will design for application server virtual machines which as they scale out horizontally gives us the flexibility of choosing the number of vCPUs per virtual machine. Now do we go with # of vCPUs = # of cores or # of vCPUs = number of logical CPUs? Let’s show an example for both. I will consider the following:
- SAP sizing is typically conducted at 60-70% CPU and normal practice is to scale down the benchmark SAPS results, I will not bother with this and go with the 25120 SAPS at 100% CPU.
- Size within the NUMA boundaries. In this two processor NUMA system example, there are two NUMA nodes each with one processor and memory. The access to memory in the same node is local; the access to the other node is remote. The remote access takes more cycles because it involves a multi-hop operation so keeping the memory access local improves performance.
- For a 6 core NUMA node the virtual machine vCPU size should be a multiple divisor (or multiple) of 6 giving us 1, 2, 3 or 6 way VMs (see this VMware blog).
- I assume workloads in all the virtual machines peak at the same time.
Let’s first show a design with # of vCPUs = # of cores i.e. no vCPU over-commit.
Example 1: # of vCPUs = # of cores, 2-way and 6-way app servers
With all the virtual machines under load simultaneously, the ESXi scheduler by default, with no specific tuning will: allocate a home NUMA node for the memory of each virtual machine; schedule vCPUs of each virtual machine on its home node thus maintaining local memory access; schedules each vCPU on a dedicated core to allow exclusive access to core resources. (Note that in physical environments such NUMA optimizations would require OS commands to localize the processing e.g. Linux command “numactl”) However the above configuration does not give us 25120 SAPS as not all the logical CPUs are being utilized as was the case in the benchmark. The hyper-threading performance boost for an SAP OLTP workload is about 24% (based on tests by VMware performance engineering – see this blog) so for # of vCPUs = # of cores we should theoretically drive about 25120/1.24 = 20258 SAPS. Also we can estimate about 20258/12 = 1688 SAPS per vCPU so the 2-way virtual machine is rated at 1688 x 2 = 3376 SAPS and the 6-way is 1688 x 6 = 10128 SAPS (@100% CPU in this example). Are we “wasting” SAPS by not utilizing all the logical CPUs? Technically yes but for practical purposes not a major issue because:
- We have some CPU headroom which can be claimed back later after go-live when the virtual machines can be rebalanced based on the actual workload. At this point vCPU over-commit may be possible as virtual machine workloads may not peak at the same time.
- Hyper-threading benefit is dependent on the specific workload and while the 24% hyper-threading boost is based on an OLTP workload profile, the actual workload may be less impacted by hyper-threading for example:
- CPU intensive online reporting
- CPU intensive custom programs
- CPU intensive batch jobs
- SAP has created another metric in their sizing methodology referred to as SCU – Single Computing Unit of performance. SAP has categorized different workloads/modules based on their ability to take advantage of hyper-threading. So some workloads may experience a hyper-threading benefit lower than 24%.
Now what if we need to drive the maximum possible SAPS from a single server – this is when we would need to configure # of vCPUs = # of logical CPUs. The following configuration can achieve the maximum possible performance.
Example 2: # of vCPUs = # of logical CPUs
In the above design the virtual machine level parameter “numa.vcpu.preferHT” needs to be set to true to override default ESXi scheduling behavior. Default behavior is where ESXi schedules the virtual machine across NUMA nodes when the number of vCPUs for a single virtual machine is greater than the number of cores in the NUMA node. This results in vCPUs of a virtual machine being scheduled on a remote node relative to its memory location. This is avoided in the above example and performance is maximized because: ESXi schedules all vCPUs of each virtual machine on the same NUMA node that contains the memory of the virtual machine thus avoiding the penalty of any remote memory access; all logical CPUs are being used thus leveraging the hyper-threading benefit (note vCPUs are sharing core resources so the SAPS per vCPU in this case is 25120/24 = 1047 at 100% CPU). This configuration is commonly used in the following situations: running a benchmark to achieve as much performance as possible (as was done for the app server virtual machines in the 3-tier vSphere SAP benchmark certification 2011044); conducting physical versus virtual performance comparisons. For practical purposes designing for # of vCPUs = # of logical CPUs may not be so critical. If we were to design for a 12-way app server (example 2 above), and actual workload was less than planned with lower CPU utilization, we would have plenty of vCPUs without the added gain from hyper-threading. There are no hard rules so if desired, during the sizing phase, you can start with # vCPUs = # of cores or number of vCPUs = # of threads based on which approach you think best fits your needs.
Summarizing I have shown two sizing examples for SAP application server virtual machines in a hyper-threaded environment. In both cases sizing virtual machines within NUMA nodes helps with performance. The SAPS values shown here are based on a specific older benchmark certification and would be different for modern day servers and more recent benchmarks.
Finally a thank-you to Todd Muirhead (VMware performance engineering), for his reviews and inputs.