VCF Compute (vSphere) Workloads

64 Cores per NUMA Node Limit in Microsoft SQL Server: Recommendation for Efficiently Allocating Logical CPUs to SQL Server VMs on VMware vSphere

Beginning with Update 11, Microsoft SQL Server 2022 introduced a “breaking change” that limits the maximum number of logical processors in a single NUMA node supported by SQL Server to 64. According to Microsoft, SQL Server cannot start if it detects more than 64 logical processors in a NUMA node.

Consequently, Microsoft recommends you should either:

  1. Disable hyperthreading to limit the number of logical processors presented to a SQL Server instance, or
  2. Reduce the logical core count on bare-metal instances.

Broadcom does not support this recommendation in a VMware vSphere environment for the following reasons.

Regarding the recommendation to disable hyperthreading:

  1. This recommendation does not apply to on-premises or cloud-hosted vSphere virtual infrastructure.
  2. vSphere provides highly simplified, flexible configuration options to control how logical processors are distributed and presented to a virtual machine (VM) on the vSphere platform, and this presentation is what the guest operating system, in turn, presents to the hosted application.

Regarding the recommendation to reduce logical cores on the physical server:

  1. This does not apply to a physical host running VMware ESXi. As explained in the previous paragraph, vSphere/ESXi allows you to present desired NUMA topologies to VMs regardless of the host’s physical topology.
  2. Disabling physical processor core counts reduces the number of usable cores. This configuration choice keeps you from getting the full value from your hardware financial investments.
  3. Sub-NUMA clustering (SNC) or cluster-on-die (CoD) constrains performance for large workloads by splitting NUMA nodes into smaller chunks, limiting a large VM’s access to its allocated compute resources, and forcing transactions to cross NUMA boundaries.

Broadcom’s recommendation

The following describes, at a high level, how you can safely assign logical processors to their SQL Server VMs without disabling (and thereby depriving their workloads of the extra capacity benefits of) hyperthreading.

The example presented here is based on vSphere 8.0 (the latest version at the time of this writing), but the concept and options are available in all recent versions of vSphere.

On vSphere, logical processors can be assigned to a VM in one of the following two ways:

  • Let the hypervisor determine the optimal topology (default behavior).
  • Manually specify the number of cores per socket and the number of desired NUMA nodes.

Let the hypervisor determine the optimal topology (default behavior)

This option, shown in the screenshot below, results in a 2-socket, 2-NUMA node topology (each with 72 logical processors). This configuration will trigger the breaking change and prevent the SQL Server VM from starting.

A screenshot of a computerDescription automatically generated

Warning: Whether hyperthreading is enabled or not, this problem will still arise because the ESXi host contains more than 64 physical cores in a socket. Therefore, Broadcom does not recommend this option.

Manually specify the number of cores per socket and the number of desired NUMA nodes

The screenshot below shows that we selected 48 cores per socket—a lower number than the problematic 64. This indicates that a 4-socket topology, with no more than 48 logical CPUs per socket, will be visible to the guest operating system. (Also refer to Configure the CPU Topology of a Virtual Machine.)

If we further select the desired number of NUMA nodes we want to present (4, in this case), ESXi intelligently scales down the number of cores in each socket to 36 cores per socket to provide the total number of desired logical processors to present to our VM. This means that the guest operating system will see 4 NUMA nodes, each with 36 logical processors in it.

The screenshot below illustrates this manually configured topology that the SQL Server VM sees, irrespective of the hardware’s hyperthreading state and the ESXi host’s underlying physical NUMA topology.

Lastly, Microsoft SQL Server receives and works with this same presentation. As seen in the screenshots below, the SQL Server Soft NUMA feature accepts the presentation and uses the topology to construct its own NUMA topology.

Conclusion

With this sample configuration, you can successfully run your large-capacity SQL workloads on vSphere platforms and maintain the benefits of hyperthreading, which will not trigger the breaking change.

At Broadcom, reducing total cost of ownership (TCO) and improving return on investment (ROI) are the cornerstones of our customer-centric service excellence. Hyperthreading in the age of virtualization and cloud computing enables you to more efficiently use your procured hardware to its fullest capacity.

Since there is no known hyperthreading deficiency causing this issue, Broadcom advises you to use the built-in capabilities of ESXi’s compute resource management to manually configure the NUMA topologies for your mission-critical SQL Server workloads running in vSphere VMs. This will ensure that the topology presented to a SQL Server VM does not contain more than 64 logical processors in a single NUMA node.

Lastly, it is important to take into consideration the physical compute and NUMA characteristics when creating the virtual machine virtual Topology. While the solution above outlines creating additional smaller vNUMA nodes to solve the limitation, you must still ensure the vNUMA nodes and their vCPU counts are evenly divisible and fit into the physical socket(s).

Example:

Dual socket EPYC 9654 with SMT, where each socket has 96 cores and 192 threads.

In this case, wanting to use the whole host, you’d optimally create four vNUMA nodes, each with 48 vCPUs such that two vNUMA nodes would fit into each socket. This symmetrical sizing based on physical core count ensures optimal performance by not over-subscribing a socket.