

Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb

[Updated 20 Oct 2017 – ‘NEW’ with Memory Considerations]

Using virtualization, we have all enjoyed the flexibility to quickly create virtual machines with various virtual CPU (vCPU) configurations for a diverse set of workloads.  But as we virtualize larger and more demanding workloads, like databases, on top of the latest generations of processors with up to 24 cores, special care must be taken in vCPU and vNUMA configuration to ensure performance is optimized.

Much has been said and written about how to optimally configure the vCPU presentation within a virtual machine – Sockets x Cores per Socket.

 

[Screenshot: virtual machine CPU configuration showing 20 vCPUs presented as 2 Sockets x 10 Cores per Socket]

 

Sidebar: When you create a new virtual machine, the number of vCPUs assigned is divided by the Cores per Socket value (default = 1 unless you change the dropdown) to give you the calculated number of Sockets.  If you are using PowerCLI, these properties are known as NumCPUs and NumCoresPerSocket. In the example screenshot above, 20 vCPUs (NumCPUs) divided by 10 Cores per Socket (NumCoresPerSocket) results in 2 Sockets. Let's refer to this virtual configuration as 2 Sockets x 10 Cores per Socket.
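For those who like to check or change this from PowerCLI rather than the wizard, here is a minimal sketch. It assumes an existing Connect-VIServer session, "SQL01" is a placeholder VM name, and the property paths used are the vSphere API ones surfaced through ExtensionData:

# Report the current vCPU presentation of a VM (placeholder name "SQL01")
$vm     = Get-VM -Name "SQL01"
$numCpu = $vm.ExtensionData.Config.Hardware.NumCPU
$cores  = $vm.ExtensionData.Config.Hardware.NumCoresPerSocket
"{0} vCPUs at {1} Cores per Socket = {2} Sockets" -f $numCpu, $cores, ($numCpu / $cores)

# Change the presentation to 2 Sockets x 10 Cores per Socket (power the VM off first)
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.NumCPUs           = 20
$spec.NumCoresPerSocket = 10
$vm.ExtensionData.ReconfigVM($spec)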

 

History

The ability to set this presentation was originally introduced in vSphere 4.1 to overcome operating system license limitations. As of vSphere 5, those configuration items now set the virtual NUMA (vNUMA) topology that is exposed to the guest operating system.

NUMA is increasingly important for ensuring that workloads, like databases, allocate and consume memory within the same physical NUMA node where their vCPUs are scheduled.  When a virtual machine is sized larger than a single physical NUMA node, a vNUMA topology is created and presented to the guest operating system.  This virtual construct allows a workload within the virtual machine to benefit from physical NUMA, while continuing to support functions like vMotion.

While the vSphere platform is extremely configurable, that flexibility can sometimes be our worst enemy in that it allows for many sub-optimal configurations.

Back in 2013, I posted an article about how Cores per Socket could affect performance based on how vNUMA was configured as a result.  In that article, I suggested different options to ensure the vNUMA presentation for a virtual machine was correct and optimal. The easiest way to achieve this was to leave Cores per Socket at the default of 1, which presents the vCPU count as Sockets without configuring any virtual cores.  Using this configuration, ESXi would automatically generate and present the optimal vNUMA topology to the virtual machine.

However, this suggestion has a few shortcomings. Since the vCPUs are presented as Sockets alone, licensing models for Microsoft operating systems and applications can be limited by the number of sockets.  This is less of an issue today with core-based operating system licensing, which Microsoft transitioned to starting with Windows Server 2016, but it is still a consideration for earlier releases.

 

Example:

Since both Windows Server 2012 and 2016 only support up to 64 sockets, creating a “monster” Windows virtual machine with more than 64 vCPUs requires us to increase the Cores per Socket so the guest can consume all the assigned processors.

Example:

A virtual machine with 8 Sockets x 1 Core per Socket, hosting a single Microsoft SQL Server 2016 Standard Edition license, would only be able to consume 4 of the 8 vCPUs, since that edition's license is limited to the “lesser of 4 sockets or 24 cores”.  If the virtual machine is configured with 1 Socket x 8 Cores per Socket, all 8 vCPUs could be leveraged (reference: https://msdn.microsoft.com/en-us/library/ms143760.aspx).

 

Additionally, some applications, like Microsoft SQL Server 2016, behave differently based on the Cores per Socket topology presented to them.

 

Example:

A virtual machine, hosting a Microsoft SQL Server 2016 Enterprise Edition license, created with 8 Sockets x 2 Cores per Socket may behave differently than a virtual machine created with 2 Sockets x 8 Cores per Socket, even though they're both 16 vCPUs.  This is due to the soft-NUMA feature within SQL Server, which gets automatically configured based on the number of cores the operating system can use (reference: https://msdn.microsoft.com/en-us/library/ms345357.aspx).
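As a quick way to see what the guest operating system (and therefore SQL Server's automatic soft-NUMA logic) is actually presented with, a short PowerShell query inside the Windows guest is enough. This is just an illustrative sketch; it queries standard WMI classes rather than anything SQL Server specific:

# Run inside the Windows guest: one row is returned per virtual socket.
# A 2 Sockets x 8 Cores per Socket VM shows two rows of 8 cores;
# an 8 Sockets x 2 Cores per Socket VM shows eight rows of 2 cores.
Get-CimInstance Win32_Processor |
    Select-Object SocketDesignation, NumberOfCores, NumberOfLogicalProcessors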

 

vNUMA Behavior Changes in vSphere 6.5

In an effort to automate and simplify configurations for optimal performance, vSphere 6.5 introduced a few changes in vNUMA behavior.  Thanks to Frank Denneman for thoroughly documenting them here:

http://frankdenneman.nl/2016/12/12/decoupling-cores-per-socket-virtual-numa-topology-vsphere-6-5/

Essentially, the vNUMA presentation under vSphere 6.5 is no longer controlled by the Cores per Socket value. vSphere will now always present the optimal vNUMA topology unless you use advanced settings.

That said…

 

vNUMA Considers Compute Only

When a vNUMA topology is calculated, only the compute dimension is considered. It does not take into account the amount of memory configured for the virtual machine, or the amount of memory available within each pNUMA node. So, this needs to be accounted for manually.
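To do that manual accounting, you first need the physical numbers. The sketch below pulls them with PowerCLI; the host name is a placeholder and it assumes memory is evenly populated across the pNUMA nodes:

# Report cores and memory per pNUMA node for a host (placeholder name)
$esx = Get-VMHost -Name "esx01.lab.local"
$hw  = $esx.ExtensionData.Hardware

$numaNodes    = $hw.NumaInfo.NumNodes
$coresPerNode = $hw.CpuInfo.NumCpuCores / $numaNodes
$memPerNodeGB = [math]::Round(($hw.MemorySize / 1GB) / $numaNodes)   # assumes even DIMM population

"{0} pNUMA nodes: {1} cores and ~{2} GB RAM per node" -f $numaNodes, $coresPerNode, $memPerNodeGB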

 

Example:

An ESXi host has 2 pSockets, each with 10 Cores per Socket, and has 128GB RAM per pNUMA node, totalling 256GB per host.

If you create a virtual machine with 128GB of RAM and 1 Socket x 8 Cores per Socket, vSphere will create a single vNUMA node. The virtual machine will fit into a single pNUMA node.

If you create a virtual machine with 192GB RAM and 1 Socket x 8 Cores per Socket, vSphere will still only create a single vNUMA node, even though the requirements of the virtual machine will cross 2 pNUMA nodes, resulting in remote memory access. This is because only the compute dimension is considered.

The optimal configuration for this virtual machine would be 2 Sockets x 4 Cores per Socket, for which vSphere will create 2 vNUMA nodes and distribute 96GB of RAM to each of them.
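The arithmetic behind that recommendation can be sketched in a few lines of PowerShell. This is only an illustration of the sizing logic described above (take the larger of the node count needed for the vCPUs and the node count needed for the memory), not a supported tool:

# Example host: 10 cores and 128 GB RAM per pNUMA node; VM: 8 vCPUs and 192 GB RAM
$vCpus = 8;  $vmMemGB = 192
$coresPerPNode = 10;  $memPerPNodeGB = 128

$nodesForCpu = [math]::Ceiling($vCpus   / $coresPerPNode)   # 1 node covers the vCPUs
$nodesForMem = [math]::Ceiling($vmMemGB / $memPerPNodeGB)   # 2 nodes needed for the memory
$sockets     = [math]::Max($nodesForCpu, $nodesForMem)

# Assumes the vCPU count divides evenly across the sockets (see the rules of thumb below)
"Suggested: {0} Sockets x {1} Cores per Socket" -f $sockets, ($vCpus / $sockets)
# -> Suggested: 2 Sockets x 4 Cores per Socket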

 

So how do we make this easier?

Given the value an optimal compute configuration can provide, these are some simple rules that can be followed to ensure the optimal configuration is implemented.

 

I propose the following Rules of Thumb:

  1. While there are many advanced vNUMA settings, only in rare cases do they need to be changed from defaults.
  2. Always configure the virtual machine vCPU count to be reflected as Cores per Socket, until you exceed the physical core count of a single physical NUMA node OR until you exceed the total memory available on a single physical NUMA node.
  3. When you need to configure more vCPUs than there are physical cores in the NUMA node, OR if you assign more memory than a NUMA node contains, evenly divide the vCPU count across the minimum number of NUMA nodes.
  4. Don’t assign an odd number of vCPUs when the size of your virtual machine, measured by vCPU count or configured memory, exceeds a physical NUMA node.
  5. Don’t enable vCPU Hot Add unless you’re okay with vNUMA being disabled.
  6. Don’t create a virtual machine with more vCPUs than the total number of physical cores in your host. (A rough scripted check of these rules follows below.)
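Here is the rough scripted check mentioned above. It is a hedged PowerCLI sketch, not a VMware tool: it only flags configurations that look like they break rules 2, 3, 5 or 6, assumes memory is evenly split across pNUMA nodes, and uses the vSphere API property paths exposed through ExtensionData.

# Flag virtual machines that appear to violate the rules of thumb above
foreach ($esx in Get-VMHost) {
    $hw           = $esx.ExtensionData.Hardware
    $pNodes       = $hw.NumaInfo.NumNodes
    $coresPerNode = $hw.CpuInfo.NumCpuCores / $pNodes
    $memPerNodeGB = ($hw.MemorySize / 1GB) / $pNodes

    foreach ($vm in Get-VM -Location $esx) {
        $cfg   = $vm.ExtensionData.Config
        $vCpus = $cfg.Hardware.NumCPU
        $memGB = $cfg.Hardware.MemoryMB / 1024

        # Rule 5: Hot Add disables vNUMA (only matters at 9+ vCPUs, the default vNUMA threshold)
        if ($cfg.CpuHotAddEnabled -and $vCpus -gt 8) {
            "{0}: CPU Hot Add is enabled, so vNUMA is disabled" -f $vm.Name
        }
        # Rule 6: more vCPUs than the host has physical cores
        if ($vCpus -gt $hw.CpuInfo.NumCpuCores) {
            "{0}: more vCPUs than physical cores in the host" -f $vm.Name
        }
        # Rules 2 and 3: VM spans a pNUMA node but is presented as a single socket
        if (($vCpus -gt $coresPerNode -or $memGB -gt $memPerNodeGB) -and
            $cfg.Hardware.NumCoresPerSocket -eq $vCpus) {
            "{0}: spans a pNUMA node but is presented as one socket" -f $vm.Name
        }
    }
}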

 

Example:

This table outlines how a virtual machine should be configured on a dual-socket, 10-core physical host to ensure an optimal vNUMA topology and performance, regardless of vSphere version, when the assigned memory is less than or equal to that of a pNUMA node.

Example:

This table outlines how a virtual machine should be configured on a dual-socket, 10-core physical host to ensure an optimal vNUMA topology and performance, regardless of vSphere version, when the assigned memory is greater than that of a pNUMA node.
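If you would rather generate the recommendation than read it from a table, a short loop along the lines below encodes the rules of thumb for this dual-socket, 10-core host. Adjust the node size to match your own hardware; if the virtual machine's memory exceeds a single pNUMA node, start at 2 sockets even for the smaller vCPU counts (the second table's scenario).

# Print a suggested topology for 1-20 vCPUs on a host with 10 cores per pNUMA node
$coresPerPNode = 10

foreach ($vCpus in 1..20) {
    if ($vCpus -le $coresPerPNode) {
        $sockets = 1                                    # fits inside one pNUMA node
    }
    elseif ($vCpus % 2 -eq 0) {
        $sockets = 2                                    # split evenly across two nodes
    }
    else {
        "{0} vCPUs: avoid (odd count spanning pNUMA nodes - rule 4)" -f $vCpus
        continue
    }
    "{0} vCPUs: {1} Sockets x {2} Cores per Socket" -f $vCpus, $sockets, ($vCpus / $sockets)
}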

 

To sum it up, these are some of the benefits provided by using these rules of thumb:

  • The resulting vNUMA topology will be correct and optimal regardless of the version of vSphere you are using.
  • Reduction in remote memory access since a virtual machine will be contained within the fewest number of physical NUMA nodes.
  • Proper alignment with most operating system and application licensing models.
  • The in-guest application is given the opportunity for optimal self-configuration.

There are always exceptions to ‘Rules of Thumb’ so let’s explore yours in the comments section below.


About Mark Achtemichuk

Mark Achtemichuk currently works as a Staff Engineer within VMware's R&D Operations and Central Services Performance team, focusing on education, benchmarking, collaterals and performance architectures.  He has also held various performance-focused field, specialist and technical marketing positions within VMware over the last 7 years.  Mark is recognized as an industry expert and holds a VMware Certified Design Expert (VCDX#50) certification, one of fewer than 250 worldwide. He has worked on engagements with Fortune 50 companies, served as technical editor for many books and publications and is a sought-after speaker at numerous industry events.  Mark is a blogger and has been recognized as a VMware vExpert from 2013 to 2016.  He is active on Twitter at @vmMarkA where he shares his knowledge of performance with the virtualization community. His experience and expertise from infrastructure to application helps customers ensure that performance is no longer a barrier, perceived or real, to virtualizing and operating an organization's software-defined assets.

32 thoughts on “Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb”

  1. Pingback: Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb – VMpro.at

  2. Troy

    If your VMs were created the old way (vCPUs set by socket count), is it worthwhile to go back and change them to the core-count method after upgrading to vSphere 6.5? I.e., VMs currently having 4 sockets x 1 core/socket. Should this now be changed to 1 socket x 4 cores/socket?

    Or should this new rule of thumb only be used when creating new VMs going forward? If so, does that mean the template VMs should be changed after upgrading to vSphere 6.5? Or would you need to rebuild your templates from scratch using the new rule of thumb?

    Reply
    1. Mark Achtemichuk (Post author)

      The configuration of the sockets and cores for the virtual machine becomes VERY important when you cross a pNUMA node.

      When the virtual machine is smaller than a pNUMA node, there will ‘probably’ be less impact, since the virtual machine will be scheduled into 1 pNUMA node anyways.

      So I’d suggest an approach like:

      1) re-configure all VMs that are larger than a pNUMA node (priority)
      2) reconfigure templates so all new VMs are created appropriately (priority)
      3) evaluate the effort of re-configuring everything else (good hygiene) <- personally I'd do this

      Reply
  3. Ak

    What’s the performance tradeoff versus enabling hot-add vCPU and doing 1 core per socket, being able to start small and only hot-add CPUs one at a time as needed? App owners historically are very bad at ‘right-sizing’. Also, by disabling hot add, you are taking away one of the best features of the platform over other hypervisors.

    Reply
    1. Klaus

      vNUMA will be disabled when CPU Hot Add is enabled. So, if your VM spans more than one physical NUMA node, those nodes will not be presented within the VM. The guest operating system will see one large NUMA node with all cores and with “UMA” memory. It will not be able to tell which memory is local or remote to a CPU. NUMA-aware applications like the SQL Server database engine will greatly suffer.
      Thus, you will have the performance hit that Sysinternals’ coreinfo.exe could show you.

      Reply
    2. Mark Achtemichuk (Post author)

      As Klaus said – if your app is NUMA-aware (MS SQL being the best example) – you want vNUMA enabled with an optimal config.

      Now if your VM is smaller than a pNUMA node, then the benefit may or may not be measurable, since the VM will be scheduled into a single pNUMA node. So you could continue to leverage hot add for small VMs. I highly recommend not adding hot add into templates. That said, my experience is that hot add highlights a miss in operational processes. I might suggest evaluating a more comprehensive capacity and performance tool like vROps and its right-sizing reports.

      Reply
      1. Fouad

        Thanks for the clarification. When Hot Add is activated, is it possible that a 4 vCPU VM runs on 2 pNUMA nodes? Because it will see just one pool with all the cores, without distinguishing the pNUMA nodes.
        I'm repeating what you said to be sure I fully understand.
        Thanks

        Reply
        1. Mark Achtemichuk (Post author)

          Hot-Add CPU will only turn off the vNUMA presentation to that virtual machine (which incidentally doesn’t matter in this case since by default vNUMA is only enabled at 9 vCPU or larger). vSphere will still schedule that 4 vCPU virtual machine into a single pNUMA node.

          Reply
  4. Pingback: Tuning | Pearltrees

  5. JD

    Let's say I have a host with 4 Sockets x 8 Cores per Socket and 512 GB RAM (slots fully loaded). My VM is 8 vCPU with > 256 GB RAM. According to the list above, I would have 1 socket by 8 cores, leaving me with 1 vNUMA node, but with 392 GB RAM I'm crossing into the 2nd pNUMA node. Does the VMware scheduler recognize this and schedule my VM across both pNUMA nodes? Or would this be one of the exceptions to the rule where I would need to configure my vCPU with 2 sockets?

    Reply
      1. Mark Achtemichuk (Post author)

        Let’s see if I followed this:

        You have a host with 4×8 and 512GB RAM, which means 4 pNUMA nodes with 128GB RAM each.

        The optimal VM size would be 1×8 with 128GB to fit into the pNUMA node.

        I’m currently creating another article to look at use cases like yours with larger memory requirements than the pNUMA node has. Stay tuned.

        Reply
  6. Neode

    Hello Mark

    Thanks for your article.
    If I understand, it's better to use one socket when it's possible, and it's the best practice for small VMs. But for example, if I have nodes with 2×16 cores and I want some database VMs with 12 or 16 cores on the cluster, is it still the best practice to use 12 or 16 sockets / 1 core?

    Reply
    1. Mark Achtemichuk (Post author)

      One of the only reasons to really cross a pNUMA node with a VM would be that there isn’t enough memory on the pNUMA node to satisfy the configuration for the VM. Otherwise I suggest, as outlined above, to use the max pCore count of the pSocket first before incrementing the vSocket count.

      You suggest you have 2×16 hosts so I’d always use 1×1, 1×2, 1×3…1×15, 1×16, then 2×9, 2×10… etc

      So no – I suggest you don’t ever use 12×1 or 16×1 as configurations.

      Hope that helps clarify.

      Reply
      1. Yan

        Hi Mark,

        I expect you would say yes (assuming there is no OS/App licensing concern and VM memory is less than a pNUMA node), or this will contradict what you mentioned in the article (extracted below).

        “The easiest way to achieve this was to leave Cores per Socket at the default of 1, which presents the vCPU count as Sockets without configuring any virtual cores. Using this configuration, ESXi would automatically generate and present the optimal vNUMA topology to the virtual machine.”

        Please clarify. Thanks.

        Reply
  7. Ronny

    Hi Mark,

    does VMware provide some tools (PowerCLI code, vROps reporting, vCenter alarms) to check/remediate for optimal vNUMA configuration of a vSphere environment?
    If you plan to migrate to new hardware (i.e. hw lifecycle) this is quite an important task to address post-migration.

    regards,
    Ronny

    Reply
    1. Mark Achtemichuk (Post author)

      There are some future efforts being made here but as of today I’m not aware of any VMware tool or code to support this. Sorry.

      Reply
  8. Wes Crockett

    Thanks for this write-up. We have a server that we want to bump from a 4-core to a 6-core machine. Would the best bet be to make it a single socket with 6 cores?

    Reply
    1. Mark Achtemichuk (Post author)

      Do you mean you have a virtual machine with 4 vCPUs and you are up-sizing it to 6 vCPUs?

      If so, then yes a 1×6 would be the most appropriate configuration assuming your ESXi host has at least 6 cores per socket.

      Reply
  9. Dennis

    I was wondering if the article has been written for larger memory requirements than the pNUMA node has. I currently need to create a SQL server with 512 GB of RAM with 8 vCPUs. The host is a dual processor with 12 cores per CPU and a total of 512 GB.

    Reply
  10. Jason Taylor

    Thanks for this article. I am somewhat confused as I thought 1 core per socket was the recommended configuration. The 2013 article seems to imply that was the best performance, but the table above seems to say the opposite, to increase the cores per socket.

    I have a Physical Server with 2 Sockets, 8 Cores per socket, so 32 logical processors. There is 512GB of RAM in this host. This host is ESXi 5.5. I have a VM with 14 vCPU and it is set at 14 sockets with 1 core per socket. It has 256GB of Memory. This server runs SQL 2012 Enterprise. Should it be set to 2 Sockets with 7 cores per socket?

    Reply
    1. Mark Achtemichuk (Post author)

      This table and rule set should be considered the current recommendation replacing the 2013 article.

      Yes – your VM should be 2×7 for best performance and presentation.

      Reply
  11. Nuruddin

    Mark, Thanks for the wonderful article.
    Following is the query that we have and want to clarify:
    We’ve got 1 host with 8 physical cores, and after hyper-threading it shows 16 logical cores.
    Technically we can assign 16 cores to the VM, but the article says you should not allocate more than the available physical cores to a VM, meaning we should allocate only 8 cores.
    Question to you is what is the implication if we allocate more vCPU cores than the Physical Cores available on the physical host?

    Thanks
    Nur

    Reply
    1. Mark Achtemichuk (Post author)

      When you start to build virtual machines larger than the pCore count (which you can, but I consider that an ‘advanced’ activity and you need to keep consolidation ratios low), you place yourself in a situation in which you can generate contention quickly and make the performance worse than if you sized it less than or equal to the pCore count.

      Remember Hyper-threading offers value by keeping the pCore as busy as possible using two threads. If one thread is saturating the pCore, there’s little value in using a second thread.

      So assuming your VM can keep 8 vCPUs busy, then increasing the vCPU count doesn't mean your VM will get more clock cycles. In fact, it may now create scheduling contention, and overall throughput is reduced.

      You can build VMs larger than pCore count when your applications value parallelization over utilization. For example, if you know the application values threads, but utilization of each thread never exceeds 50%, then 16 threads @ 50% = 8 pCore saturation.

      Reply
  12. Rekha

    Hi Mark,

    Thanks for the great write-up. It's clear not to enable hot add as it disables vNUMA.

    1. Is it applicable for small VMs with 4 vCPUs and 4 GB RAM too? If hot add is enabled, can I configure this VM with 4 sockets and 1 core and VMware will take care of allocating resources locally? Or should I still go with 1 socket and 4 cores?

    2. For larger VMs – I have a host with 2 sockets, 14 cores and 256 GB RAM (128 GB memory in one pNUMA node).
    Two scenarios here.
    a) VM with 14 vCPUs and 32 GB memory. So is it better to configure 1 socket x 14 cores (as memory is still local)?
    b) VM with 16 vCPUs and 32 GB memory. Configure 2 sockets and 8 cores?

    Please clarify.

    Reply
  13. Andy C

    Hi Mark, great article which is helping me at the moment trying to get deployment of VMs to be right-sized. These articles are good to point people to.

    One thing that did occur to me and that is the Cluster On Die feature available on some processors. I’m expecting it won’t make any difference and vNUMA will treat it the same as ‘any other’ NUMA presentation.

    Is it just a matter of the CoD feature increasing the number of available NUMA nodes?

    So, say, a dual-socket, 14-core host with 256GB RAM and the CoD feature enabled.

    Meaning there are 4 NUMA nodes consisting of 7 cores each & 64GB local mem each.

    How would the ‘Optimal config’ table look in that situation? Would 3 vNUMA nodes just be presented to a VM that requires 15 vCPUs (configured with 3 sockets, 5 cores) for example?

    Also, in the scenario above with the VM requiring 15 vCPUs (3 vNUMA nodes) would the optimal memory for the VM in this scenario be (up to) 196GB?

    Thanks again for the write up.

    Andy

    Reply
  14. Naga Cherukupalle

    Mark,

    Is this applicable in vSphere 6.0 also? The performance best practices document for 6.0 suggests using 1 core and many virtual sockets. For example, in my current vSphere environment I have a VM with 20 vCPUs (configured as Number of Virtual Sockets = 20 and Number of Cores per Socket = 1) on physical hosts with 28 physical cores (2 socket and 4 core per socket). So per your blog, can I reconfigure the VM as 20 vCPUs (Number of Virtual Sockets = 2 and Number of Cores per Socket = 10) in order to get vNUMA benefits? Please advise, as our environment is completely sized with more virtual sockets at 1 core per socket in VMs.

    Reply
  15. Thomas Meppen

    Hi Mark,
    I've recently deployed multiple OVAs that are configured with 8 vCPUs, but they're set as 8 vCPUs spread across 8 sockets. My hosts are 2 sockets with 6 cores each. Would it be wise to reconfigure them to be 4 cores per socket across 2 sockets? Thank you, and if you need any questions answered I'll be more than happy to.

    Reply
  16. Christine Chen

    Hi Mark, we are using VMware 6.0 and the host has 2 sockets with 12 cores each. The VM for SQL Server needs 24 vCPUs, therefore we configured 3×8, even though there are only two physical sockets. The reason is that we are running Microsoft Windows Server Enterprise Edition and the limit of max cores is 8. We would have done 2×12 if we could… I am worried that 3×8 is not the best configuration as there are only two physical sockets.

    Reply
