- 30 Nov 2022 – vSphere 8 info is at Extreme Performance Series: Automatic vTopology for Virtual Machines in vSphere 8
- 18 Jul 2019 – correction in the example where only compute is considered for vNUMA
- 20 Oct 2017 – NEW with memory considerations
Using virtualization, we have all enjoyed the flexibility to quickly create virtual machines with various virtual CPU (vCPU) configurations for a diverse set of workloads. But as we virtualize larger and more demanding workloads, like databases, on top of the latest generations of processors with up to 24 cores, special care must be taken in vCPU and vNUMA configuration to ensure performance is optimized.
Much has been said and written about how to optimally configure the vCPU presentation within a virtual machine – Sockets x Cores per Socket.
Note: When you create a new virtual machine, the number of vCPUs assigned is divided by the Cores per Socket value (default = 1 unless you change the dropdown) to give you the calculated number of Sockets. If you are using PowerCLI, these properties are known as NumCPUs and NumCoresPerSocket. In the example screenshot above, 20 vCPUs (NumCPUs) divided by 10 Cores per Socket (NumCoresPerSocket) results in 2 Sockets. Let’s refer to this virtual configuration as 2 Sockets x 10 Cores per Socket.
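The arithmetic in the note can be sketched in a few lines of Python. The function name `socket_count` is hypothetical and only mirrors the division described above; it is not a PowerCLI or vSphere API, and the requirement that the vCPU count be a whole multiple of Cores per Socket is an assumption of this sketch.

```python
def socket_count(num_cpus: int, num_cores_per_socket: int = 1) -> int:
    """Return the number of virtual sockets for a given vCPU layout."""
    if num_cpus < 1 or num_cores_per_socket < 1:
        raise ValueError("vCPU count and cores per socket must be positive")
    # Assumption: the vCPU count must divide evenly into sockets.
    if num_cpus % num_cores_per_socket != 0:
        raise ValueError("vCPU count must be a multiple of cores per socket")
    return num_cpus // num_cores_per_socket


print(socket_count(20, 10))  # 2 -> "2 Sockets x 10 Cores per Socket"
print(socket_count(8))       # 8 -> default of 1 core per socket
```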
This setting was originally introduced in vSphere 4.1 to overcome operating system license limitations. As of vSphere 5, those configuration items now set the virtual NUMA (vNUMA) topology that is exposed to the guest operating system.
NUMA is becoming increasingly important to ensure workloads, like databases, allocate and consume memory within the same physical NUMA node on which their vCPUs are scheduled. When a virtual machine is sized larger than a single physical NUMA node, a vNUMA topology is created and presented to the guest operating system. This virtual construct allows a workload within the virtual machine to benefit from physical NUMA, while continuing to support functions like vMotion.
While the vSphere platform is extremely configurable, that flexibility can sometimes be our worst enemy because it allows for many sub-optimal configurations.
Back in 2013, I posted an article about how Cores per Socket could affect performance based on how vNUMA was configured as a result. In that article, I suggested different options to ensure the vNUMA presentation for a virtual machine was correct and optimal. The easiest way to achieve this was to leave Cores per Socket at the default of 1 which presents the vCPU count as Sockets without configuring any virtual cores. Using this configuration, ESXi would automatically generate and present the optimal vNUMA topology to the virtual machine.
However, this suggestion has a few shortcomings. Since the vCPUs are presented as Sockets alone, licensing models for Microsoft operating systems and applications were potentially limited by the number of sockets. This is less of an issue today with operating system core-based licensing, which Microsoft transitioned to starting with Windows Server 2016, but is still a consideration for earlier releases.
Example: Since both Windows Server 2012 and 2016 only support up to 64 sockets, creating a “monster” Windows virtual machine with more than 64 vCPUs requires us to increase the Cores per Socket so the guest can consume all the assigned processors.
Example: A virtual machine with 8 Sockets x 1 Core per Socket, hosting a single Microsoft SQL Server 2016 Standard Edition license would only be able to consume 4 of the 8 vCPUs since that edition’s license limits to “lesser of 4 sockets or 24 cores.” If the virtual machine is configured with 1 Socket x 8 Cores per Socket, all 8 vCPUs could be leveraged.
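The licensing example above can be checked with a small calculation. This is a deliberately simplified model of the "lesser of 4 sockets or 24 cores" limit quoted above; the function `usable_vcpus` and its parameters are hypothetical, and real licensing terms contain more detail than this sketch captures.

```python
def usable_vcpus(sockets: int, cores_per_socket: int,
                 max_sockets: int = 4, max_cores: int = 24) -> int:
    """vCPUs an edition can use when limited to the lesser of
    max_sockets sockets or max_cores cores (simplified model)."""
    usable_sockets = min(sockets, max_sockets)
    return min(usable_sockets * cores_per_socket, max_cores)


print(usable_vcpus(8, 1))  # 4 -> only 4 of the 8 vCPUs are usable
print(usable_vcpus(1, 8))  # 8 -> all 8 vCPUs are usable
```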
Additionally, some applications, like Microsoft SQL Server 2016, behave differently based on the Cores per Socket topology presented to them.
Example: A virtual machine hosting a Microsoft SQL Server 2016 Enterprise Edition license created with 8 Sockets x 2 Cores per Socket may behave differently than a virtual machine created with 2 Sockets x 8 Cores per Socket, even though they’re both 16 vCPUs. This is due to the soft-NUMA feature within SQL Server which gets automatically configured based on the number of cores the operating system can use (reference: https://msdn.microsoft.com/en-us/library/ms345357.aspx).
vNUMA Behavior Changes in vSphere 6.5
In an effort to automate and simplify configurations for optimal performance, vSphere 6.5 introduced a few changes in vNUMA behavior. Thanks to Frank Denneman for thoroughly documenting them.
Essentially, the vNUMA presentation under vSphere 6.5 is no longer controlled by the Cores per Socket value. vSphere will now always present the optimal vNUMA topology unless you use advanced settings.
vNUMA Considers Compute Only
When a vNUMA topology is calculated, it only considers the compute dimension. It does not take into account the amount of memory configured to the virtual machine or the amount of memory available within each pNUMA node when a topology is calculated. So, this needs to be accounted for manually.
Example: An ESXi host has 2 pSockets, each with 10 Cores per Socket, and has 128GB RAM per pNUMA node, totalling 256GB per host.
If you create a virtual machine with 128GB of RAM and 1 Socket x 10 Cores per Socket, vSphere will create a single vNUMA node. The virtual machine will fit into a single pNUMA node.
If you create a virtual machine with 192GB RAM and 1 Socket x 10 Cores per Socket, vSphere will still only create a single vNUMA node, even though the requirements of the virtual machine will cross 2 pNUMA nodes, resulting in remote memory access. This is because only the compute dimension is considered.
The optimal configuration for this virtual machine would be 2 Sockets x 5 Cores per Socket, for which vSphere will create 2 vNUMA nodes and distribute 96GB of RAM to each of them.
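Since vSphere only looks at compute here, the memory check has to be done by hand. The following is a minimal sketch of that manual check, not of anything ESXi does internally; the function `split_across_nodes` and its parameter names are hypothetical.

```python
import math


def split_across_nodes(vcpus: int, vm_mem_gb: float,
                       cores_per_pnuma: int, mem_per_pnuma_gb: float):
    """Return (nodes, vcpus_per_node, mem_gb_per_node) using the fewest
    pNUMA nodes that satisfy both the compute and memory dimensions."""
    by_compute = math.ceil(vcpus / cores_per_pnuma)
    by_memory = math.ceil(vm_mem_gb / mem_per_pnuma_gb)
    nodes = max(by_compute, by_memory)
    # Integer division: assumes vcpus divides evenly across the nodes.
    return nodes, vcpus // nodes, vm_mem_gb / nodes


# Host from the example: 10 cores and 128GB RAM per pNUMA node.
print(split_across_nodes(10, 128, 10, 128))  # (1, 10, 128.0)
print(split_across_nodes(10, 192, 10, 128))  # (2, 5, 96.0)
```

The second call reproduces the 2 Sockets x 5 Cores per Socket layout with 96GB per vNUMA node.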
So how do we make this easier?
Based on the value a compute configuration can provide, these are some simple rules that can be followed to ensure the optimal configuration is implemented.
I propose the following guidelines:
- While there are many advanced vNUMA settings, only in rare cases do they need to be changed from the defaults.
- Always configure the virtual machine vCPU count to be reflected as Cores per Socket, until you exceed the physical core count of a single physical NUMA node OR until you exceed the total memory available on a single physical NUMA node.
- When you need to configure more vCPUs than there are physical cores in the NUMA node, OR if you assign more memory than a NUMA node contains, evenly divide the vCPU count across the minimum number of NUMA nodes.
- Don’t assign an odd number of vCPUs when the size of your virtual machine, measured by vCPU count or configured memory, exceeds a physical NUMA node.
- Don’t turn on vCPU Hot Add unless you’re okay with vNUMA being turned off.
- Don’t create a VM larger than the total number of physical cores of your host.
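As a rough illustration, the sizing guidelines above can be expressed as a small helper. This is a sketch under stated assumptions, not a VMware tool: the function `recommend_topology` and its parameters are hypothetical, it assumes every pNUMA node on the host has the same core count and memory, and it does not model Hot Add or advanced vNUMA settings.

```python
import math


def recommend_topology(vcpus: int, vm_mem_gb: float, pnuma_nodes: int,
                       cores_per_pnuma: int, mem_per_pnuma_gb: float):
    """Return (sockets, cores_per_socket) following the guidelines."""
    if vcpus > pnuma_nodes * cores_per_pnuma:
        raise ValueError("VM exceeds the host's physical core count")
    # Fewest pNUMA nodes that satisfy both compute and memory.
    nodes = max(math.ceil(vcpus / cores_per_pnuma),
                math.ceil(vm_mem_gb / mem_per_pnuma_gb))
    if nodes > pnuma_nodes:
        raise ValueError("VM memory exceeds the host's pNUMA nodes")
    if vcpus % nodes != 0:
        raise ValueError("vCPU count does not divide evenly across "
                         f"{nodes} NUMA nodes; adjust the vCPU count")
    return nodes, vcpus // nodes


# Dual-socket, 10-core host with 128GB RAM per pNUMA node.
print(recommend_topology(8, 64, 2, 10, 128))    # (1, 8)
print(recommend_topology(10, 192, 2, 10, 128))  # (2, 5)
print(recommend_topology(12, 64, 2, 10, 128))   # (2, 6)
```

An 11 vCPU virtual machine on the same host raises an error, which matches the guideline against odd vCPU counts once a virtual machine exceeds a single pNUMA node.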
Example: This table outlines how a virtual machine should be configured on a dual-socket 10-core physical host to ensure an optimal vNUMA topology and performance regardless of vSphere version where the assigned memory is less than or equal to a pNUMA node.
Example: This table outlines how a virtual machine should be configured on a dual-socket 10-core physical host to ensure an optimal vNUMA topology and performance regardless of vSphere version where the assigned memory is greater than that of a pNUMA node.
To sum it up, these are some of the benefits provided using these guidelines:
- The resulting vNUMA topology will be correct and optimal regardless of the version of vSphere you are using.
- Reduction in remote memory access since a virtual machine will be contained within the fewest number of physical NUMA nodes.
- Proper alignment with most operating system and application license models.
- Provides the in-guest application the opportunity for optimal self-configuration.
There are always exceptions to these guidelines, so let’s explore yours in the comments section below.
91 comments have been added so far
If your VMs were created the old way (VCPUs set by socket count), is it worthwhile to go back and change them to the core-count method after upgrading to vSphere 6.5? That is, VMs currently having 4 sockets x 1 core/socket. Should this be now changed to 1 socket x 4 cores/socket?
Or should this new guideline only be used when creating new VMs going forward? If so, does that mean the template VMs should be changed after upgrading to vSphere 6.5? Or would you need to rebuild your templates from scratch using the new guideline?
The configuration of the sockets and cores for the virtual machine becomes VERY important when you cross a pNUMA node.
When the virtual machine is smaller than a pNUMA node, there will ‘probably’ be less impact since the virtual machine will be scheduled into 1 pNUMA node anyway.
So I’d suggest an approach like:
1) re-configuring all VMs that are larger than a pNUMA node (priority)
2) reconfigure templates so all new VMs are created appropriately (priority)
3) evaluate the effort of re-configuring everything else (good hygiene) <- personally I'd do this
What’s the tradeoff in performance vs enabling hot-add vCPU and using 1 core per socket, so we can start small and hot add CPUs 1 at a time as needed? App owners historically are very bad at ‘right-sizing’. Also, by turning off hot-add, you are taking away one of our best features of the platform over other hypervisors.
vNUMA will be off when CPU Hot-Add is on. So, if your VM spans more than one physical NUMA node, those nodes will not be presented within the VM. The guest operating system will see one large NUMA node with all cores and with “UMA” memory. It will not be able to tell which memory is local to a CPU and which is remote. NUMA-aware applications like the SQL Server database engine will take a performance hit.
You can confirm the topology the guest actually sees, and so anticipate that performance impact, with Sysinternals’ coreinfo.exe.
As Klaus said – if your app is NUMA aware (MS SQL being the best example) – you want vNUMA enabled with an optimal config.
Now if your VM is smaller than a pNUMA node, then the benefit may or may not be measurable, since the VM will be scheduled into 1 pNUMA node. So you could continue to leverage hot add for small VMs. I highly recommend not adding hot add into templates. That said, my experience is that hot add highlights a miss in operational processes. I might suggest evaluating a more comprehensive capacity and performance tool like vROps and its right-sizing reports.
Thanks for the clarification. When Hot add is activated, is it possible that a 4 vCPU VM runs on 2 pNUMA nodes? Because it will just see a pool with all the cores, without distinguishing the pNUMA nodes.
I repeat what you said to be sure I fully understand.
Hot-Add CPU will only turn off the vNUMA presentation to that virtual machine (which incidentally doesn’t matter in this case since by default vNUMA is only enabled at 9 vCPU or larger). vSphere will still schedule that 4 vCPU virtual machine into a single pNUMA node.
Lets say I have a Host with 4 Sockets x 8 Cores per Socket and 512 GB RAM (slots fully loaded). My VM is 8 vCPU with > 256 GB RAM. According to the list above, I would have 1 socket by 8 cores leaving me with 1 vNUMA node, but with 392 GB RAM, I’m crossing into the 2nd pNUMA node. Does the VMware scheduler recognize this and schedule my VM across both pNUMA nodes? or would this be one of the exceptions to the rule where I would need to configure my vCPU with 2 sockets?
I was using 4 sockets as an example but math for 2 sockets. Anyway, I think you get what I’m asking.
Let’s see if I followed this:
You have a host with 4×8 and 512GB RAM which means 4 pNUMA nodes with 128GB RAM each.
The optimal VM size would be 1×8 with 128GB to fit into the pNUMA node.
I’m currently creating another article to look at use cases like yours with larger memory requirements than the pNUMA node has. Stay tuned.
Have you completed that article yet?
Excellent article and thank you very much for detailed explanation.
I’m also interested in the article covering other use cases.
Please let us know when it is available.
Your article is highly anticipated.
Thanks for your article.
If I understand, it’s better to use one socket when it’s possible and it’s the best practice for small VMs. But for example, if I have nodes with 2×16 cores and I want some database VMs with 12 or 16 cores on the cluster, is it still the best practice to use 12 or 16 sockets / 1 core?
One of the only reasons to really cross a pNUMA node with a VM would be that there isn’t enough memory on the pNUMA node to satisfy the configuration for the VM. Otherwise I suggest, as outlined above, to use the max pCore count of the pSocket first before incrementing the vSocket count.
You suggest you have 2×16 hosts so I’d always use 1×1, 1×2, 1×3…1×15, 1×16, then 2×9, 2×10… etc
So no – I suggest you don’t ever use 12×1 or 16×1 as configurations.
Hope that helps clarify.
I expect you would say yes (assuming there is no OS/App licensing concern and VM memory is less than a pNUMA node), or this will contradict with what you mentioned in the article (extracted below).
“The easiest way to achieve this was to leave Cores per Socket at the default of 1 which present the vCPU count as Sockets without configuring any virtual cores. Using this configuration, ESXi would automatically generate and present the optimal vNUMA topology to the virtual machine.”
Please clarify. Thanks.
You picked up on this, good for you. This is the kind of thing I’ve been seeing with VMware write-ups for years, and with VMware’s documentation being written by god knows who, you have nowhere to turn at times. I am very happy to see newer blogs addressing these concerns. What I still see missing is that VMware bloggers often put together a good look at things without actually spelling out the action items needed, and when they are written without all the details needed, it still leaves admins guessing what to do. This is a good article in that, although confusing at first because of the change in recommendation, it provides the action items for the admins.
That being said, you log into vCenter 6.5, and you add CPU to a new VM with the drop-down, and it auto-defaults to X sockets with 1 core per socket, exactly the opposite of what this article is now stating. So am I to believe that VMware is just now figuring this out about its own product? Perhaps it’s because all the guys who created this great stuff have left the organization without leaving a good trail of internal documentation? Can’t say. But I am glad that I came across this post.
does VMware provide some tools (PowerCLI code, vROps reporting, vCenter alarms) to check/remediate for optimal vNUMA configuration of a vSphere environment?
If you plan to migrate to new hardware (i.e. hw lifecycle) this is quite an important task to address post-migration.
There are some future efforts being made here but as of today I’m not aware of any VMware tool or code to support this. Sorry.
thanks for your answer!
I’m looking forward to those efforts.
Thanks for this write up. We have a server that we want to bump from a 4 core to a 6 core machine. would the best bet be to make it a single socket with 6 cores?
Do you mean you have a virtual machine with 4 vCPUs and you are upsizing it to 6 vCPUs?
If so, then yes a 1×6 would be the most appropriate configuration assuming your ESXi host has at least 6 cores per socket.
I was wondering if the article has been written for larger memory requirements than the pNUMA node has. I currently need to create a SQL server with 512 GB of RAM with 8 vCPUs. The host is a dual processor with 12 cores per CPU and a total of 512 GB.
Thanks for this article. I am somewhat confused as I thought 1 core per socket was the recommended configuration. The 2013 article seems to imply that was the best performance, but the table above seems to say the opposite, to increase the cores per socket.
I have a Physical Server with 2 Sockets, 8 Cores per socket, so 32 logical processors. There is 512GB of RAM in this host. This host is ESXi 5.5. I have a VM with 14 vCPU and it is set at 14 sockets with 1 core per socket. It has 256GB of Memory. This server runs SQL 2012 Enterprise. Should it be set to 2 Sockets with 7 cores per socket?
This table and rule set should be considered the current recommendation replacing the 2013 article.
Yes – your VM should be 2×7 for best performance and presentation.
Jason is using Esxi 5.5. Are you saying that this article supersedes the 2013 article for all versions or just v6.5
Mark, Thanks for the wonderful article.
Following is the query that we have and want to clarify:
We’ve got 1 host with 8 physical cores, and with hyper-threading it shows 16 logical processors.
Technically we can assign 16 cores to the VM but the article says you should not allocate more than available physical cores to a VM meaning we should allocate only 8 Cores.
Question to you is what is the implication if we allocate more vCPU cores than the Physical Cores available on the physical host?
When you start to build virtual machines larger than the pCore count (which you can, but I consider that an ‘advanced’ activity and you need to keep consolidation ratios low), you place yourself in a situation in which you can generate contention quickly and make the performance worse than if you sized it less than or equal to pCore count.
Remember Hyper-threading offers value by keeping the pCore as busy as possible using two threads. If one thread is saturating the pCore, there’s little value in using a second thread.
So assuming your VM can keep 8 vCPUs busy, increasing the vCPU count doesn’t mean your VM will get more clock cycles. In fact, it may now create scheduling contention so that overall throughput is reduced.
You can build VMs larger than pCore count when your applications value parallelization over utilization. For example, if you know the application values threads, but utilization of each thread never exceeds 50%, then 16 threads @ 50% = 8 pCore saturation.
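The arithmetic in that last example can be written out as a quick check. The function `pcores_demanded` is a hypothetical name, and treating demand as vCPU count times average utilization is a simplification that ignores scheduling effects such as Co-Stop.

```python
def pcores_demanded(vcpus: int, avg_utilization: float) -> float:
    """Approximate pCore demand: vCPUs times average per-vCPU utilization."""
    return vcpus * avg_utilization


host_pcores = 8
demand = pcores_demanded(16, 0.50)
print(demand)                 # 8.0 -> 16 threads @ 50% saturate 8 pCores
print(demand <= host_pcores)  # True
```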
Thanks for the great writeup. It’s clear not to enable hot add as it turns off vNUMA.
1. Is it applicable for small VMs with 4 vCPUs and 4 GB RAM too? If hot add is enabled, can I configure this VM with 4 sockets and 1 core and VMware will take care of allocating resources locally? or should I still go with 1 socket and 4 cores?
2. For larger VMs – have a host with 2 sockets, 14 cores and 256 GB RAM, (128 GB Memory in one pNUMA).
Two scenarios here.
a)VM with 14 vCPUs and 32 GB Memory. So is it better to configure 1 socket 14 cores (as Memory is still local)?
b) VM with 16 CPUs and 32 GB Memory. Configure 2 sockets and 8 cores.
#1 – you can use Hot-Add CPU for VM’s that are smaller than pNUMA node size, since we’ll schedule the whole VM into a single pNUMA node anyways. But I wouldn’t use Hot-Add CPU for a VM larger than pNUMA node. I would also still use the best presentation of 1×4.
#2a – correct
#2b – correct
Hi Mark, great article which is helping me at the moment trying to get deployment of VMs to be right-sized. These articles are good to point people to.
One thing that did occur to me and that is the Cluster On Die feature available on some processors. I’m expecting it won’t make any difference and vNUMA will treat it the same as ‘any other’ NUMA presentation.
Is it just a matter that the CoD feature just increases the available NUMA nodes?
So, say, a dual socket, 14 core host. with 256GB RAM. CoD feature enabled.
Meaning there are 4 NUMA nodes consisting of 7 cores each & 64GB local mem each.
How would the ‘Optimal config’ table look in that situation? Would 3 vNUMA nodes just be presented to a VM that requires 15 vCPUs (configured with 3 sockets, 5 cores) for example?
Also, in the scenario above with the VM requiring 15 vCPUs (3 vNUMA nodes) would the optimal memory for the VM in this scenario be (up to) 196GB?
Thanks again for the write up.
That’s correct – using CoD will split the socket into two logical pNUMA nodes using a second home agent.
So CoD ‘off’ means each socket is a single pNUMA node (with all 14 pCores associated with it) with 128GB RAM.
CoD ‘on’ means that same socket now appears as two pNUMA nodes (each with 7 pCores associated with it) with 64GB RAM each.
The optimal configuration would be to use the fewest number of pNUMA nodes.
If you needed 15 vCPU’s, one more than could fit into two pNUMA nodes, the optimal presentation would be 3×5. If each pNUMA node had 64GB RAM, then with 3×5 you could access up to 192GB RAM before requiring a 4th pNUMA node – which incidentally I’d reconfigure VM to 16 vCPU and present as 4×4.
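The Cluster-on-Die numbers in this reply can be reproduced with a short calculation. This is only a sketch of the arithmetic discussed in this thread; `pnuma_layout` is a hypothetical helper, and it assumes CoD splits each socket into exactly two equal pNUMA nodes.

```python
import math


def pnuma_layout(sockets: int, cores_per_socket: int,
                 mem_per_socket_gb: float, cod_enabled: bool):
    """Return (pnuma_nodes, cores_per_node, mem_gb_per_node)."""
    split = 2 if cod_enabled else 1  # assumption: CoD halves each socket
    return (sockets * split,
            cores_per_socket // split,
            mem_per_socket_gb / split)


nodes, cores, mem_gb = pnuma_layout(2, 14, 128, cod_enabled=True)
print(nodes, cores, mem_gb)  # 4 7 64.0

vcpus = 15
needed = math.ceil(vcpus / cores)  # fewest pNUMA nodes for 15 vCPUs
print(needed, vcpus // needed)     # 3 5 -> present as 3x5
print(needed * mem_gb)             # 192.0 -> GB before a 4th node is needed
```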
Is this applicable in vSphere 6.0 also? The performance best practices document for 6.0 suggests using 1 core per socket and many virtual sockets. For example, in my current vSphere environment I have a VM with 20 vCPUs (configured as number of virtual sockets = 20 and number of cores per socket = 1) on physical hosts with 28 physical cores (2 socket and 4 core per socket). So per your blog, can I reconfigure the VM as 20 vCPUs (number of virtual sockets = 2 and number of cores per socket = 10) in order to get vNUMA benefits? Please advise, as our environment is completely sized with many virtual sockets and 1 core per socket in VMs.
Physical hosts with 28 Physical cores( 2 socket and 14 core per socket) .
Yes these guidelines apply to all vSphere versions.
If your ESXi host is 2×14, then I’d configure a 20 vCPU VM as 2×10 – especially if the VM app is NUMA aware.
I’ve recently deployed multiple OVA’s that are configured with 8 vCPU but they’re set as 8 vCPU spread across 8 sockets. My hosts are 2 sockets with 6 core each. Would it be wise to reconfigure them to be 4vCPU/2 sockets? Thank you and if you need any questions answered I’ll be more than happy to.
OVA’s are often defined with just sockets as the OVA developer doesn’t know onto which type of processor they will be deployed. This is the safest deployment model. That said, I personally would reconfigure the VM to 2×4 for this specific configuration.
Hi Mark, we are using VMware 6.0 and the host has 2 sockets with 12 cores each. The VM for SQL Server needs 24 vCPUs, therefore we configured 3×8, even though there are only two physical sockets. The reason is that we are running Microsoft Windows Server Enterprise Edition and the limit of max cores is 8. We would have done 2×12 if we could… I am worried that 3×8 is not the best configuration as there are only two physical sockets.
3×8 isn’t optimal since it will create 3 vNUMA clients, of which the first will be scheduled on socket 0 and the second on socket 1. The third client won’t fit cleanly on either socket, so it will probably bounce back and forth depending on socket load. This will create unnecessary migrations. Ideally, it should be 2×12.
Depending on the load, you may, or may not, see an effect on the application, but it’s not an optimal configuration.
Hi Mark, I am upgrading a SQL server from 16vCPU to 24 vCPU. Is 2×12 the proper configuration if I have a 2×10 pCPU on the host?
If the server only has 2 sockets with 10 cores per socket, plus Hyper-threads (so 40 logical processors), I’d be cautious creating a single VM larger than 2×10 (see guideline #6).
The reason for the conservative recommendation is that “if” you saturate 20 pCores with a single VM, then the hyper-threads offer little value. So anything larger than 20 vCPUs may impact performance, unless the VM/App values parallelization and the vCPUs don’t all run at 100%. Then you could over-provision the pCore count for a single VM. But watch contention closely – especially Co-Stop!
So in the case of over-provisioning pCore count, yes 2×12 would be the right configuration – but – watch contention closely.
Hello, I have an R510 server with (2) 6 core processors. In ESXi 6.5 the CPU selection has changed. It has a pull-down for “CPU” 1-12 (total cores available) and then “Cores per Socket”; this pull-down is the CPU count or lower. Next to that it says “Sockets :”
If I put 2 CPUs and 2 ‘Cores per Socket’ it says Sockets: 1
If I put 2 CPUs and 1 ‘Cores per Socket’ it says Sockets: 2
I cannot find an example of what this means… all of the examples just say something like pick 1 CPU and 8 cores, and I’m not sure how to do that in 6.5.
If you were to install 2 Windows Server 2016 VMs on this box, one of them running SQL Server 2016, what would you set the CPU/cores per socket to?
Thanks for any clarification.
I’m confused by two of the statements in this article:
> Essentially, the vNUMA presentation under vSphere 6.5 is no longer controlled by the Cores per Socket value. vSphere will now always present the optimal vNUMA topology unless you use advanced settings.
> When a vNUMA topology is calculated, it only considers the compute dimension. It does not take into account the amount of memory configured to the virtual machine, or the amount of memory available within each pNUMA node when a topology is calculated. So, this needs to be accounted for manually.
So, in vSphere 6.5, how are we best to ensure that high memory but relatively low vCPU workloads are using a vNUMA topology that matches the physical topology?
I have multiple SQL guests that are allocated 8 vCPU and 128GB of RAM. My physical hosts are 20 core x 2 sockets (40 cores total) with 512GB of RAM. How do I ensure that if I have 3 or more SQL guests active on the same host, they’re not running into memory/performance problems from spanning NUMA nodes? My assumption is that under normal operation and just letting VMware do its thing, I’d wind up with:
– All three guests with 1 vNUMA node, 8 cores, 128GB of RAM
Physical allocations: the host is using some RAM out of each pNUMA bank. Guest 1 allocated to pNUMA bank 1, is fine. Guest 2 allocated to pNUMA bank 2, is fine. Guest 3 comes along, there isn’t enough RAM available in either pNUMA node to fully satisfy the 128GB request, so guest 3 gets its memory spanned across NUMA nodes despite a 1 vNUMA node configuration.
You are correct.
If each VM is 1×8 with 128GB RAM, then the ESXi schedulers will put two VMs on Socket0 and one VM on Socket1. You are also correct that available memory per physical NUMA node is slightly less than 256GB since the hypervisor uses some (let’s call it 2-4%), so the guest OS on the second VM on Socket0 will still only see 1 vNUMA node, but in fact will be accessing a small amount of memory from a different pNUMA node (less than 10GB).
So should we worry about this? In my experience – No. It would be rare to see it negatively affect database KPI’s since it’s a small amount of memory and still very low latency.
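To put a rough number on that spill-over, here is a small sketch of the arithmetic. The 3% hypervisor overhead is an assumed midpoint of the 2-4% range mentioned above, and `remote_memory_gb` is a hypothetical helper, not a measurement of what ESXi actually reserves.

```python
def remote_memory_gb(vm_mem_gb: float, vms_on_node: int,
                     node_mem_gb: float, overhead_fraction: float) -> float:
    """Memory (GB) that cannot be satisfied locally on one pNUMA node."""
    available = node_mem_gb * (1 - overhead_fraction)
    return max(0.0, vm_mem_gb * vms_on_node - available)


# Two 128GB VMs on a 256GB pNUMA node, assuming 3% hypervisor overhead.
print(round(remote_memory_gb(128, 2, 256, 0.03), 2))  # 7.68
# A single 128GB VM on the other node fits entirely in local memory.
print(remote_memory_gb(128, 1, 256, 0.03))            # 0.0
```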
Hi Mark. What is the difference between a VM configured as 16×1, 4×4, or 1×16 on an Intel 2×16 host with hyper-threading on (64 logical processors)? Can you explain? I suppose there is no difference for vNUMA. I ask because we have hundreds of 4×4 VMs. Thanks.
Assuming we’re talking about using vSphere 6.5, then vNUMA doesn’t matter since it’s automatically optimized.
BUT, the presentation of Sockets and Cores is still VERY important to the guest OS. It makes its own decisions based on what it sees. So the performance ‘could’ be different between 1×16, 16×1 and 4×4. The optimal configuration here is 1×16. While this may not make a difference depending on the application and what it’s doing, it is the right hygiene.
Hi Mark ,
Point No. 5, don’t enable CPU hot add. Here are some questions for you:
1. I have created a VM with 12 vCPUs. The host has 2 CPU packages with 10 cores each. Would vNUMA be exposed to the OS?
2. Hypothetically, another scenario: I have created a VM with 10 vCPUs, enabled hot add, and added 2 more vCPUs. How is this different from 1, and why would NUMA not be exposed to this VM? Why would it go to UMA?
1) This answer depends on what version of vSphere is being used, the VM configuration and whether or not HotAdd has been enabled.
If we assume vSphere 6.5, and HotAdd is not enabled, then a 12 vCPU VM on that host, regardless of the presentation of vSocket and vCores, will be presented 2 vNUMA nodes.
Note: Optimally it should be configured as 2×6 without HotAdd.
If we assume vSphere 6.5 and HotAdd “is” enabled, then a 12 vCPU VM on that host will only be presented 1 vNUMA node.
2) Since you’ve enabled HotAdd, the VM will only ever see 1 NUMA node, regardless of whether it is 10 vCPU or 12 vCPU.
I guess the big question is – what is the workload and is it NUMA aware?
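The two answers above can be summarized as a simplified model. This sketch only encodes what is stated in this thread for vSphere 6.5 (Hot Add collapses the presentation to one node, and vNUMA is by default only enabled at 9 vCPUs or larger); `vnuma_nodes_presented` and its parameters are hypothetical and do not reflect advanced settings or memory.

```python
import math


def vnuma_nodes_presented(vcpus: int, cores_per_pnuma: int,
                          hot_add: bool, min_vcpus: int = 9) -> int:
    """vNUMA nodes the guest sees under the simplified rules above."""
    if hot_add or vcpus < min_vcpus:
        return 1  # vNUMA is not exposed; the guest sees a single node
    return math.ceil(vcpus / cores_per_pnuma)


# Host with 2 CPU packages of 10 cores each.
print(vnuma_nodes_presented(12, 10, hot_add=False))  # 2
print(vnuma_nodes_presented(12, 10, hot_add=True))   # 1
print(vnuma_nodes_presented(10, 10, hot_add=True))   # 1
```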
Mark, thanks for your great insights. I’m analyzing a potential performance issue and would like to hear your thoughts.
Physical server is AMD Opteron(tm) Processor 6378.
CPU Packages: 2
CPU Cores: 32
CPU Threads: 32
I believe these are 8 cores per socket with HT to give 16 logical processors per socket. NUMA node count is 4, so each of the 4 pNUMA nodes is 8 cores with 127GB.
The guest VM is allocated 16 vCPUs and 64GB. numactl shows a single vNUMA node but I feel it could be spanning 2 pNUMA nodes. It is possible that non-local memory access is involved and the application is impacted with performance issues. vmstat and iostat on the guest VM do not indicate any bottleneck but the application is impacted with long Java GC pauses.
I currently don’t have access to the physical host. I have requested NUMA stats from esxtop.
Meanwhile, I’d appreciate your thoughts.
The AMD Opteron 6378 is a multi-chip module based on Piledriver architecture. AMD doesn’t use Hyper-Threading which is an Intel only SMT technology. (note: Just recently AMD released their own flavour of SMT in the Zen architecture).
So each 16 core CPU package is actually 4x 4 core piledriver modules which translates to 4 pNUMA nodes per package.
So a dual socket configuration would mean a total of 8 pNUMA nodes and 32 Cores that ESXi would see.
Now what you see from the in-guest perspective is controlled by the VM configuration itself and the version of ESXi that’s being used.
Example: Using vSphere 6
VM configured as 1 vSocket x 16 Cores per socket = 1 vNUMA node
VM configured as 16 vSockets x 1 Cores per socket = 4 vNUMA nodes
So you should clarify the exact physical server specs, vSphere version, and how the VM is configured. From that we can generate an optimal config.
Additionally, GC is a serial process and thus can only use 1 vCPU – if you’re feeling GC pain, I’d check for vCPU saturation and contention.
Great write up, but the one thing I have found with testing different configurations is this: it makes no difference how many sockets you provision vs cores, or vice versa. vNUMA always does what it wants by default, which is to divide up the vCPUs across the underlying NUMA nodes of your host as soon as you go past 8 vCPUs and surpass the # of cores in the NUMA node \ socket. Even if I have a quad socket server, and I exceed 8 vCPUs and exceed the core count in a single NUMA node, if I provision this VM with 4 sockets it will still not show 4 NUMA nodes in the guest OS, but just 2. The cores vs sockets thing means absolutely nothing other than licensing issues with certain products. The only time it goes past 2 NUMA nodes in the guest OS is when your vCPU allocation exceeds the # of CPUs in 2 of the NUMA nodes, and it always seems to divide evenly across the NUMA nodes that it presents to the guest OS. So, based on this, the recommendation to allocate VMs in a core per socket fashion is irrelevant no matter how many vCPUs you need to allocate. VMware vNUMA doesn’t present NUMA nodes based on how many sockets you provision when allocating vCPU. TEST it out folks and you will see.
Bottom line, vNUMA does its own thing regardless of what you want to do. You would need to enable the advanced parameters to do what you want to do. But I wouldn’t, VMware has set this up really nicely. Also, another takeaway: just keep your VM vCPU allocations under the # of cores in your NUMA node, otherwise, in a highly shared environment, not only do you introduce remote memory access in NUMA, but your CPU ready times will go up and you might not get any gains with additional CPU at all, depending of course.
Need to correct my statement above, after further review and testing in v6.5, the statement “it makes no difference how many sockets you provision vs cores, or vice versa, vNUMA always does what it wants by default” is incorrect. Please disregard this statement.
You do have control over cores per socket when vNUMA becomes enabled, and once it is, if you deploy a VM with 4 virtual sockets you will see those 4 sockets presented to the Guest OS as NUMA nodes. What I do see is if you deploy a VM with more virtual sockets than the # of NUMA nodes in the host, vNUMA does some math and divides up the # virtual sockets to an optimal # and presents it. On a quad socket server, I tested changing the # of virtual sockets leaving the same # of vCPUs for each test, and the result was that sometimes I would end up with 3 NUMA nodes presented, and sometimes 4, etc. So it does make sense to deploy cores per socket and let VMware handle the NUMA presentation to the Guest.
All these automatic vSphere CPU layout optimizations look fine while we are talking about relatively small workloads.
Are there any tech resources available on how to fine-tune big workloads that are sensitive to latency? Like VMs with 80 vCPUs and 512 GB RAM. We see differences in performance just from making some layout changes, but there is no guidance on how to do it the right way.
Most of the scarce optimization/fine-tuning docs we find are outdated and no longer valid with vSphere 6.5.
The new Xeon Gold and Platinum processors can only address 768 GB RAM per socket. If you want 1.5 TB RAM per socket you have to buy an M variant. Assume I have a server with, for example, 2 sockets, each with 22 cores and 768 GB RAM (a total of 1.5 TB per server).
As long as I configure my VMs with <= 22 cores and <= 768 GB, are they always scheduled into one pNUMA node?
Should the VMs be configured as 1 socket / 1-22 cores?
What would happen if a VM were larger than one pNUMA node and the Xeon is not the M version? A virtual CPU on pNUMA node 1 wouldn’t be able to access the memory of a virtual CPU on the other pNUMA node because the physical CPU can only address its own physical memory (768 GB). So a remote NUMA access wouldn’t be possible. What would be the consequence? Will the VM crash?
Standardization is our day-to-day practice. All of our ESXi hosts are dual socket with different core counts and memory configurations: 8, 10, or 14 cores per socket and 96, 128, 256, or 384 GB of memory. We currently still run vSphere 6.x. In our environment we have VMs configured with more memory than a vNUMA node handles, but also small VMs whose memory is equal to or smaller than a vNUMA node.
In this scenario, if we want to standardize across the board and knowing that all hosts are 2 sockets, would the SECOND TABLE be the way to go even if some of our VMs have a memory config equal to or smaller than the vNUMA node? Is there a performance impact from doing this?
First of all, we appreciate your contribution in making us very comfortable with VM resource provisioning. I have a very quick question about the number of vNUMA nodes created in the environments below.
ESXi version 6.0:
ESXi host: 2 Sockets 8 Cores per socket.
RAM: 512 GB
VM: 14 Sockets 1 core per socket.
No. Of vNUMA: ??
ESXi version 6.0:
ESXi host: 2 Sockets 8 Cores per socket.
RAM: 512 GB
VM: 7 Sockets 2 cores per socket.
No. Of vNUMA: ??
ESXI Host configuration:
Processor Sockets: 2
Cores per Socket: 10
Logical Processors: 40
As Hyper-Threading is enabled, can I set the configuration below for Oracle installed on a Linux VM to get a proper vNUMA configuration?
Number of virtual sockets: 1
Number of cores per socket: 20
Memory: 128 GB
How can we tell whether vNUMA is working / enabled on a virtual machine when more than 8 vCPUs are assigned to it?
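One guest-side check, as a minimal sketch assuming a Linux guest: the kernel exposes one `/sys/devices/system/node/nodeN` directory per NUMA node it sees, so counting those directories shows how many vNUMA nodes were presented to the guest. The function name and the `sysfs_root` parameter are illustrative only.

```python
import re
from pathlib import Path


def count_guest_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Count NUMA nodes visible to a Linux guest via sysfs.

    Returns 0 if the directory is missing (for example on a non-Linux guest).
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return 0
    # Each NUMA node appears as a directory named node0, node1, ...
    return sum(
        1
        for entry in root.iterdir()
        if entry.is_dir() and re.fullmatch(r"node\d+", entry.name)
    )


if __name__ == "__main__":
    nodes = count_guest_numa_nodes()
    print(f"NUMA nodes seen by this guest: {nodes}")
    if nodes > 1:
        print("More than one node is visible, so a vNUMA topology is exposed.")
```

Inside the same guest, `numactl --hardware` or `lscpu` report the same node count without any scripting.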
I have a few questions regarding our SAP systems on VMs. First, this is our landscape:
VMware 6.5 / 5 ESXi 6.5 hosts (IBM Flex System x240 / Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz)
Each server has 191.97 GB total memory; the CPUs are:
Model Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz / Processor speed 2 GHz / Processor sockets 2 / Processor cores per socket 8 / Logical processors 32 / Hyperthreading enabled / 2 NUMA nodes, 96 GB RAM each.
Our SAP systems are all VMs.
1. If my SAP system (including the Oracle database) requires 32 vCPUs / 92 GB RAM, what is the best configuration: 1S x 32 cores / 2S x 16 cores / 4S x 8 cores / 8S x 4 cores?
2. Should big SAP systems (as in question one) be the only VM on the ESXi host?
3. How is it possible to configure a single VM with 4 sockets if the ESXi host only has 2 sockets? Or how is it possible to give a VM 1 socket with 12 cores per socket?
Question about NUMA:
1. Is NUMA the socket + system bus RAM on the ESXi host?
1. The best way is to mimic the layout of the HW, so 2S x 16 cores. That said, maxing out your HW is not the best idea, since some resources are required for the hypervisor as well.
2. It is very bad practice to place huge VMs and small VMs on the same host. CPU scheduling problems guaranteed.
3. vSphere allows that; it’s just another abstraction level. But as of vSphere 6.5, ESXi will actually ignore it by default.
Hi Mark, thanks for the great article.
We are looking to optimize the CPU performance of a single VM (no other VMs on the Host).
What would be the optimal CPU configuration for a single VM on this Host?
ESXi 5.5 U3
2 CPU Sockets
8 Cores Each
32 Logical Processors
128 GB Ram
VM currently configured for 32 GB Ram (can bump up to 128 since no other VMs)
VM currently configured for 4 CPU Sockets, 4 Cores per Socket
What about Hyperthreading turned on vs turned off in this scenario?
We are open to reducing vCPU to a single NUMA node if it will improve overall performance. Real-time CPU usage is averaging about 10% on the VM.
Looks like Mark is not monitoring this part of the blog as of Jan 2018.
But anyway, what problem are you trying to solve? In general, just follow the guidelines in Mark’s article and you’ll be good. Fitting into a single NUMA node (if you can downsize to that level) usually benefits performance.
Hi Mark, thanks for the great article !!!
Hi Mark, thank you very much!!!
Hi Mark, thanks for the great article.
Suppose a Dell R815 with AMD Opteron 6300 series processors: 32 cores total, 256 GB of RAM.
4 sockets -> 8 cores per socket -> VMware sees 8 NUMA nodes!, so 64 GB per NUMA node.
So, you have 2 NUMA nodes inside 1 socket!
Mystery: how many sockets and cores should I set on a VM that requires:
– 10 cores and 64 GB RAM
– 10 cores and 128 GB RAM
– 10 cores and 192 GB RAM
Sockets have 8 cores, but does NUMA also separate / take into consideration the cores?
So… 1 NUMA node = 4 cores / 64 GB of RAM? Or is this an incorrect approach?
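The per-node figures in this question can be checked with simple division. The sketch below takes the stated totals at face value (32 cores, 256 GB, 4 sockets, 8 NUMA nodes reported by ESXi) and assumes memory and cores are spread symmetrically; the helper name is illustrative. With those inputs, 64 GB corresponds to one socket, while one NUMA node gets 4 cores and 32 GB.

```python
def per_domain_share(total_cores, total_mem_gb, domains):
    """Split host totals evenly across sockets or NUMA nodes.

    Assumes a symmetric host, which is an assumption and not something
    the comment confirms.
    """
    if domains <= 0:
        raise ValueError("domains must be positive")
    return total_cores / domains, total_mem_gb / domains


# Figures from the comment: 32 cores, 256 GB, 8 NUMA nodes seen by ESXi.
cores_per_node, mem_per_node = per_domain_share(32, 256, 8)
print(cores_per_node, mem_per_node)      # 4.0 32.0

# The same totals split across the 4 physical sockets.
cores_per_socket, mem_per_socket = per_domain_share(32, 256, 4)
print(cores_per_socket, mem_per_socket)  # 8.0 64.0
```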
This example mentioned in the article is confusing to me:
“An ESXi host has 2 pSockets, each with 10 Cores per Socket, and has 128GB RAM per pNUMA node, totalling 256GB per host.
If you create a virtual machine with 128GB of RAM and 1 Socket x 8 Cores per Socket, vSphere will create a single vNUMA node. The virtual machine will fit into a single pNUMA node.
If you create a virtual machine with 192GB RAM and 1 Socket x 8 Cores per Socket, vSphere will still only create a single vNUMA node even though the requirements of the virtual machine will cross 2 pNUMA nodes resulting in remote memory access. This is because only the compute dimension is considered.
The optimal configuration for this virtual machine would be 2 Sockets x 4 Cores per Socket, for which vSphere will create 2 vNUMA nodes and distribute 96GB of RAM to each of them.”
Is this correct???
As of vSphere 6.5, changing the ‘cores per socket’ value no longer influences vNUMA or the configuration of the vNUMA topology. This new decoupling of the ‘cores per socket’ setting from vNUMA allows vSphere to automatically determine the best/most optimized vNUMA topology to the VM unless advanced settings are used.
And isn’t vNUMA enabled by default for VMs with 8+ vCPUs?
Are there advanced parameters configured to get 2 vNUMA nodes created for the VM in the example?
I did the test myself and did not get two NUMA nodes presented inside the guest OS (I did not configure any advanced parameters).
ESXi host: 2 x 10 cores and 384GB, 2 NUMA nodes, so each NUMA node has 10 cores / 192GB
TEST1: VM with 10 vCPUs: 1 socket / 10 cores + 256GB vRAM => resulted in 1 NUMA node inside guest OS
TEST2: VM with 10 vCPUs: 2 sockets / 5 cores + 256GB vRAM => resulted in 1 NUMA node inside guest OS (ESXTOP also only shows 1 NHN)
Please advise. Thx!
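The sizing arithmetic behind the quoted example can be sketched as follows. This is illustrative only: it counts how many pNUMA nodes a VM needs on the compute and memory dimensions separately and derives a Sockets x Cores per Socket layout that covers both. The function names are hypothetical, and the result does not predict what ESXi will actually build.

```python
import math


def nodes_needed(vcpus, mem_gb, cores_per_pnuma, mem_gb_per_pnuma):
    """Return (by_compute, by_memory): pNUMA nodes needed on each dimension."""
    by_compute = math.ceil(vcpus / cores_per_pnuma)
    by_memory = math.ceil(mem_gb / mem_gb_per_pnuma)
    return by_compute, by_memory


def suggest_layout(vcpus, mem_gb, cores_per_pnuma, mem_gb_per_pnuma):
    """Suggest (sockets, cores_per_socket, mem_gb_per_socket).

    Uses the larger of the two node counts so both dimensions fit.
    """
    sockets = max(nodes_needed(vcpus, mem_gb, cores_per_pnuma, mem_gb_per_pnuma))
    if vcpus % sockets:
        raise ValueError("vCPU count does not divide evenly across sockets")
    return sockets, vcpus // sockets, mem_gb / sockets


# Article example: 10 cores and 128 GB per pNUMA node; VM with 8 vCPUs, 192 GB.
print(nodes_needed(8, 192, 10, 128))    # (1, 2)
print(suggest_layout(8, 192, 10, 128))  # (2, 4, 96.0)

# TEST1/TEST2 host: 10 cores and 192 GB per pNUMA node; VM with 10 vCPUs, 256 GB.
print(nodes_needed(10, 256, 10, 192))   # (1, 2)
```

By compute alone, the test VM fits in one node, which is consistent with the single NUMA node observed in both tests. Whether a second vNUMA node appears is decided by ESXi and any advanced settings, not by this arithmetic.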
In our environment we currently use 2 x 14 core (56 vCPUs) with 256GB.
We use T-shirt sizes to ensure that the RAM is shared across all of the CPUs and also to leave a bit of RAM for the ESX host.
The T-shirt sizes are 2 vCPUs (8GB), 4 vCPUs (16GB), 8 vCPUs (32GB), 14 vCPUs (56GB), 28 vCPUs (112GB) and 56 vCPUs (224GB) of RAM.
This ensures there is always 32GB of RAM for the ESX host.
For every T-shirt size that is 28 vCPUs or lower, I would always expect it to be defined as a single NUMA node, to ensure that I do not end up with multiple ESX guests spread across multiple NUMA nodes with the increased likelihood of “Noisy Neighbour” interference.
Take the case of 3 ESX guests: configuring them as 1 x 14 vCPUs, 1 x 14 vCPUs and 1 x 28 vCPUs would work better than 2 x 7 vCPUs, 2 x 7 vCPUs and 2 x 14 vCPUs.
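The size table above follows a fixed ratio of 4 GB per vCPU. A short sketch of that arithmetic, using only the figures quoted in this comment (the constant names are illustrative):

```python
HOST_VCPUS = 56    # 2 x 14 cores, 56 vCPUs as stated in the comment
HOST_MEM_GB = 256
GB_PER_VCPU = 4    # implied by the listed sizes, e.g. 8 GB for 2 vCPUs

sizes = [2, 4, 8, 14, 28, 56]
table = {vcpus: vcpus * GB_PER_VCPU for vcpus in sizes}
print(table)  # {2: 8, 4: 16, 8: 32, 14: 56, 28: 112, 56: 224}

# Memory left for the host when every vCPU is handed out at this ratio.
reserve = HOST_MEM_GB - HOST_VCPUS * GB_PER_VCPU
print(reserve)  # 32
```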
Good morning Mark,
I took the time to read your complete article several times.
However, I have not been able to confirm whether hyperthreading must or must not be enabled when you deploy an SAP HANA VM on VMware vSphere.
Could you confirm this technical point, please?
Great article, thanks. Now I understand the “vNUMA behaviour” vs “selected number of socket”. Keep doing
Looking for advice on the question below.
We use NVIDIA GRID, and up to now, on an ESX server with 2 16-core processors and 512GB RAM, we have had, say, 4 VMs running with 8 vCPUs and 128GB RAM each.
Nice and easy, no problem, but what I would really like to do is give each VM 28 vCPUs and 128GB RAM.
The reason for this is that each VM is not fully loaded CPU-wise all the time, and there is a task the workstation does that really needs more cores while it is running.
So my idea is that when a VM needs the CPU, it gets it. I realise that if all four VMs were to run the task that needs more cores at the same time, they would just share the CPU.
Is this a bad idea? It just looks like a waste to have dedicated resources applied to a VM that can’t be used by another when it is not busy.
I’m still learning about VMware VM optimization. So can someone tell me if this is optimal?
ESXi host = 2 sockets 28 cores per socket
32 vCPUs = 16 per socket
128 GB of Memory
I’m still learning about VMware VM optimization. I don’t see any questions about sockets with 28 cores. So can someone tell me if this is optimal?
ESXi host = 2 sockets 28 cores per socket with hyperthreading enabled
32 vCPUs = 16 per socket
128 GB of Memory
cpu hot plug turned off
For assistance with specific calculations, please refer to this VMware Fling:
Virtual Machine Compute Optimizer (VMCO)
Clients I work with continually misconfigure cores per socket. If, for example, we ask them to provide a VM with 4 total CPUs, they will configure 4 cores in 1 socket. We have to ask that they reconfigure to 4 sockets with 1 core each. It seems that the default CPU setting is misconfigured. Is there any plan by VMware to update the settings so they default to the correct configuration?
Actually, a VMware whitepaper does say to use one core per socket, but that might not be the best solution for every kind of workload. However, I suspect non-linear distribution of load with hyperthreading active. I have a single-threaded workload in an 8 vCore VM on a CPU that has 8 physical cores. It finishes in 150 seconds if executed as one instance and in 300 seconds if executed as 4 instances (one core per socket), meaning the scheduling is faulty in ESXi 6.7 cu3.
We also ran a test on MS SQL Server (in KVM) in which a multi-threaded workload ran 20% faster with 10 cores per socket compared to 10 sockets with 1 core. Whatever you do there, you need to benchmark it…
I have a host with 2 pCPUs, 18 cores per CPU, so there is a total of 36 cores per node, and 512GB.
How do I configure a SQL VM with 16 vCPUs and 128GB RAM?
Should I do:
1 vCPU, 16 cores per socket, to stay on the same vNUMA node?
Does this mean the 1 CPU / 16 cores will have access to the 128GB RAM with crossing to the other?
I have watched your [MCL2033] session; it was great.
Please help me understand vNUMA in this configuration:
– host has 2*10 cores and 256GB RAM, so each pNUMA node has 10 CPUs and 128 GB RAM;
– we create a VM with 6 vCPUs and 192GB RAM;
– we change the cores per CPU value from 1 to 3, so we have 2 vCPUs with 3 vCores each.
Why does the scheduler create 2 vNUMA nodes if it considers only the vCPU count?
There is updated advice from Frank Denneman which contradicts the advice in this article.
So as usual, it’s both grey and black and white all at once!
My database VMs are extremely large. I have a 2 socket * 20 core physical host with HT enabled, so a total of 80 logical processors. I am currently using 16 cores and 2 sockets, so 32 CPUs total, and only one VM resides on the host. When the SQL DB VM runs hot, I see my host also complaining about CPU usage. Can I change it to 24 cores * 2 sockets to check if my DB performs better? MSFT is recommending that, but what do you suggest?