Persistent Memory technology as the new normal with VMware vSphere® Memory Monitoring and Remediation (vMMR)

Current Business Trends

Modern trends are changing the way people and businesses interact with technology. IDC predicts that global data creation and replication will experience a CAGR of 23% until 2025*. In 2020 alone, 64.2 ZB of data was created or replicated. This growth is fueled primarily by trends such as 5G, IoT, social media, and the proliferation of mobile devices. It presents both opportunities and challenges for enterprises: they want to extract maximum value from the generated data, but they often lack the infrastructure needed to turn that data into business outcomes.

Businesses have already started on their digital transformation journey with initiatives like application modernization, DevOps, data mobility, and cloud and edge computing. Enterprises are adopting AI/ML and in-memory database applications to derive more business value from their data. These applications need to run in real time so that the results make an impact and changes can be applied immediately to improve business results.

These applications require large memory capacities and are also latency sensitive. Large in-memory databases and OLTP/OLAP-class applications have traditionally required large amounts of memory, and customers often face a difficult trade-off between cost and performance, since dynamic random-access memory (DRAM) is often one of the most expensive components in the server infrastructure.

Intel® Optane™ Persistent Memory (PMem) can provide the right balance between infrastructure cost and performance. PMem (up to 3x cheaper than DRAM) helps reduce overall server costs for enterprises, with the trade-off of higher latencies (order of 10s) compared to DRAM. A combination of DRAM and PMem has been found to suit most applications, pairing the low latency of DRAM with the increased capacity of PMem.

We will next explore how memory monitoring can play a key role in overcoming some of the challenges with adopting Intel® Optane™ PMem.

Persistent memory support on VMware

With VMware vSphere 6.7EP10, VMware announced support for two different access modes supported by Intel® Optane™ PMem – Memory Mode and App-Direct Mode. For more details, refer to:

https://blogs.vmware.com/vsphere/2019/04/announcing-vmware-vsphere-support-for-intel-optane-dc-persistent-memory-technology.html

In Memory Mode, Intel CPUs utilize lower-latency DRAM as a cache-tier for contents stored in higher-latency Intel® Optane™ PMem. The total available memory capacity for use by vSphere and VMs is the amount of persistent memory available on the system. The advantages of Memory Mode are that applications can run unmodified, and that persistent memory offers lower latencies compared to the fastest storage devices. For more details on persistent memory support on VMware, please refer to:

https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-93E5390A-8FCF-4CE1-8927-9FC36E889D00.html

Persistent Memory Challenges with Memory Mode

When configuring hardware, customers often face a dilemma in choosing the appropriate mix of DRAM and PMem. Typical ratios tend to be 1:4 (DRAM:PMem), which gives roughly 20-30% cost savings over traditional DRAM-only deployments. However, workloads are expected to change over time, either through organic growth of existing applications or through new applications added to the cluster. These changes may leave the host operating at an inadequate ratio at some point in the future.
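The arithmetic behind such a ratio can be sketched as follows. This is a rough illustration, not a VMware or Intel sizing tool. It assumes that in Memory Mode the usable capacity equals the PMem capacity, because DRAM acts only as a cache. The function names and the per-GB price factor of PMem relative to DRAM are hypothetical inputs chosen for illustration.

```python
def pmem_per_dram(dram_gb: float, pmem_gb: float) -> float:
    """Return N for a DRAM:PMem capacity ratio of 1:N."""
    if dram_gb <= 0 or pmem_gb <= 0:
        raise ValueError("capacities must be positive")
    return pmem_gb / dram_gb


def memory_mode_savings(dram_gb: float, pmem_gb: float, pmem_price_factor: float) -> float:
    """Fractional cost saving versus a DRAM-only host of the same usable capacity.

    Assumptions (illustrative only):
      - usable capacity in Memory Mode equals pmem_gb (DRAM is a cache)
      - pmem_price_factor is the per-GB price of PMem divided by that of DRAM
    """
    pmem_per_dram(dram_gb, pmem_gb)  # reuse the capacity validation
    dram_only_cost = pmem_gb * 1.0   # pmem_gb of DRAM at unit price per GB
    memory_mode_cost = dram_gb * 1.0 + pmem_gb * pmem_price_factor
    return 1.0 - memory_mode_cost / dram_only_cost


if __name__ == "__main__":
    # Hypothetical host: 256 GB DRAM cache in front of 1024 GB PMem (1:4).
    print(pmem_per_dram(256, 1024))
    print(round(memory_mode_savings(256, 1024, 0.5), 3))
```

With a hypothetical price factor of 0.5, the 1:4 host in this sketch comes out 25% cheaper than a DRAM-only host of the same usable capacity. Actual savings depend on real component prices.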

With Memory Mode, VMware follows a prescriptive approach that lets partners and customers size the quantity of DRAM and persistent memory on their servers. Until recently, most partner or customer requests for specific DRAM-to-persistent-memory ratios had to go through an RPQ (Request for Product Qualification) process in which VMware carefully reviewed the chosen ratios. A good ratio of DRAM to persistent memory typically guards against performance issues caused by continuous misses in the DRAM cache and the resulting increase in persistent memory bandwidth. Since the caching algorithm in Memory Mode is largely controlled by the CPU, customers have very little insight into whether there are ongoing problems in the balance between DRAM and persistent memory usage.

In addition, per-VM statistics are just as important as Host-level statistics, because each VM has different needs depending on the workload it runs. These performance statistics help reveal irregular performance or starvation issues among VMs and help the customer remedy them quickly. Ideally, for best performance, most accesses should be served from DRAM while PMem accesses are kept to a minimum.

VMware has been using the Persistent memory Knowledge Base article (KB) https://kb.vmware.com/s/article/67645 to guide customers and partners about usable PMem configurations.

VMware vSphere Memory Monitoring and Remediation (vMMR)

VMware has introduced a new feature with vSphere 7.0U3 called vSphere Memory Monitoring and Remediation (vMMR). vMMR addresses the need for monitoring by providing running statistics at both the VM level (bandwidth) and the Host level (bandwidth, miss rates). vMMR also provides default alerts and the ability to configure custom alerts based on the workloads that run on VMs.

This feature can help administrators monitor and tune memory usage and accesses for VMs and applications and, if required, perform remediation by moving VMs to other Hosts so that they can continue delivering on their SLAs. In future versions of vSphere, based on the statistics gathered by vMMR, Distributed Resource Scheduler (DRS) will be able to perform automatic remediation in certain cases.

It is highly desirable, for instance, for VMs on a Host to have a fair share of low-latency DRAM for the application's "active set of memory pages" while at the same time minimizing usage of persistent memory. The use of DRAM as a cache sometimes makes an application's use of DRAM versus persistent memory unpredictable, and as the number of VMs and applications increases, providing such fairness or SLA guarantees becomes more challenging. vMMR provides insights into memory usage for both DRAM and persistent memory, so that appropriate remedial action can be taken to move VMs to the proper host.

For example, if a VM encounters constant thrashing from DRAM misses, or if the persistent memory bandwidth usage is consistently high and performance reaches a critical stage as seen from the vMMR-based statistics, administrators can look at the statistics on candidate target Hosts and use VMware vMotion to migrate that VM to a Host with more DRAM and persistent memory resources available. This helps streamline operations and keeps applications running with steady performance.
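The manual remediation decision described above can be sketched as a small selection routine. This is only an illustration: the HostStats fields, the threshold values, and the pick_target_host function are hypothetical and are not part of any vSphere API. In practice the statistics would be read from the vMMR performance charts.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class HostStats:
    """Hypothetical snapshot of Host-level, vMMR-style statistics."""
    name: str
    dram_miss_rate: float       # fraction of accesses missing the DRAM cache (0.0-1.0)
    pmem_bandwidth_gbps: float  # observed PMem bandwidth in GB/s


def pick_target_host(
    hosts: Iterable[HostStats],
    max_miss_rate: float = 0.10,            # illustrative threshold, not a VMware default
    max_pmem_bandwidth_gbps: float = 10.0,  # illustrative threshold, not a VMware default
) -> Optional[HostStats]:
    """Return the least-loaded host under both thresholds, or None if none qualifies."""
    candidates = [
        h for h in hosts
        if h.dram_miss_rate <= max_miss_rate
        and h.pmem_bandwidth_gbps <= max_pmem_bandwidth_gbps
    ]
    # Prefer the lowest miss rate; break ties on the lower PMem bandwidth.
    return min(
        candidates,
        key=lambda h: (h.dram_miss_rate, h.pmem_bandwidth_gbps),
        default=None,
    )


if __name__ == "__main__":
    cluster = [
        HostStats("host-a", dram_miss_rate=0.35, pmem_bandwidth_gbps=18.0),
        HostStats("host-b", dram_miss_rate=0.04, pmem_bandwidth_gbps=3.5),
        HostStats("host-c", dram_miss_rate=0.08, pmem_bandwidth_gbps=2.0),
    ]
    target = pick_target_host(cluster)
    print(target.name if target else "no suitable host")
```

Ranking by miss rate first reflects the goal stated earlier of serving most accesses from DRAM. A different ordering could be equally reasonable for bandwidth-bound workloads.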

vMMR in Action

The figures below show the memory pane on the performance monitoring tab in vSphere.

Alarms can be set up for specific memory conditions that indicate possible issues with Host or VM performance. When these alarms are triggered, they display warning alerts to administrators in vSphere.

In this case, a custom alarm could be set up so that it is triggered by a specific DRAM condition: when DRAM bandwidth reaches 50 GB/s.
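The logic of such a threshold alarm can be sketched as follows. This is a sketch of the idea only, not the vSphere alarm API. The function name, the sample values, and the requirement that the threshold be met for several consecutive samples are assumptions made for illustration. Only the 50 GB/s figure comes from the example above.

```python
from typing import Sequence


def bandwidth_alarm(
    samples_gbps: Sequence[float],
    threshold_gbps: float = 50.0,
    consecutive: int = 3,
) -> bool:
    """Return True if bandwidth stays at or above the threshold for
    `consecutive` samples in a row (a hypothetical debounce to ignore spikes)."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    run = 0
    for value in samples_gbps:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value >= threshold_gbps else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical DRAM bandwidth samples in GB/s.
    sustained = [42.0, 51.0, 48.0, 50.0, 53.5, 55.0]  # three breaches in a row at the end
    spike = [42.0, 51.0, 48.0, 49.0]                  # one isolated breach
    print(bandwidth_alarm(sustained))
    print(bandwidth_alarm(spike))
```

Requiring several consecutive samples is one way to avoid alerting on short bursts. Setting consecutive to 1 reduces this to a plain threshold check.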

To set custom alarms and get notified on conditions such as increased PMem bandwidths, please see:

https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.monitoring.doc/GUID-D0D8090C-242A-4A21-86B4-59FC7C8A5871.html

The following figures show how to create new custom alarms along with some statistics on which these can be triggered.

The figure below shows the warning that is displayed when the alarm is triggered.

If you’d like to see a live demo of vMMR in action, check out this video on the VMware section of the Intel website:

https://www.intel.com/content/www/us/en/partner/showcase/vmware/vmworld-21-optane-pmem-mainstream-video.html

Conclusion

With vSphere vMMR, businesses can now start introducing more servers with Intel® Optane™ Persistent Memory technologies into their environment because of the added insurance provided by the monitoring and alerting capabilities from vMMR.

Starting with vSphere 7.0U3, an RPQ process is no longer required for using Intel® Optane™ Persistent Memory on servers. Instead, customers can follow a set of best practice recommendations that define how DRAM and PMem should be installed in server memory slots, along with the recommended ratios. A good place to start is the technical whitepaper from Intel, Best Practices: Intel® Optane™ Persistent Memory with VMware: https://www.intel.com/content/www/us/en/architecture-and-technology/vsphere-optane-pmem-best-practices-guide.html

vMMR enables administrators to work in a more predictable environment, since they can closely monitor memory behavior and proactively perform remediation to maintain the SLAs or minimum performance requirements expected of applications. The statistics can also help administrators during initial hardware evaluation to decide on the persistent memory and DRAM combination best suited to the customer's specific workloads. This helps reduce the overall TCO** for businesses.

vMMR also makes it easier to scale memory, as admins can systematically add large capacities while carefully monitoring the corresponding effects on performance.

Future

As part of the evolution of Intel® Optane™ Persistent Memory usage on vSphere, VMware and Intel are collaborating on several future enhancements to vSphere. At VMworld 2021, we announced Project Capitola, demonstrating some of the benefits of intelligent memory tiering using Intel® Optane™ PMem and some of the early work done on it. VMware also presented early results from benchmarks such as SPECjbb and HammerDB that showed promising performance and clear cost benefits from the software tiering enabled by Project Capitola on Intel® PMem. Recorded sessions from VMworld 2021 can also be viewed on demand at: https://www.vmware.com/vmworld/en/video-library/video-landing.html?sessionid=16208383900260013daz. Stay tuned for more information on Project Capitola and memory tiering beyond PMem in Memory Mode, with software-defined memory from VMware in 2022.

Please also check out the related VMworld 2021 sessions listed below, which cover how VMware vSphere will continue to build toward virtualizing different types of memory, along with the challenges and solutions that arise when they are used in a cluster.

Resources

VMworld 2021 sessions:

Bring Intel PMem into the Mainstream with Memory Monitoring and Remediation [MCL3014S]
How vSphere Will Redefine Infrastructure to Run Future Apps in the Multi-Cloud Era [MCL2500] video on-demand
Project Capitola: [Confidential]: Unbounding the ‘Memory Bound’ [MCL1453] video on-demand
Big Memory – An Industry Perspective on Customer Pain Points and Potential Solutions [MCL2384] video on-demand
The Big Memory Transformation [VI2342] video on-demand
Prepared for the New Memory Technology in Next Year’s Enterprise Servers? [VI2334] video on-demand
60 Minutes of Non-Uniform Memory Access (NUMA) 3rd Edition [MCL1853]
Chasing Down the Next Bottleneck – Accelerating Your Hybrid Cloud [MCL2857S] video on-demand
Implementing HA for SAP HANA with PMem on vSphere 7.0U2 [VI2331]
5 Key Elements of an Effective Multi-Cloud Platform for Data and Analytics [MCL1594] video on-demand

Persistent Memory related collateral:

Persistent Memory KB article: https://kb.vmware.com/s/article/67645
Official VMware Link for vMMR: https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.resmgmt.doc/GUID-CE019F04-DEA1-473B-ADBC-64607899BD8F.html
Persistent Memory VMware Blog: Understanding PMEM
Project Capitola Blog: https://blogs.vmware.com/vsphere/2021/10/introducing-project-capitola.html
Blog on Project Capitola and other vSphere innovations: https://blogs.vmware.com/vsphere/2021/10/how-innovations-in-vsphere-are-redefining-infrastructure-to-run-future-apps.html
Memory troubleshooting: https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.monitoring.doc/GUID-D0D8090C-242A-4A21-86B4-59FC7C8A5871.html
Intel® Optane™ Persistent Memory Quick Start Guide: https://www.intel.com/content/dam/support/us/en/documents/memory-and-storage/data-center-persistent-mem/Intel-Optane-DC-Persistent-Memory-Quick-Start-Guide.pdf
Intel Best Practices Guide: https://www.intel.com/content/www/us/en/architecture-and-technology/vsphere-optane-pmem-best-practices-guide.html
vMMR (vSphere 7.0U3) demo video on intel.com: https://www.intel.com/content/www/us/en/partner/showcase/vmware/vmworld-21-optane-pmem-mainstream-video.html

Footnotes:

(*) IDC, Data Creation and Replication Will Grow at a Faster Rate than Installed Storage Capacity According to the IDC Global DataSphere and StorageSphere Forecasts, March 24, 2021

https://www.idc.com/getdoc.jsp?containerId=prUS47560321

(**) Sample TCO Study: https://saponpower.wordpress.com/2020/02/04/does-intels-optane-dc-persistent-memory-decrease-tco-for-sap/