Every Communication Service Provider (CSP) would like to run their environment more efficiently, and not just to lower operating expenses. For many, reducing energy consumption and carbon emissions, and generally improving sustainability, have become top organizational priorities. The question is how to actually do it.
As we discussed in our previous blog, you don’t have to wait for transformative new technologies to start making progress; there are opportunities to achieve concrete efficiency gains right now. In fact, VMware offers a variety of tools to fine-tune workloads and host configurations that, when added together, can deliver significant improvements.
VMware ESXi and vSphere performance-tuning options allow you to change parameters to adjust the balance of performance vs. efficiency for different workloads. But the reality is that not all optimizations are worthwhile (or even possible) for all types of workloads, at all times. Indeed, performance-tuning matters much more for some workloads than others.
When we do onboarding work for customers, we classify workloads into four major categories:
- User Plane-intensive workloads that directly transfer real-time subscriber content and communications. Examples include Distributed Unit (DU) workloads in the Radio Access Network (RAN), as well as 5G Core network functions like User Plane Function (UPF). These workloads typically require both very low latency and very high throughput.
- Signaling workloads that handle communications to establish, control, and manage both internal and subscriber-facing services (such as those used to establish a call or facilitate roaming). From a performance perspective, different signaling workloads may have different CPU requirements, but they all tend to be real-time or near-real-time scenarios.
- Control Plane workloads handle the setup and management of various network functions and services. Examples include scenarios like a RAN Centralized Unit (CU) managing different DUs, as well as all 5G Core supporting functions, such as Access and Mobility Management Function (AMF). These workloads often have strict latency requirements but tend to be less throughput-intensive.
- “Everything else,” a catch-all category that includes Operations/Business Support Systems (OSS/BSS), subscriber databases, analytics systems, and IT workloads; basically everything that isn’t real-time. Today, this type of workload might be an AI/ML application for analyzing customer data, for example. But it also includes all the classical IT and business applications that any business, including a CSP, depends on: Customer Relationship Management (CRM), ticketing systems, the organization’s website, and many others.
Let’s take a closer look at these workload types. As we’ll see, opportunities exist in multiple areas to use strategies like Distributed Power Management (DPM), oversubscription, Host Power Management, and other VMware features to optimize CPU resources and reduce power consumption.
Signaling Workloads
Signaling workloads typically offer little opportunity to optimize CPU resources. While the signaling plane itself is not especially throughput-intensive, these workloads still require very low latency to meet the narrow reply windows imposed by various signaling protocols and, ultimately, the physics of RF propagation. Even when power-saving interventions are technically feasible, they can add significant jitter and latency, making the system more likely to miss those windows, so we generally configure the highest-performance settings.
User Plane Workloads
User Plane-intensive telco workloads typically demand both low latency and very high throughput, so they tend to offer fewer opportunities for efficiency-tuning. For Core workloads in particular, most CSPs focus almost exclusively on performance. The key question is usually whether to add more servers to ensure reliable subscriber experiences. Even when teams know these servers will be underutilized, they believe the risk of performance loss outweighs any potential benefits of energy-saving features. After all, even large energy savings won’t mean much if the system keeps dropping calls.
User Plane workloads in the RAN have similar characteristics, but here we do find opportunities to optimize via hardware selection. You should be asking, “What’s the most energy-efficient CPU I can get that still meets my packet-per-second requirements for this region?” For example, don’t just buy the server with 20 Network Interface Controllers (NICs); use one with the minimum number of NICs you need. See if single-socket systems are viable for your workloads. And try to identify the least powerful CPU you can get away with, because a server’s Thermal Design Power (TDP) and cooling requirements depend on its loadout. Even savings of only a few watts per base station add up to huge efficiencies at the scale of the RAN.
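To see how quickly that adds up, here is a minimal back-of-the-envelope sketch; the per-site wattage and base-station count below are illustrative assumptions, not measured figures:

```python
# Back-of-the-envelope estimate of fleet-wide savings from small per-site gains.
# All numbers below are illustrative assumptions, not measurements.
watts_saved_per_site = 15        # e.g., a lower-TDP CPU and fewer NICs per server
num_base_stations = 10_000       # a mid-sized RAN footprint
hours_per_year = 24 * 365

kwh_per_year = watts_saved_per_site * num_base_stations * hours_per_year / 1000
print(f"Estimated savings: {kwh_per_year:,.0f} kWh per year")
# Prints roughly 1,314,000 kWh (about 1.3 GWh) per year:
# a few watts per site becomes gigawatt-hours at RAN scale.
```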
Control Plane Workloads
Control Plane workloads can have fairly strict latency parameters to meet, though generally not as strict as User Plane workloads. However, they have vastly lower overall throughput, so in general, much lower CPU utilization. As a result, these workloads present opportunities to de-schedule specific cores to optimize power efficiency in ways that wouldn’t be possible for User Plane workloads.
For Control Plane workloads running on later versions of ESXi, for example, you can use selective latency sensitivity. In the past, it was already possible to grant cores exclusively to a given virtual machine (VM), but you had to pin all vCPUs when using that feature. Since vSphere 7.0 U1, you can pin only a subset of a VM’s vCPUs. Unlike a DU or User Plane workload, where most cores run DPDK poll threads, workloads that require exclusive affinity for only a few vCPUs are viable targets for performance-tuning to optimize utilization and power efficiency.
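As a starting point, here is a minimal pyVmomi sketch that sets a VM’s latency sensitivity to high. The vCenter address, credentials, and VM name are placeholders, and the selective (per-vCPU) variant introduced in vSphere 7.0 U1 is layered on top of this through additional VM advanced settings per VMware’s documentation:

```python
# Minimal pyVmomi sketch: set a Control Plane VM's latency sensitivity to "high".
# The vCenter address, credentials, and VM name below are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab shortcut; validate certificates in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret",
                  sslContext=ctx)
content = si.RetrieveContent()

# Locate the target VM by name.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "cp-amf-01")
view.DestroyView()

# Reconfigure the VM. Latency sensitivity "high" generally also requires full
# CPU and memory reservations; the selective per-vCPU variant (vSphere 7.0 U1+)
# is applied through additional VM advanced settings per VMware's documentation.
spec = vim.vm.ConfigSpec()
spec.latencySensitivity = vim.LatencySensitivity(level="high")
WaitForTask(vm.ReconfigVM_Task(spec=spec))

Disconnect(si)
```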
Everything Else
As in any large organization, telco workloads for classical IT and business applications offer the greatest opportunity for optimization. For most CSPs, the best target to start with is OSS/BSS. These workloads can have quite a large footprint, but they typically don’t require the same level of performance, in either throughput or latency, as User Plane or Control Plane workloads. As a result, they offer opportunities to take advantage of a much wider range of techniques and VMware features to consolidate resources and improve power efficiency.
Many of these approaches have been used by enterprises for years to adjust power profiles, consolidate workloads, and generally optimize the performance and utilization of data center resources. So, while some may not yet be widely deployed in telecom environments, they are mature, proven techniques that can be applied to any enterprise IT-like workload. Examples include:
- CPU Power Management Policies: vSphere offers the option to control certain CPU features via Host Power Management Policies. While many telecom workloads obviously do need to prioritize performance, we find that some telco organizations default to the highest BIOS and ESXi power settings for everything. For many workloads in the “Everything Else” category, however, especially OSS/BSS, a “Balanced” or even “Low Power” policy, applied at certain times or in test environments, can dynamically throttle the CPUs when a server is underutilized or completely idle. The basis is a BIOS configuration that matches the maximum performance requirements but gives ESXi full runtime control over the subset of controllable features. This allows a more power-efficient default without sacrificing the ability to deliver the same, deterministic performance when and where necessary. (A configuration sketch follows this list.)
- Host consolidation: Many telco organizations have historically avoided overcommitting server CPUs, but consolidation is absolutely possible. Of course, you typically wouldn’t overcommit CPU for User Plane or even Control Plane workloads. But for OSS/BSS and similar systems, start by reviewing the actual utilization ratio of your servers, and you may find many opportunities. That might mean rightsizing workloads that have sprawled beyond what is necessary due to unchecked utilization assumptions, or moving from planning around only guaranteed resources toward capacity planning based on usable resources. Those hardware savings reduce CapEx, but they also translate to significant ongoing power savings: if you can eliminate a single server by improving your consolidation ratio, you immediately save at least 200 W. (Use VMware Aria Operations to identify what resources different workloads actually need, and start identifying potential targets for consolidation.)
- Distributed Power Management: Even after consolidating physical servers, you may find opportunities for additional power savings using VMware DPM, an optional feature of the vSphere Distributed Resource Scheduler (DRS). DPM can dynamically consolidate workloads even further during periods of low utilization, migrating VMs onto fewer ESXi hosts and shutting down those that are unneeded. Depending on the environment, DPM can run fully automated and be further extended through Predictive DRS, or it can be scheduled for specific hours only. (See our recent blog post for more, as well as this white paper for technical details; a sketch for enabling DPM follows this list.)
- Test/Lab/Validation Environments: The intermittent nature of lab environments lends itself well to a wide range of efficiency techniques. Significant savings can be achieved when lab environments are not in use.
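To make the power-policy idea concrete, here is a minimal pyVmomi sketch that switches an ESXi host to the Balanced policy. The vCenter address, credentials, and host name are placeholders, and the shortName mapping (“static” for High Performance, “dynamic” for Balanced, “low” for Low Power) is an assumption you should verify against the host’s own availablePolicy list:

```python
# Minimal pyVmomi sketch: switch an ESXi host to the "Balanced" power policy.
# The vCenter address, credentials, and host name below are placeholders, and the
# shortName mapping ("static"/"dynamic"/"low") should be verified on your hosts.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab shortcut; validate certificates in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret",
                  sslContext=ctx)
content = si.RetrieveContent()

# Locate the target ESXi host by name.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esxi-ossbss-01.example.com")
view.DestroyView()

power = host.configManager.powerSystem
for policy in power.capability.availablePolicy:
    print(policy.key, policy.shortName)   # inspect what this host actually offers

# "dynamic" is the shortName ESXi commonly uses for the Balanced policy.
balanced = next(p for p in power.capability.availablePolicy if p.shortName == "dynamic")
power.ConfigurePowerPolicy(key=balanced.key)   # applies immediately, no reboot required
print("Current policy:", power.info.currentPolicy.shortName)

Disconnect(si)
```

The same call can be scripted on a schedule, for example dropping OSS/BSS hosts to a lower-power policy overnight and restoring the high-performance policy before the business day.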
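And here is a similar sketch for enabling DPM in fully automated mode on a cluster, again with placeholder names and credentials, and assuming DRS is already enabled on that cluster:

```python
# Minimal pyVmomi sketch: enable DPM in fully automated mode on a cluster.
# The vCenter address, credentials, and cluster name below are placeholders;
# DRS should already be enabled on the cluster for DPM to take action.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab shortcut; validate certificates in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret",
                  sslContext=ctx)
content = si.RetrieveContent()

# Locate the target cluster by name.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "oss-bss-cluster")
view.DestroyView()

# DPM settings live in the cluster configuration spec.
spec = vim.cluster.ConfigSpecEx()
spec.dpmConfig = vim.cluster.DpmConfigInfo(
    enabled=True,
    defaultDpmBehavior="automated")   # "manual" only recommends host power actions
WaitForTask(cluster.ReconfigureComputeResource_Task(spec=spec, modify=True))

Disconnect(si)
```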
Looking Ahead
We shouldn’t overlook some of the exciting innovations on the horizon that will help CSPs achieve big sustainability gains. To name just a couple, VMware is working on solutions to offload traffic to Data Processing Units (DPUs), which can enable power savings in every RAN base station. VMware also continues to lead in RAN Intelligent Controller (RIC) innovations that can enable more granular power management within the radio head itself. For example, one rApp currently in trials will allow CSPs to turn off individual radio channels when not in use, directly addressing the part of the RAN that accounts for the vast majority of a CSP’s overall energy consumption. Looking further into the future, enabling workloads to use CPU power management capabilities like P-states and deep C-states at the guest level can drive significant benefits for optimized PF, CU, and DU workloads.
All of these innovations build on a common theme: at the scale of thousands of base stations, even small efficiencies add up to massive improvements in energy consumption and emissions. But there’s no reason to stand still while we prepare for tomorrow’s RAN. CSP core networks and data centers are sizeable environments in their own right. By taking advantage of some of the ESXi and vSphere features described in this blog, we can start building the greener telco environments of the future, today.
Ready to discuss your specific scenario? Reach out to talk to an expert.