
Rightsizing: The Foundation for Optimizing Your Cloud Infrastructure

Most businesses understand how important it is to rightsize resources in the cloud in order to optimize their cloud infrastructure, performance, and cost-efficiency. But many don’t know how their resources became over- or under-provisioned in the first place, how to determine which resources are incorrectly provisioned, or how to prevent it from happening again.

In this article, we’ll explain what rightsizing is, why it’s important, how to identify opportunities for rightsizing, and how to implement an effective rightsizing strategy for long-lasting cost savings and operational efficiency in the cloud.

Rightsizing defined

Rightsizing is the process of matching instance types and sizes to your workload performance and capacity requirements at the lowest possible cost. Rightsizing can be defined in three steps:

  1. Analyze the utilization and performance metrics of your infrastructure, such as instances, volumes, and virtual machines (VMs)
  2. Determine whether or not they’re running efficiently, and what actions you should take to improve efficiency
  3. Modify the infrastructure as needed (upgrading, downgrading, terminating)
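
The three steps above can be sketched as a small decision function. This is only an illustration: the Asset fields, the threshold values, and the recommend function are hypothetical names and numbers chosen for the example, not settings taken from any cloud provider or from CloudHealth.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    avg_utilization: float   # step 1: average utilization of a core metric, in percent
    peak_utilization: float  # step 1: highest observed utilization, in percent


# Hypothetical thresholds; tune them to your own workloads.
IDLE_PEAK = 1.0       # never rises above this: probably unused
LOW_AVG = 20.0        # average at or below this: possibly over-provisioned
SAFE_PEAK = 45.0      # peak at or below this: halving should still leave headroom
SATURATED_PEAK = 95.0 # peak at or above this: probably under-provisioned


def recommend(asset: Asset) -> str:
    """Step 2: decide which action (step 3) to take for one asset."""
    if asset.peak_utilization <= IDLE_PEAK:
        return "terminate"
    if asset.peak_utilization >= SATURATED_PEAK:
        return "upgrade"
    if asset.avg_utilization <= LOW_AVG and asset.peak_utilization <= SAFE_PEAK:
        return "downgrade"
    return "keep"


if __name__ == "__main__":
    fleet = [
        Asset("old-test", 0.1, 0.4),
        Asset("web-1", 12.0, 35.0),
        Asset("batch-1", 60.0, 100.0),
        Asset("api-1", 45.0, 70.0),
    ]
    for asset in fleet:
        print(asset.name, recommend(asset))
```

Checking the peak as well as the average is a deliberate choice here: an asset with a low average can still have short bursts that a smaller size would not absorb.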

Benefits of rightsizing

Why should you rightsize? As you’re analyzing your cloud infrastructure, you’ll come across assets that can be downsized or terminated to save money, or upgraded to improve performance. By rightsizing your cloud resources, you’ll benefit from infrastructure optimization and reduced costs.

When an asset shows low utilization on its core performance metrics, such as 20% or less, it is usually over-provisioned for its workload. In this case, the best practice is to downgrade the asset to a smaller footprint. For example, in AWS, if you’re running a workload on an r3.2xlarge but determine via rightsizing that you could downgrade to an r3.xlarge instance without negatively impacting the workload, you can cut the operating cost of that instance by about 50%.
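
As a rough sketch of the arithmetic behind that example, the snippet below flags an asset as underutilized when every core metric is at or below a threshold, then computes the saving from moving one size down. The metric names and hourly prices are hypothetical placeholders, not real AWS rates; the only assumption carried over from the text is that the smaller size costs half as much.

```python
def is_underutilized(metrics: dict, threshold: float = 20.0) -> bool:
    """True when every core metric (in percent) is at or below the threshold."""
    return all(value <= threshold for value in metrics.values())


def savings_percent(current_hourly: float, target_hourly: float) -> float:
    """Percentage saved by moving from the current price to the target price."""
    return (current_hourly - target_hourly) / current_hourly * 100


if __name__ == "__main__":
    # Hypothetical utilization figures for one instance.
    metrics = {"vcpu": 14.0, "memory": 18.0, "network": 6.0, "disk": 11.0}

    # Hypothetical prices: the smaller size is assumed to cost half the larger one.
    larger_size_hourly = 0.60
    smaller_size_hourly = 0.30

    if is_underutilized(metrics):
        saved = savings_percent(larger_size_hourly, smaller_size_hourly)
        print(f"Downgrade candidate, estimated saving: {saved:.0f}%")
```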

It’s also a best practice to terminate “zombies,” which are assets that are running in your cloud environment but not being used. Zombies typically appear when someone forgets to turn an asset off, or when an asset fails because of script errors. Regardless of the cause, your cloud provider will continue charging for these assets because they are in a running state. By finding and terminating them, you can reduce wasted cloud costs.

Downgrading and terminating assets reduces costs, while upgrading usually increases them. However, upgrading ensures your assets are able to meet peaks and surges in demand. For example, in Azure, if you have a Standard_A2 VM with usage spikes that consistently hit 100% utilization of CPU or memory during certain times of the day, you should analyze the hourly maximum utilization throughout the day to see whether the VM requires a larger size, such as the Standard_A3, or a burstable B-Series VM, in order to optimize performance.
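
One way to look at hourly maximum utilization is sketched below. It assumes you have already exported samples as (hour of day, utilization percent) pairs from your monitoring tool; that input format and the function names are assumptions made for this example.

```python
from collections import defaultdict


def hourly_max(samples):
    """Return the highest utilization seen in each hour of the day.

    samples: iterable of (hour_of_day, utilization_percent) pairs.
    """
    peaks = defaultdict(float)
    for hour, utilization in samples:
        peaks[hour] = max(peaks[hour], utilization)
    return dict(peaks)


def saturated_hours(samples, threshold=100.0):
    """Hours of the day whose peak utilization reaches the threshold."""
    return sorted(
        hour for hour, peak in hourly_max(samples).items() if peak >= threshold
    )


if __name__ == "__main__":
    # Hypothetical CPU samples collected over several days.
    cpu_samples = [
        (9, 100.0), (9, 80.0),
        (10, 55.0),
        (14, 100.0), (14, 97.0),
        (22, 12.0),
    ]
    print("Hours hitting 100%:", saturated_hours(cpu_samples))
```

If the saturated hours are few and short, a burstable size may be enough; if they cover much of the working day, a larger size is the more likely fit.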

Provisioning a resource to match its workload is not easy

One reason is the way instance sizes are structured. Other than the customizable VM instances available on Google Cloud Platform (GCP), most instances and VMs double or halve in size (and price) with each step up or down. This means that if you have a workload that only requires 6 vCPUs and 24 GiB of memory, you’ll have to opt for an 8 vCPU/32 GiB configuration, even though 25% of the capacity you’re paying for will never be used.
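
The 6 vCPU/24 GiB example can be worked through with a short sketch. The size ladder below is hypothetical: it simply doubles at each step and keeps 4 GiB of memory per vCPU so that it contains the 8 vCPU/32 GiB configuration mentioned above.

```python
# Hypothetical size ladder as (vCPUs, memory in GiB); each step doubles the last.
SIZES = [(2, 8), (4, 16), (8, 32), (16, 64)]


def smallest_fit(required_vcpus, required_mem_gib):
    """Return the smallest size that covers both requirements, or None."""
    for vcpus, mem_gib in SIZES:
        if vcpus >= required_vcpus and mem_gib >= required_mem_gib:
            return (vcpus, mem_gib)
    return None


def unused_percent(required, provisioned):
    """Share of provisioned capacity that the workload does not need."""
    return (provisioned - required) / provisioned * 100


if __name__ == "__main__":
    size = smallest_fit(6, 24)
    print("Smallest fitting size:", size)
    print("Unused vCPU capacity: %.0f%%" % unused_percent(6, size[0]))
    print("Unused memory capacity: %.0f%%" % unused_percent(24, size[1]))
```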

Furthermore, when on-premises workloads are migrated to the cloud, configuration calculations are usually based on the configuration used on-premises. Often these calculations fail to take into account that virtual machines in the cloud are more powerful and efficient than on-premises servers, resulting in over-provisioning at the point of deployment. Even a virtual machine that was sized correctly at deployment can drift out of alignment, because demand may increase or decrease during its lifetime.

What are the metrics you need to consider when rightsizing?

On the compute side, the core metrics to take into account are vCPU utilization, memory utilization, network utilization, and disk use. It’s a best practice to define thresholds for what constitutes normal behavior for each of these metrics, because a virtual machine with 30% vCPU utilization is not necessarily a suitable candidate for downsizing to a VM half its current size.

If memory utilization, network utilization, and/or disk use is above 50% of the provisioned capacity, downsizing a VM to half its current capacity will likely affect workload performance, because the same load would then exceed what the smaller VM can provide. In these circumstances, it may be better to change the VM family from, for example, General Purpose to Compute Intensive or Memory Intensive.
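
The 50% rule can be expressed as a simple guard that runs before any downsizing decision. This is a minimal sketch: the metric names and the can_halve function are hypothetical, and the limit of 50% is the one given in the text.

```python
def can_halve(utilization, limit=50.0):
    """Check whether a VM can move to half its current capacity.

    utilization: dict mapping a metric name to its percent of provisioned capacity.
    Returns (ok, blocking), where blocking lists the metrics above the limit.
    """
    blocking = [metric for metric, pct in utilization.items() if pct > limit]
    return (not blocking, blocking)


if __name__ == "__main__":
    # Hypothetical VM: vCPU looks idle, but memory would not fit on half the size.
    vm = {"vcpu": 30.0, "memory": 65.0, "network": 20.0, "disk": 40.0}
    ok, blocking = can_halve(vm)
    if ok:
        print("Candidate for downsizing to half the current size")
    else:
        print("Do not halve; consider another VM family. Blocking metrics:", blocking)
```

Returning the blocking metrics, rather than only a yes or no, makes it easier to pick a better-suited family when halving is ruled out.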

Tools to help measure utilization metrics and effectively rightsize your cloud infrastructure

If your business operates across multiple clouds, CloudHealth is the natural choice for measuring utilization metrics and obtaining rightsizing recommendations because the CloudHealth platform collects utilization metrics from every source, collates them, analyzes them, and gives you a holistic view across your cloud environments via a single pane of glass.

Even if you don’t operate in a multi-cloud environment, the total visibility into cloud activity you obtain through the CloudHealth platform helps you identify cost drivers, performance inefficiencies, and security issues.

[Image: Resource utilization in CloudHealth with rightsizing recommendations]

A key benefit of using CloudHealth to optimize your cloud infrastructure is that the platform has policy-driven automation capabilities that can help prevent over-provisioning from occurring. The platform can be configured to alert you to opportunities to rightsize your infrastructure as they occur, so you don’t have to go looking for them after receiving a higher-than-anticipated cloud bill.

To find out more about the CloudHealth platform, do not hesitate to get in touch and speak with our team of cloud experts. Our team will be happy to organize a free demo of the platform so you can see the level of visibility CloudHealth provides across multiple clouds and also so you can see how the concept of policy-driven automation works in practice.

For even more information and best practices around cloud infrastructure optimization, see our in-depth whitepaper: Benchmark Your Cloud Maturity: A Framework for Best Practices