
Reviewing Cloud Costs From An Engineering Director’s Perspective

At CloudHealth by VMware, we believe in using our own platform to manage our internal costs at the team level. In this exercise, I’ll analyze the spend for two CloudHealth engineering teams and show you where we can take advantage of some cost-saving opportunities—and hopefully inspire you to think about where you can reduce cloud costs in your own organization.

The cloud cost breakdown for our team

To establish a baseline understanding of cost, we can use CloudHealth to filter by Perspective (in this instance, by team) for all assets owned by Team 1 and Team 2. We can then narrow the date range to include data for just the month of May. Segmenting our data in this way is possible thanks to CloudHealth’s tagging capabilities and a tagging policy that we adhere to strictly.

A cost history report, for both staging and production environments, shows the following:
engineer-cost1.png

engineer-cost2.png

You should disregard any “RI Prepay” line items in the cost history report because these are one-time RI purchases and will skew the results. It’s better to use the Amortized Cost History, which spreads each RI purchase over the term it was purchased for (1 or 3 years) instead of showing one large “RI Prepay” item.
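As a rough illustration of why the amortized view is smoother, the sketch below spreads a single upfront RI payment evenly over its term. The $3600 purchase is a made-up figure, and even monthly spreading is an assumption for illustration, not necessarily how CloudHealth calculates amortization.

```python
def amortized_monthly_cost(upfront_payment: float, term_years: int) -> float:
    """Spread a one-time RI prepayment evenly across the months of its term."""
    return upfront_payment / (term_years * 12)


# Hypothetical purchase: $3600 paid upfront (not a figure from our reports).
one_year = amortized_monthly_cost(3600.00, term_years=1)
three_year = amortized_monthly_cost(3600.00, term_years=3)

print(f"1-year term: ${one_year:.2f} per month")
print(f"3-year term: ${three_year:.2f} per month")
```

In the unamortized report, the same purchase would show up as one $3600 spike in the month it was bought.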

Looking at the data above, we can see that we spend $6811.88 per month on various AWS services. The two largest by far are EC2 Compute ($2716.81) and S3 – API ($1413.60). Annualized, that monthly spend comes to $81742.56.

There’s also an RDS cost of $1191.89, which is attributed to the benchmark pricing database, provisioned as a Multi-AZUsage:db.m5.2xl RDS instance. We need to check whether this is overprovisioned. Stay tuned, as I’ll be discussing this in a future blog post.

If we look at the amortized cost, there are some credits applied for usage (-$871.35), but also $2104.10 per month of amortized cost from the upfront RI purchases.

engineer-cost3.png

This means the fully loaded monthly cost is $8044.62. There is nothing we can do to reduce the prepaid amount, so we will focus on reducing any monthly costs not covered by RIs and on rightsizing our instances so that we reduce our future spend on RIs.
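The fully loaded figure can be reproduced from the three numbers quoted so far. This is a quick check rather than anything the platform requires; the variable names are mine, and the sum lands within a cent of the $8044.62 shown in the report, presumably because the displayed line items are rounded.

```python
# Figures quoted in the cost history and amortized cost reports above.
monthly_service_spend = 6811.88  # monthly spend across AWS services
usage_credits = -871.35          # credits applied for usage
amortized_ri_prepay = 2104.10    # monthly share of upfront RI purchases

fully_loaded_monthly = monthly_service_spend + usage_credits + amortized_ri_prepay
print(f"Fully loaded monthly cost: ${fully_loaded_monthly:.2f}")

# Annualize the figure as quoted in the report.
fully_loaded_annual = 8044.62 * 12
print(f"Fully loaded annual cost: ${fully_loaded_annual:.2f}")
```

Annualizing the quoted $8044.62 gives $96535.44, which is the spend figure used in the summary at the end of this post.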

That will be done in a series of steps: identify the highest costs, use CloudHealth to optimize costs (e.g. rightsizing), and then do a human assessment of how additional costs can be saved through re-architecture and waste reduction, for example by using Auto Scaling groups.

There are also additional centralized costs not captured here, such as Redis Cache and RabbitMQ. Those will be managed by a centralized team as part of its own cost breakdown.

Options for cloud cost optimization (CloudHealth recommendations)

Using the CloudHealth Platform, we can optimize our EC2 spend in two ways:

Option 1: We can purchase RIs or Savings Plans to reduce our spend. However, that option is better managed at the organizational level than at the team level, so that we can maximize the benefit of floating convertible RI or Savings Plans coverage.

Option 2: We can rightsize the instances. You can identify rightsizing opportunities by running a rightsizing report, filtered on the perspective group Team and selecting Mates.

There are 74 EC2 instances, the vast majority of which can be downsized, saving $10- per month. Before we rightsize these instances, however, we have to check whether they are covered by reservations. Because almost all of them are covered by reservations, there are no true savings to be had here.

That being said, it’s still a best practice to only provision what you need—efficient provisioning will reduce wasted spend. This will also free up some RIs for other teams, and result in a lower RI renewal in the future.

engineer-cost4.png

Now we can see how high our RI coverage is for these workloads.

To understand your On-Demand compute costs, use the EC2 Instance Report: change to monthly, filter down to your team and month, change the Y-Axis to Compute Cost($), and look at the Compute Cost($) tab.
engineer-cost5.png

You can see we have very few resources running On-Demand, which means we have good RI coverage for all these EC2 instances. The next step is to work directly with experts and those closest to the workloads to understand whether we can optimize them further.

Options for cloud cost optimization (recommendations from our tech lead)

Taking the data in the CloudHealth platform and combining it with expert knowledge is the right way to optimize cost.

Recommendations from our Tech Lead include:

  1. The biggest savings recommendation is to set up a scaling policy for PerfMon workers to flex based on queue size. The potential savings cover 15 x r5.large instances, which amounts to $1383.60 per month.
  2. Move one processing job for Data Center to Resque to remove two instances (r5.xlarge), amounting to $368.94 per month. These can also be downsized since there is no need for xlarge, which would save $184.48 per month.
  3. The S3 API cost is likely the Benchmark Pricing service making many calls. A detailed analysis shows the cost comes from calls between Benchmark Pricing and the migration services (HealthCheck), and Benchmark Pricing is responsible for 100% of it; the migration service calls fall within the free tier. The recommendation here is that Benchmark Pricing should zip all the assets per customer before storing them in S3. Benchmark Pricing currently generates a separate file for each asset, so a customer with 1000 assets generates over 1000 writes. This causes 100000+ API requests to S3, followed later by 100000+ read API operations to import that data into the DB. If we zipped all the assets into a single file per customer, the number of API requests would drop significantly. We would likely save the full $1413.60 since this would reduce API calls by a massive amount.
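To make the third recommendation concrete, here is a minimal sketch of bundling one customer's asset files into a single zip archive before upload. It uses only the Python standard library and stops short of the S3 call itself; the asset names, payloads, and the `bundle_assets` helper are hypothetical and not taken from the Benchmark Pricing code.

```python
import io
import zipfile


def bundle_assets(assets: dict[str, bytes]) -> bytes:
    """Pack every asset file for one customer into a single in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in assets.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Hypothetical customer with 1000 small asset files.
assets = {f"asset-{i:04d}.json": b'{"price": 0}' for i in range(1000)}

uploads_per_asset = len(assets)  # today: one S3 write per asset file
bundle = bundle_assets(assets)
uploads_bundled = 1              # proposed: one S3 write per customer

# Confirm nothing was lost by reading the archive back.
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    entries_in_bundle = len(archive.namelist())

print(f"Writes without bundling: {uploads_per_asset}")
print(f"Writes with bundling: {uploads_bundled} ({entries_in_bundle} assets inside)")
```

The import side would see the same reduction, since it would fetch one object per customer and unpack it locally instead of issuing one read per asset.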

What are our options? The cost of resolution

Below are three recommendations, each of which assumes an engineering week costs $2500:

  1. Recommendation 1 would take 1 engineering week ($2500) to resolve, with savings of $1383.60 per month, which means it will pay for itself after 2 months. This recommendation is well worth the effort.
  2. Recommendation 2 would take 3 weeks ($7500) of effort to resolve, and moving to Resque would only save $368.94 per month, meaning it would take close to 2 years to recoup the cost, which is not worth the investment. However, downsizing to r5.large is only 15 minutes of work and would save $184.48 per month (or $2213.76 per year) if these instances are not covered by RIs or Savings Plans.
  3. Recommendation 3 is roughly estimated at two 2-week sprints ($10000), which would be recouped in just over 7 months given the $1411.23 savings per month. This recommendation is also worth the investment given the big savings once you’ve reached month 8.
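The break-even estimates above are simply effort cost divided by monthly savings. The sketch below repeats that arithmetic with the figures from the list; the function name and labels are mine, and the 15-minute downsizing is left out because its effort cost is negligible.

```python
ENGINEERING_WEEK_COST = 2500.00  # assumption stated in the text above


def payback_months(effort_weeks: float, monthly_savings: float) -> float:
    """Months of savings needed to cover the engineering effort."""
    return effort_weeks * ENGINEERING_WEEK_COST / monthly_savings


# (effort in engineering weeks, savings in dollars per month)
options = {
    "1 (scale PerfMon workers)": (1, 1383.60),
    "2 (move job to Resque)": (3, 368.94),
    "3 (zip assets per customer)": (4, 1411.23),
}

payback = {name: payback_months(weeks, savings) for name, (weeks, savings) in options.items()}
for name, months in payback.items():
    print(f"Recommendation {name}: {months:.1f} months to break even")
```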

Total savings come to $35751.72 per year on an annual spend of $96535.44, which represents a savings of 37% of the cost.
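For anyone who wants to check the total: it works out if we assume it combines recommendation 1, the quick downsizing from recommendation 2, and recommendation 3, measured against the fully loaded monthly cost of $8044.62. That breakdown is inferred from the numbers above rather than read from a report.

```python
# Monthly savings: recommendation 1, downsizing from recommendation 2, recommendation 3.
monthly_savings = [1383.60, 184.48, 1411.23]

annual_savings = sum(monthly_savings) * 12
annual_spend = 8044.62 * 12  # fully loaded monthly cost, annualized
savings_share = annual_savings / annual_spend

print(f"Annual savings: ${annual_savings:.2f}")
print(f"Annual spend: ${annual_spend:.2f}")
print(f"Savings as a share of spend: {savings_share:.0%}")
```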

I hope this insight into how we’re using CloudHealth internally to optimize our cloud costs helps you do the same in your organization. Stay tuned for my next blog post, where we’ll look at how to optimize RDS costs.