
Set it and Forget it

When it comes to scaling applications, let the platform do the heavy lifting

Image via Garmin.

Do you remember that old infomercial for the Ronco rotisserie oven? If you were up channel surfing at 2am during the ’90s, you couldn’t miss it. All you needed to do to make the perfect rack of lamb, we were told, was sprinkle on some seasoning, pop the meat in the oven and — all together now — SET IT AND FORGET IT!

That approach may have worked for lamb and roast chicken, but it turns out it doesn’t apply equally well to traditional application infrastructure. I guess Ron Popeil didn’t have much IT experience.

The truth is, many enterprises do provision their virtual machines (VMs) and then forget them. Often they over-provision to ensure applications maintain performance during peak demand. This is understandable, of course. A retailer can’t afford for its web application to crash during Cyber Monday, for example, when US consumers spend over $3 billion shopping online in a single 24-hour period. Better to over-provision than to risk losing out on that revenue.

The problem is that these moments of peak demand are infrequent, and many enterprises never decommission the extra capacity when demand returns to normal. The result is armies of virtual machines sitting idle, sometimes for months at a stretch, that could be put to use elsewhere.

The flip side is equally problematic. Enterprises that don’t provision for peak demand can end up getting blindsided. Some periods of peak demand are predictable, like Cyber Monday, but others are not. A celebrity tweets about his new favorite app and, boom, demand spikes ten-fold in a matter of minutes. If the team can’t add capacity fast enough, the service goes down and takes thousands of potential new customers, and their revenue, with it.


Beware Zombie VMs

For Garmin, providing a great user experience for its customers is paramount, so it made sure to provision enough capacity to handle its busiest time of year: the holidays. The maker of GPS-enabled wearables enhances its physical products with a number of online services. Runners who use Garmin’s fitness trackers, for instance, can log in to Garmin Connect and upload their activity data to track their progress over time and share it with friends. Many log on for the first time on Christmas Day, right after they unwrap their new device from Santa, resulting in a surge of traffic to Garmin’s single sign-on (SSO) application.

To ensure its users have a great first experience, Garmin had two instances of its SSO application running in two data centers, with 12 VMs per data center, according to Jonathan Regehr, a senior software engineer at the company. That was significantly more capacity than was required for most of the year, but it ensured the SSO app stayed up and humming on Christmas Day. But what about the day after that? And the rest of the year?

“The problem with doing that on VMs is you just stay that way forever,” said Regehr. “Christmas comes once a year — we all know that — so we’re scaled for that. But hey, we’re just going to keep those VMs running all year long.”

Garmin is far from alone. New research from Anthesis, an Emeryville, California-based consultancy specializing in sustainability, finds that the problem of over-provisioned VMs is widespread. According to the group’s report, around 30 percent of VMs are what it calls “zombies,” meaning they showed “no signs of useful compute activity for 6 months or more.” That translates to billions of dollars’ worth of data center capital languishing unused, Anthesis found.

The upside for enterprises that address the problem is considerable, according to Anthesis.

“The ability to eliminate [idle VMs] can result in considerable capital and operational savings when one takes into account resources needlessly wasted on power, hardware, licenses, maintenance, staffing, and floor space,” writes Jon Taylor, a partner at the Anthesis Group and lead author of the new research, in an accompanying blog post. “It can also improve data center security, since zombie servers are much less likely to have security updates.”

Identifying and decommissioning idle VMs is easier said than done, however. The reality is that most IT departments are stretched thin as it is, with other, more pressing fires to fight. That’s why there are so many zombie servers to begin with. But what if there were a way to automate the process of eliminating unneeded VMs after traffic surges subside? Of course, you’d also have to automate the process of spinning up new VMs when traffic surges in the first place. They’re two sides of the same coin; automating one without the other only solves half the problem.


Let App Autoscaler do the Work

Turns out there is a solution, one that Garmin has since put into practice. The company deployed Pivotal Cloud Foundry, a cloud-native platform purpose-built to support modern applications, which lets operations teams automate many important but non-differentiating infrastructure tasks. Among its features is the App Autoscaler.

Just as the name suggests, the App Autoscaler automates the process of adding capacity to applications running on Pivotal Cloud Foundry when traffic spikes and scaling down capacity when demand returns to normal levels. This ensures applications don’t buckle under the pressure of unexpected increases in demand and removes the need for operations teams to remember to decommission excess capacity after the fact.

“There’s a fluctuation of load that applications need to service,” said Scott L’Hommedieu, who runs product management for App Autoscaler at Pivotal. “But capacity planning is an old-school technique and it’s sort of a black art. The App Autoscaler takes a lot of the guesswork and uncertainty out of the process.”

It accomplishes this, L’Hommedieu explains, by monitoring CPU usage and adjusting application resources accordingly. But it doesn’t leave the operations team out of the loop. Ops teams can set minimum and maximum capacity thresholds either on a case-by-case basis or on a schedule. Garmin could, for example, increase maximum capacity on Christmas Day, then leave the App Autoscaler to deal with spikes and fluctuations in demand until the maximum threshold is reached. If that occurs, App Autoscaler alerts the ops team, which can then decide whether to increase the maximum threshold or take some other action.
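
To make the mechanics concrete, here is a minimal sketch of that kind of threshold-based scaling loop. It is not App Autoscaler’s actual code or API; the metric source, thresholds, and alerting hook are hypothetical stand-ins.

```python
# Minimal sketch of CPU-threshold autoscaling with operator-defined
# minimum/maximum instance limits. Hypothetical illustration only;
# this is not App Autoscaler's actual code or API.

from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    min_instances: int = 2    # floor set by the ops team
    max_instances: int = 12   # ceiling set by the ops team (raised ahead of peak days)
    cpu_high: float = 0.70    # scale out above 70% average CPU
    cpu_low: float = 0.30     # scale in below 30% average CPU


def notify_ops(message: str) -> None:
    """Stand-in for whatever alerting channel the ops team uses."""
    print(f"ALERT: {message}")


def next_instance_count(current: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Pick the instance count for the next interval based on average CPU."""
    if avg_cpu > policy.cpu_high:
        desired = current + 1    # add capacity one step at a time
    elif avg_cpu < policy.cpu_low:
        desired = current - 1    # release capacity when demand subsides
    else:
        desired = current        # within the comfortable band: do nothing

    # Never scale past the limits the operations team has set.
    desired = max(policy.min_instances, min(policy.max_instances, desired))

    if desired == policy.max_instances and avg_cpu > policy.cpu_high:
        notify_ops("At the maximum instance count and CPU is still high; "
                   "raise the ceiling or investigate the application.")
    return desired
```

Garmin’s holiday pattern maps onto the same idea: raise the ceiling ahead of Christmas Day, let the loop handle the hour-to-hour fluctuations, and drop the limits back down afterward.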

“The reality is sometimes an increase in CPU usage doesn’t necessarily mean there is a huge spike in traffic,” L’Hommedieu said. “Rather, it can indicate there is a problem with the way the application itself is configured. So with maximum thresholds in place, we can alert folks that, hey, you probably should take a look at how your application is running.”

In addition to monitoring and responding to fluctuations in CPU usage, L’Hommedieu and the product team recently added HTTP latency and HTTP throughput as additional metrics, enabling the App Autoscaler to be even more precise in how it scales application resources.

“Applications are bound by more than just CPU usage. There are disk I/O, memory, and database constraints that also impact an application’s ability to handle more requests,” L’Hommedieu said. “We found that HTTP latency and HTTP throughput are even better indicators than CPU of resource consumption on the system. In a request-based application like an API app, they are very good proxies for overall demands on the system.”
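
The same decision loop extends naturally to request-based signals. The sketch below, again purely illustrative with made-up metric names and target values, shows how per-instance HTTP latency and throughput might feed a scale-out decision alongside CPU.

```python
# Illustrative multi-metric scale-out check. The metric names and target
# values are assumptions for this sketch, not App Autoscaler's actual rules.

def should_scale_out(avg_cpu: float,
                     p95_latency_ms: float,
                     requests_per_sec_per_instance: float,
                     cpu_high: float = 0.70,
                     latency_target_ms: float = 250.0,
                     throughput_target_rps: float = 100.0) -> bool:
    """Scale out if any signal suggests the app is struggling to keep up."""
    return (
        avg_cpu > cpu_high                                        # compute-bound
        or p95_latency_ms > latency_target_ms                     # requests getting slow
        or requests_per_sec_per_instance > throughput_target_rps  # each instance overloaded
    )
```

For a request-based app, latency and per-instance throughput tend to climb before CPU does, which echoes L’Hommedieu’s point about them being better proxies for demand on the system.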


Set it and ….

With its applications now running on Pivotal Cloud Foundry, Garmin is making much better use of its application infrastructure thanks, in part, to the App Autoscaler, according to Regehr. “Autoscaling is efficient and fast,” he said. “Let Cloud Foundry do the work.”

Like all software at Pivotal, App Autoscaler continues to improve as L’Hommedieu and his colleagues add new capabilities based on customer feedback. L’Hommedieu said the team hopes to eventually develop a similar service for the platform itself, automating the process of scaling the infrastructure underlying Pivotal Cloud Foundry.

Ultimately, Pivotal Cloud Foundry and the App Autoscaler ensure enterprises aren’t caught off guard by sudden increases in traffic to their mission-critical applications, which would otherwise leave customers with a bad taste in their mouths and revenue on the table. But it also means that once ops teams determine the appropriate minimum and maximum resource thresholds for their Pivotal Cloud Foundry applications, they really can just set it … AND FORGET IT!