bosh cloud native operations

Automated Ops; Freedom To Innovate

This is part of a series on Cloud-Native operations.

In 2013 an Oxford University study found that up to half of all jobs in the US could be eliminated by automation by 2033. The risk varies by occupation; occupational therapists can rest easy while telemarketers would be wise to have a plan B ready. Skill and pay levels turn out not to be super-reliable predictors either, as shown in this handy-dandy interactive visualization. The picture in developing economies is even more concerning, with up to 85% of jobs at risk.

Technology has been changing jobs for centuries, spawning many arguments and academic theories about technological unemployment, including the "humans vs robots" zero-sum game and the impact on skills, productivity and prosperity. However, it's clear that those who embrace and master the new technology will be well placed for future opportunities. And the Jevons paradox suggests that as automation makes IT cheaper to deliver, demand for IT will keep rising, bringing continued employment for ops teams with it.

From a recent McKinsey research article on automation:

As roles and processes get redefined, the economic benefits of automation will extend far beyond labor savings. Particularly in the highest-paid occupations, machines can augment human capabilities to a high degree, and amplify the value of expertise by increasing an individual’s work capacity and freeing the employee to focus on work of higher value.

For operations teams, increased automation goes hand-in-hand with DevOps practices to address the exploding demands of digital business. It's a unique opportunity to stop drowning in technical debt and use the power of machines to manage machines, enabling ops teams to scale and better serve their business customers at the same time. In my previous post I explored this Cloud-Native Ops opportunity. In this post I'll look in more detail at Cloud-Native Automation and why you should embrace it rather than fear it. The experience at Google is instructive:

The main upshot of this new automation was that we had a lot more free time to spend on improving other parts of the infrastructure. Such improvements had a cascading effect: the more time we saved, the more time we were able to spend on optimizing and automating other tedious work. Eventually, we were able to automate schema changes, causing the cost of total operational maintenance of the Ads Database to drop by nearly 95%

Google SRE book

Automation or autonomous systems?

Computer automation is nothing new; the shift to machines taking on previously manual tasks has been ongoing for decades. In the IT realm, admins used to manually stand up servers with a physical runbook: a sequence of tasks that creates a known state. Popular configuration management solutions like Chef and Puppet got their start as automated runbooks designed to reliably install and configure software to create a known initial state for a single server.

This approach is akin to the house building robot that lays bricks at the construction site, replacing a human construction worker.

Recapping, the key benefits of automation vs human management are well understood:

  • Consistent repetition without human error. A person who performs a manual action hundreds of times won't perform it the same way every time; a machine will. That lack of consistency causes data quality issues and reliability problems.

  • Faster reaction, mitigation, and more reliable services. Automation responds more quickly than a human can, so services recover from and survive failures that previously would have caused outages.

  • Ability to scale without additional human resources. Automation is a force multiplier, abstracting away low-level infrastructure detail and enabling individuals to manage more and larger workloads.

But in the age of cloud-native, we've moved beyond individual servers and monolithic applications. The seemingly unlimited resources and agility on offer in the public cloud, together with containerization and microservice architectures, mean we're now managing complex distributed systems, at scale. More specifically, our focus has shifted from managing servers to managing complex services that have little human management precedent to draw on. Incremental automation delivers value, but only in the way a better buggy whip does; maybe it’s time to invent the car.

Returning to our house building robot scenario, here’s how one company applied automation to change the entire process of house building, from design through permit, factory build, and onsite construction, to deliver a better end customer experience.

Day 1 – the end of the beginning

Managing IT services is not simply a “Day 1” task of provisioning infrastructure, configuring servers and deploying applications and services. We need a solution that also helps with the “Day 2” activities that affect the running service.

Operating at scale requires more than simple automation of individual server configuration runbooks, with humans still deciding what needs to run and when. We need to move from automation to autonomous systems: systems that have enough information about the desired state to make decisions and take restorative actions that remediate out-of-line situations without human intervention.

Increasing to a scale and complexity beyond the capability of human management means adopting a different approach. Machines managing machines means a change from procedural automation to autonomous, goal-seeking systems requiring minimal human oversight and intervention.
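To make the difference between procedural automation and a goal-seeking system concrete, here is a minimal sketch of a desired-state loop. Everything in it is hypothetical and for illustration only: `State`, `FakeCluster`, `plan` and `reconcile` are invented names, and the "cluster" is just an in-memory object standing in for real infrastructure. It is not how any particular product implements this.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    """Hypothetical description of a service: its version and instance count."""
    version: str
    instances: int


class FakeCluster:
    """In-memory stand-in for real infrastructure (an assumption of this sketch)."""

    def __init__(self, state: State):
        self.state = state
        self.log = []  # every corrective action taken, in order

    def observe(self) -> State:
        return self.state

    def apply(self, action) -> None:
        kind, value = action
        if kind == "set_version":
            self.state = replace(self.state, version=value)
        elif kind == "set_instances":
            self.state = replace(self.state, instances=value)
        self.log.append(action)


def plan(desired: State, observed: State) -> list:
    """Compare desired and observed state; return the actions that close the gap."""
    actions = []
    if observed.version != desired.version:
        actions.append(("set_version", desired.version))
    if observed.instances != desired.instances:
        actions.append(("set_instances", desired.instances))
    return actions


def reconcile(desired: State, cluster: FakeCluster, max_rounds: int = 5) -> bool:
    """Observe, plan, act, and repeat until the observed state matches the goal."""
    for _ in range(max_rounds):
        actions = plan(desired, cluster.observe())
        if not actions:
            return True  # converged: nothing left to fix
        for action in actions:
            cluster.apply(action)
    return False  # gave up; this is where a human would be alerted
```

The operator only states the goal, for example `reconcile(State("1.1", 3), cluster)`. If an instance later dies, running the same call again works out the repair on its own; no one has to pick the right runbook.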

“For [Google] SRE, automation is a force multiplier, not a panacea … While we believe that software-based automation is superior to manual operation in most circumstances, better than either option is a higher-level system design requiring neither of them – an autonomous system.”

Google Site Reliability Engineering Book, Evolution of Automation at Google Chapter

Public cloud and SaaS vendors like Amazon, Microsoft and Google baked this technology into their datacenters from the get-go, realizing that their desired scale and reliability would be unachievable any other way. Now we can benefit from trickle-down technology too!

And the software that runs on this infrastructure has been revolutionized by SaaS vendors like Facebook and Salesforce and by the public cloud vendors’ own offerings. Always-on, secure, responsive services that are updated multiple times each day require a new architecture, process, and culture from both operations and development teams.

One concrete example of this is how we're transforming hardware into software: compute, network and storage provisioned through APIs; infrastructure defined and controlled through code that replaces manual operations and runbooks; code that needs to live in a source-code repository and needs versioning, testing and backup… Sound familiar? In the land of the API, development skills reign supreme, hinting at the future skillset for operations teams.
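As a rough illustration of infrastructure as code, the sketch below describes a small web tier as plain data and checks it with an ordinary function, the kind of thing that can be committed, reviewed and tested like any other source file. The names (`VmSpec`, `WEB_TIER`, `validate`) and the fields are assumptions made up for this example, not any cloud provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    """Hypothetical description of one virtual machine."""
    name: str
    cpus: int
    memory_gb: int
    network: str


# The "infrastructure" is just data in a file that lives in version control.
WEB_TIER = [
    VmSpec(name=f"web-{i}", cpus=2, memory_gb=4, network="public")
    for i in range(3)
]


def validate(specs: list) -> list:
    """Return human-readable problems; an empty list means the spec looks sane."""
    problems = []
    names = [spec.name for spec in specs]
    if len(names) != len(set(names)):
        problems.append("duplicate VM names")
    for spec in specs:
        if spec.cpus < 1 or spec.memory_gb < 1:
            problems.append(f"{spec.name}: cpus and memory_gb must be positive")
    return problems
```

A check like `validate(WEB_TIER)` can run in the same pipeline as the application's unit tests, so a bad change to the infrastructure definition is caught before anything is provisioned.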

Abstraction is for Ops too!

Developers are familiar with the concept of abstraction. Generations of application platforms have hidden the technical infrastructure to make developers more productive coders. RedMonk’s Steve O’Grady summarizes this timeline in this blog post.

“Abstraction is a force of nature in this industry, and if anything it’s getting stronger. Those who fail to recognize this will be supplanted by those that do.”

Steve O'Grady, RedMonk

But on the operations side of the house, the idea and practice of abstraction has been slow to gain traction.

In Part 2 of this blog post I look at automation in a cloud-native platform and how this abstraction makes ops more productive, with a focus on how BOSH does this in Cloud Foundry.