In the last 10 years, I’ve had the privilege of working on cloud security at Microsoft, Google, and Pivotal. Most recently at Pivotal, I’ve worked closely with some of the most forward-looking enterprises in the world. All of them want to deliver applications at a faster pace, and they are willing to try new tools, techniques, and processes to get there. I’ve observed another common trait — security is a top concern, both with their existing infrastructure and their next generation cloud infrastructure.
Behaviorally, there is an instinctual reach for previously defined tools and methodologies to help ensure the appropriate level of security. Often they are calcified within the organization. Some are helpful, some are not. In this post, I’ll describe what I believe to be the single most important concept for an enterprise security organization to grasp when evaluating cloud infrastructure. It’s a radical change from the status quo, but I believe it will dramatically and immediately improve the security posture of any IT organization.
The idea is quite simple. Rotate datacenter credentials every few minutes or hours. Repave every server and application in the datacenter every few hours from a known good state. Repair vulnerable operating systems and application stacks consistently within hours of patch availability. Faster is safer. It’s not a fantasy; the tools exist to make most of this a reality today. Do it, and you’ll see a dramatic improvement in enterprise security posture.
Before I describe why and how this works, let’s first take a step back and look at today’s enterprise security culture.
The Trap of Resisting Change to Mitigate Risk
The sad truth is that the foundation of traditional enterprise infrastructure centers on resisting change. Firewall rules, long-lived TLS credentials, and hard-to-update databases all support that claim. It’s natural to expect that an enterprise security team, after decades of consuming infrastructure that resists change, would develop a culture that also resists change. It’s a seller/buyer socio-technical system. Traditional approaches force enterprises to choose between moving fast and accepting unbounded risk, or slowing down and trying to mitigate risk. Everybody chooses to slow down. It’s an easy decision when it’s your job to protect an organization.
The Dreaded Mega-Breach
At or near the top of security concerns in the datacenter is something called an Advanced Persistent Threat (APT). An APT gains unauthorized access to a network and can stay hidden for a long period of time. Its goal is usually to steal, corrupt, or ransom data.
It’s the dreaded, front-page newsworthy mega-breach.
A lot has been written about the anatomy of an APT. Unfortunately, APT has become an umbrella buzzword, so it will likely mean different things to different people. In this post, I’m focusing on an attack that worms its way into the datacenter, sits in the network, observes, and then does something malicious. To avoid the hype, I’ll simply call it an attack and leave the labeling to the reader.
I believe these types of attacks need at least three resources in order to blossom — 1) time, 2) leaked or misused credentials, and 3) misconfigured and/or unpatched software. Time gives the malware more opportunity to observe, learn, and store. Credentials provide access to other systems and data, possibly even an ingress point. Vulnerable software provides room to penetrate, move around, hide, and gather more data. These are like sunlight, water, and soil to a plant. Remove one or more and it’s not likely to mature.
Now, consider the relationship between the calcified socio-technical system and attacks. First and foremost, there’s lots and lots of time. For example, credentials seldom rotate. So, if an attacker can find some, they are likely to remain valid and useful for a long time. As well, it often takes months to deploy patches to operating systems and application stacks, even in a virtualized world. It’s not uncommon for an enterprise to leave a server vulnerable for six months or more. Almost no one regularly repaves their servers or applications from a known good state. Instead we often apply incremental changes, so the slate almost never gets wiped clean. Traditional enterprise software vendors and the trap of the rigid enterprise create the rich, fertile, undisturbed pastures where attacks flourish.
The Acme Pattern
To get a clearer picture of what this means in real terms, let’s look at the accreditation process at a fictional Acme Corporation. The process is there for a good reason, but it also has a nasty side effect.
Let’s say Acme has an enterprise accreditation process that takes two months, and it’s required on every major software release. It’s there to ensure baseline security standards and to ensure the new version doesn’t break existing systems. If a software vendor releases version 2.0 in January, Acme starts the accreditation process in early February. There’s a minor hiccup, so the process doesn’t complete until mid-April. Installation is planned for June, and it takes one month to complete. The total delay is at least six months — plenty of time to give an attack all the resources it needs to transition from a seedling into a monster.
There’s more. Acme will often push back on the software vendor to keep prior versions of their software patched for a long period of time. This further complicates improvements because the software vendor must devote non-trivial resources to this effort. Those resources can’t be used to improve the product, so releasing new versions of software can take even longer.
The cycle perpetuates and grows, and inadvertently feeds the attacker. The cycle needs to be broken.
Faster is Better
What’s the industry’s response to this phenomenon? I’ll give you a clue — it isn’t working.
Enter a parade of security software vendors. The confluence of a slowly changing socio-technical culture and persistent attacks pours gas on the demand for security monitoring and detection tools. I guess the reasoning is something like, “It can’t change quickly, so change is a sign of a malicious actor.” We’ve resigned ourselves to living with slowly changing infrastructure, so we spend lots of money monitoring it for change and hoping for the best.
This is a decision to treat the symptoms rather than cure the disease. In fact, many security monitoring solutions embody a self-fulfilling prophecy — since updating a system often looks like an attack, you either stop paying attention to alerts or you resist updates. Both help attackers. I’m not saying all monitoring and detection is unnecessary — I’m asserting that it’s palliative treatment at best. Most monitoring solutions help enterprises deal with attacks in about the same way a table knife helps one eat a bowl of soup.
If you identify with the above reasoning, then it’s natural to wonder about the cure. I don’t have all the answers, but I believe the key is to starve attacks of the resources they need to grow into monsters. Rotate credentials frequently so they are only useful for short periods of time. Repave servers and applications from a known good state to cut down on the amount of time an attack can live. Repair vulnerable software as soon as updates are available.
Rotate, repave, repair. I call these the three Rs of enterprise security.
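To make the rotation cadence concrete, here’s a minimal sketch in Python of what a rotation loop might look like. The `CredentialStore` class and its methods are hypothetical stand-ins for whatever secrets backend you actually use; the point is the short interval, not the API.

```python
import secrets
import time
from dataclasses import dataclass

ROTATION_INTERVAL_SECONDS = 15 * 60  # a leaked value is only useful for ~15 minutes


@dataclass
class Credential:
    name: str
    value: str
    issued_at: float


class CredentialStore:
    """Hypothetical stand-in for a real secrets backend."""

    def __init__(self) -> None:
        self._active: dict[str, Credential] = {}

    def issue(self, name: str) -> Credential:
        cred = Credential(name, secrets.token_urlsafe(32), time.time())
        self._active[name] = cred
        return cred

    def revoke(self, name: str) -> None:
        self._active.pop(name, None)


def rotation_loop(store: CredentialStore, names: list[str]) -> None:
    """Re-issue every credential on a fixed interval so old values die quickly."""
    while True:
        for name in names:
            store.revoke(name)   # invalidate the old secret
            store.issue(name)    # mint a fresh one; push it to consumers here
        time.sleep(ROTATION_INTERVAL_SECONDS)


if __name__ == "__main__":
    rotation_loop(CredentialStore(), ["db-password", "api-token"])
```

In a real platform, the push step would hand new values to running applications without downtime, which is exactly the plumbing a platform has to provide for short-lived credentials to be practical.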
Starving Attacks with the Three Rs — Rotate, Repave, and Repair
At high velocity, the three Rs starve attacks of the resources they need to grow. It’s a complete 180-degree turn from the traditional posture of carefully avoiding change to mitigate risk. Go fast to stay safer; in other words, speed reduces risk.
To an attacker, it’s like playing a nearly unsolvable video game. She needs to get to level 100, but she can’t get past level 5 because there’s not enough time. On top of that, what worked on the first try no longer works on the 20th.
The promise of the three Rs is profound.
It’s no secret that I’m biased towards Pivotal Cloud Foundry, BOSH, Concourse, and OpsManager. By themselves and in their current state, these products change the attacker/defender game dramatically. It’s totally possible to repave every VM in your datacenter from a known good state every few hours without application downtime. Deploy your applications from CI, and your application containers will also be repaved every few hours. Our patch turnaround time for the entire stack is second to none, and you can deploy those patches to your entire datacenter with a few clicks of a mouse.
We’ve got the repave and repair angles pretty well covered with OpsManager, BOSH, Concourse, and Pivotal Web Services. For example, all the VMs in a Pivotal Cloud Foundry cluster are built from a base image called a stemcell. Though it’s not yet a default option, it’s possible to repave every VM in the cluster on an interval of your choosing with BOSH. Pivotal Web Services automatically updates buildpacks to ensure application environments are always patched. We’re still working on automated credential management, so stay tuned for updates on that front.
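As a rough illustration of the repave cadence, here’s a sketch that invokes the BOSH CLI’s `recreate` command on a timer so every VM in a deployment is rebuilt from its stemcell every few hours. It assumes the newer `bosh -d DEPLOYMENT recreate` CLI syntax, the deployment names are placeholders, and in practice you’d drive this from a scheduler such as a Concourse pipeline rather than a hand-rolled loop.

```python
import subprocess
import time

REPAVE_INTERVAL_SECONDS = 4 * 60 * 60   # repave every four hours
DEPLOYMENTS = ["cf", "my-services"]     # placeholder deployment names


def repave(deployment: str) -> None:
    """Ask BOSH to rebuild every VM in the deployment from its stemcell.

    BOSH applies its usual canary and max-in-flight update settings,
    which is what keeps the platform available while instances are replaced.
    """
    subprocess.run(["bosh", "-n", "-d", deployment, "recreate"], check=True)


if __name__ == "__main__":
    while True:
        for deployment in DEPLOYMENTS:
            repave(deployment)
        time.sleep(REPAVE_INTERVAL_SECONDS)
```

The attacker-facing effect is the same as in the video game analogy above: anything an attacker managed to plant on a VM disappears at the next repave.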
From a security point of view, I can’t think of a reason not to embrace this model immediately. Regardless of the tools you’re investigating, I encourage you to consider the three Rs when evaluating the security of your cloud platform. If a tool doesn’t help you get there, then it’s probably best to run away.