
Site Reliability Engineering in VMware Environments

Kevin Lees

Field Chief Technologist for IT Operations Transformation

 

Companies seeking to increase the velocity and reliability of solutions within their digital business should shift their software development efforts “further to the right” into infrastructure and operations (I&O) teams by adopting tenets of Site Reliability Engineering (SRE). The SRE ethos was conceived at Google to help the company run its products and services smoothly, efficiently and reliably at scale.

SRE is defined as “what happens when you ask a software engineer to design an operations team.” SRE practitioners analyze business services to determine the availability they actually require (which is seldom 100%) and then specify the operational strategy, including deployment frequency, to meet that requirement. This is often a fine balancing act between maintaining the desired availability and getting new features to users faster.
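The balancing act described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a method prescribed by the white paper: it assumes a 30-day measurement window and an estimated average downtime cost per deployment, and the function names `allowed_downtime_minutes` and `deployments_affordable` are hypothetical names chosen for illustration.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the minutes of downtime a service may accrue in the window
    while still meeting its availability target (e.g. 0.999 for 99.9%)."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def deployments_affordable(
    availability_target: float,
    expected_minutes_lost_per_deploy: float,
    window_days: int = 30,
) -> int:
    """Estimate how many deployments fit in the window, assuming each one
    costs a fixed average amount of downtime (an illustrative assumption)."""
    if expected_minutes_lost_per_deploy <= 0:
        raise ValueError("expected_minutes_lost_per_deploy must be positive")
    budget = allowed_downtime_minutes(availability_target, window_days)
    return int(budget // expected_minutes_lost_per_deploy)


if __name__ == "__main__":
    # Hypothetical inputs: a 99.9% target and 5 minutes lost per deployment.
    print(round(allowed_downtime_minutes(0.999), 1))
    print(deployments_affordable(0.999, 5.0))
```

Under these assumptions, tightening the availability target shrinks the downtime allowance and therefore the number of deployments the service can absorb, which is the trade-off between availability and feature velocity that the paragraph describes.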

VMware CEO Pat Gelsinger talks about “the gap” between infrastructure, the teams that manage infrastructure, and the “crazy application folks.” The developers are concerned with creating new features and bringing them to market as quickly as possible. The I&O team is concerned with operational requirements: security, compliance, governance, and the reliability of the virtual environments used (VMs, containers) to reduce risk and maintain stability.

This gap slows the business in meeting its desired outcomes and generating shareholder value. DevOps has long been hailed as the solution to these problems, and SRE, as a superset of DevOps principles, promises to provide a prescriptive and holistic approach to doing so.

The proliferation of software-defined environments has expanded the breadth of activities to which SRE concepts can be applied, because such environments encourage and accommodate far higher levels of programmability and automation. From a VMware field perspective, SRE concepts should be applied equally to IT service reliability. Services provided by IT can include application-based business services, the “traditional” SRE area of focus, as well as:

  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Containers as a Service (CaaS)
  • Other IT services such as desktop services or data analytics services

As with the applications that make up business services, SRE practitioners analyze IT services to determine their true reliability requirements, and then develop an operations strategy, including how frequently new capabilities are deployed, to meet those requirements. SRE practitioners also proactively define service frameworks that address operational considerations such as instrumentation and logging, as well as building reliability into the application itself, to help developers deliver applications that support operational reliability.

The primary purpose of this SRE white paper is to discuss the application of SRE concepts to maintaining IT service reliability in VMware software-defined environments. The paper covers:

  • Key concepts
  • SRE applied to VMware-supported environments (Software-Defined Data Center, Hybrid Cloud and Multi-Cloud, Cloud-Native and Hybrid Applications)
  • Operating model considerations (People and Process)
  • Adoption (our Advisory Services team is a good place to start)

 


Kevin’s focus is on helping customers optimize the way they operate VMware-supported environments and solutions. Kevin serves as an advisor to senior executives at global customers on their IT operations transformation initiatives. He also leads the IT Transformation activities in VMware’s Global Field Office of the CTO.