
“Are we ready?” How a simple question redefined VMware production readiness (PRA)

by: VMware VP of Engineering John Tompkins; VMware Senior Director, SaaS Operations Raanan Kesar; VMware Staff II, Cloud SRE Architect Tom Ralph; and VMware Senior Director Brian Smith

To compete successfully as a modern enterprise, VMware teams realized that the company had to transition from an on-premises enterprise solution provider to one delivering software-as-a-service (SaaS) offerings. This was a major shift in the way VMware operated, in both operations and mindset. It required disparate teams (often accustomed to working in silos) to work as a whole, designing and operating world-class products built to specific and similar standards regardless of function. This was no easy task, especially given that many products are mission-critical to VMware.

To ensure this transition (and future growth) goes as smoothly as possible, the Production Readiness Assessment (PRA) program was developed. Its mission is to help all teams involved in building and operating a service reflect holistically on all areas of production readiness (reliability, security, performance, CI/CD, business continuity, compliance, telemetry, monitoring, deployment, utilization & cost optimization, internationalization, API standards, dogfooding, and finally engagement & escalation), and to ensure these services meet specific requirements before they enter production or reach a major milestone. It was also tasked with instilling important intangible principles, such as objective self-awareness.

Composed of a ‘volunteer army’ of experienced service leaders, along with senior-level engineers and architects, the PRA team is not primarily intended to be a gatekeeper that prevents or delays product launches. Instead, our team members are proactively engaged from the early stages of development through to deployment and operations (building and operating). In a nutshell, PRA is there to answer the question “are we ready?” before a release.

How Does the Program Work?

[Figure: left part of the service lifecycle infographic, courtesy of Site Reliability Engineering, O’Reilly]

The process starts with a short kickoff meeting to explain how a PRA works and to create a personalized assessment document for each service. This document contains baseline questions to set the stage for a PRA day. Topics include the goals of the service, SLAs/SLOs/SLIs, network and data-flow diagrams, and similar. About a month before a planned release, PRA and service team representatives meet again to discuss the document, ask questions, share best practices, and cross-pollinate solutions to avoid duplicated effort. Next, the service team gives itself a service maturity rating in each area, using a Harvey Ball rating system against a predefined scoring rubric. The PRA team then meets to discuss prioritization objectives and shares them with the service team. Finally, the service team presents the service maturity rating, its notes, and the PRA team’s recommendations internally. After the entire PRA process is complete, KPIs are gathered to provide objective, data-based knowledge of service readiness.
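To make the self-assessment step more concrete, here is a minimal sketch of how a service team might record its own Harvey Ball ratings and pick out the areas to discuss first. It is an illustration only, not the PRA team's actual tooling: the five-level scale, the `lowest_rated` helper, and the `HALF` threshold are all assumptions, and the real scoring rubric is not described in this article. Only the list of areas is taken from the program description earlier in the article.

```python
from enum import IntEnum


class HarveyBall(IntEnum):
    """Assumed five-level Harvey Ball scale (empty to full)."""
    EMPTY = 0
    QUARTER = 1
    HALF = 2
    THREE_QUARTER = 3
    FULL = 4


# Production readiness areas named in the program description.
AREAS = [
    "reliability", "security", "performance", "CI/CD",
    "business continuity", "compliance", "telemetry", "monitoring",
    "deployment", "utilization & cost optimization",
    "internationalization", "API standards", "dogfooding",
    "engagement & escalation",
]


def lowest_rated(ratings, threshold=HarveyBall.HALF):
    """Return areas the team rated below `threshold`, weakest first.

    `ratings` maps each area to the team's own HarveyBall rating.
    The threshold is a hypothetical cut-off chosen for this sketch.
    """
    missing = [area for area in AREAS if area not in ratings]
    if missing:
        raise ValueError(f"no self-rating for: {', '.join(missing)}")
    below = [area for area in AREAS if ratings[area] < threshold]
    # sorted() is stable, so ties keep the order used in AREAS.
    return sorted(below, key=lambda area: ratings[area])


if __name__ == "__main__":
    # Example self-assessment: strong everywhere except two areas.
    ratings = {area: HarveyBall.FULL for area in AREAS}
    ratings["security"] = HarveyBall.QUARTER
    ratings["telemetry"] = HarveyBall.EMPTY
    for area in lowest_rated(ratings):
        print(f"{area}: {ratings[area].name}")
```

Because the ratings are entered by the service team itself, a structure like this records a self-assessment rather than a grade, which matches the intent of the program.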

Fighting the Misconceptions

Every service owner believes two things before experiencing a PRA: that it’s the same as a standard IT review board (ITRB), and that it’s another unproductive gate they must pass before release. The PRA process differs greatly from a traditional ITRB (typically a high-level process of checking boxes to make sure you didn’t forget anything) because it is a much deeper discussion of technology and processes. The ‘gate’ concern is mitigated once the service owner understands that the PRA team is composed of peers whose intent is to help, not hinder. Another important nuance is that the service team gets to self-assess at the end. They are never ‘graded’ by PRA colleagues.

The Proof Is in the Pudding

Since the program’s inception, the PRA team has earned the trust of the service teams, ‘raised the tide for all the boats’, and helped VMware’s services get better faster than any team could have managed alone. Service teams now look forward to second and third PRAs, and frequently engage PRA team members outside the formal process. In fact, the PRA team often takes part in regular service team operational calls to accelerate the “are we ready?” process even further.

VMware on VMware blogs are written by IT subject matter experts sharing stories about our digital transformation using VMware products and services in a global production environment.