by Brian Smith, Senior Director, Cloud and Productivity Engineering, and Shane van Bentum, Director, Cloud Operations, VMware
VMware’s private cloud is continually evolving as new services are added to meet business demand. As in many established operations teams, Senior Site Reliability Engineers (SREs) were assigned to architecture, engineering, and development, while Junior SREs focused more on troubleshooting and daily operations.
Over time, the accelerated maturing of existing services, along with demand for new services, meant this operational model was no longer optimal.
To address the growing chasm between the skill sets of junior and senior engineers, we adopted a DevOps mindset to integrate architecture and build with ongoing maintenance and service operations. The more collaborative environment has resulted in greater automation and self-healing in our private cloud, fewer recurring issues, and greater operational efficiency. Details of our journey are explained in this article.
People and Process
VMware IT’s private cloud team is constantly deploying new services as requested by internal customers. Responsibilities were divided as follows: Senior SREs were tasked with developing new services, and ongoing maintenance became the responsibility of Junior SREs once a service was in production. Over time, the teams evolved into silos. Often, not enough time was spent on the transition from the development team to the operations (production) team. This left the junior engineers maintaining services they hadn’t been fully trained on.
Over time, chronic issues began to surface. Junior engineers tended to band-aid the issues because they saw the symptoms yet lacked the expertise to identify higher-level root causes. They did not have the training, experience, or technical background needed to create self-healing tasks or prevent future outages by modifying the service. Senior engineers were often not aware of the issues as they focused on developing new services. The knowledge gap between the two groups resulted in increasingly long transition periods, operational toil (wasted time), and lower service levels.
The solution was to adopt a DevOps mindset across the services teams, where ‘build, own, and operate’ became a core principle and team collaboration was the foundation for continual service improvement. Each team, composed of both senior and junior engineers, was given the responsibility to build, own, and operate its own set of services. Senior engineers were charged with mentoring their less experienced counterparts while junior team members were encouraged to improve their skills.
Actual implementation took a multi-pronged approach. A ‘guru’ rotation was created to manage all unplanned work, enabling the rest of the team to focus on strategic initiatives. Engineers now undergo SRE training, which emphasizes automation and self-healing skill sets using code development. Backlog management is now highly agile, and members rotate in two-week software development sprints. The program was rolled out to regional teams across the globe, empowering them to own and execute their own service enhancements.
New Goal Posts
The new operational model focuses on development and solving operational issues with code. Our team metrics reflect this change. We switched from measuring manual effort and speed of response to KPIs that track auto-remediation targets, automation of daily/repetitive tasks, self-healing/issue avoidance, and the building out of infrastructure-as-code. The metrics help the team focus on improving efficiency and productivity while tearing down the barriers to taking preventive action.
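As a rough illustration of what an auto-remediation KPI could look like in practice, the sketch below computes the share of incidents that were closed by automation and compares it with a target. It is not the tooling our team uses; the Incident record, its fields, the sample data, and the 70 percent target are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    service: str
    auto_remediated: bool  # True if automation closed it without manual work


def auto_remediation_rate(incidents):
    """Return the fraction of incidents resolved by automation (0.0 if none)."""
    if not incidents:
        return 0.0
    resolved_by_code = sum(1 for i in incidents if i.auto_remediated)
    return resolved_by_code / len(incidents)


def meets_target(incidents, target):
    """Check the measured rate against a team-defined target (e.g. 0.7)."""
    return auto_remediation_rate(incidents) >= target


# Made-up sample data for illustration only.
sample = [
    Incident("dns", True),
    Incident("dns", False),
    Incident("storage", True),
    Incident("storage", True),
]

print(f"Auto-remediation rate: {auto_remediation_rate(sample):.0%}")
print("Target met:", meets_target(sample, 0.7))
```

Tracking a ratio like this per sprint, rather than response time alone, is one way to keep attention on removing recurring manual work.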
Launching this change was daunting to both the leadership and the team because of the challenges inherent in changing mindsets, skill sets, and processes. Nearly six months into the journey, it is evident that this new model has been the right choice. The focus on DevOps has given our engineers a new passion to solve and prevent problems through automation. End-user satisfaction is at an all-time high. Best of all, the private cloud operations team has become a much more efficient and agile organization. We are able to meet the increasing demand for new services while still evolving the maturity of existing services, all without adding significant headcount or development time. We continue to look for ways to facilitate this collaboration so that team members can more easily see options for increasing their knowledge and advancing their careers.
To learn more, attend VMworld 2018 session LDT1899, Embracing a DevOps Cloud Operational Model for Managing a SDDC.
VMware on VMware blogs are written by IT subject matter experts sharing stories about IT’s transformation journey using VMware products and services in a global production environment. Visit our portal to learn more or follow us on Twitter: @VMWonVMW.