Site Reliability Engineering - Important Considerations

Imran Abbas, Director VMware Advisory Transformation Services

Important Considerations for SRE

In the world of DevOps, service management teams are adopting Site Reliability Engineering (SRE) to gain efficiencies from automation and service-oriented organizational structures. As the velocity of development and feature release increases, SRE teams must strike the right balance between speed and reliability. The considerations outlined below provide essential perspectives for instituting an effective SRE capability.

Maintain the Strategic Perspective

As SRE teams assume operational responsibility for services, it is essential to maintain the “big picture” perspective. SREs must understand the competitive landscape and the revenue and mission impact of the service they support. As the name indicates, reliability is the order of the day, and most outages follow the introduction of a change. SREs must therefore weigh the trade-offs between feature releases and system reliability. A strategic perspective allows development and SRE teams to make the right decisions based on a set of mutually agreed-upon metrics. In an ideal SRE environment, developers significantly outnumber SREs; in some cases, the ratio is 500 to 1. SREs must strike the right balance between speed and a reliable platform, and at that ratio it is imperative that they provide a clear set of technical and functional requirements for developers to follow. If requirements and service levels lack clarity, SREs will be viewed as a bottleneck. Investing in a Platform as a Service (PaaS) solution (or developing one) will also help manage rapid development, improve SRE effectiveness and reduce downtime.
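One common way to turn a mutually agreed-upon metric into a release decision is an availability target with a matching downtime allowance, often called an error budget. The article does not prescribe this technique, so the sketch below is only an illustration: the function names, the 30-day window and the 99.9% target are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) allowed by an availability target.

    slo_target is a fraction such as 0.999; window_days is an assumed
    rolling window, not a value taken from the article.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def can_release(slo_target: float, downtime_minutes: float,
                window_days: int = 30) -> bool:
    """Hypothetical release gate: allow feature releases only while
    observed downtime is still below the allowed budget."""
    return downtime_minutes < error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # A 99.9% target over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(can_release(0.999, downtime_minutes=20.0))  # budget remains
    print(can_release(0.999, downtime_minutes=50.0))  # budget exhausted
```

With a gate like this, the conversation between developers and SREs can shift from opinion to a shared number: while budget remains, releases proceed; once it is spent, reliability work takes priority.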

Effective Incident Management

SRE teams are now responsible for mission- and revenue-critical systems that must be up at all times. While managing operations, SRE teams have to maintain high availability and minimize the service impact of lifecycle functions. The teams are responsible for maintaining not only the application but also the underlying infrastructure that supports it. Troubleshooting these complex environments can lead to high stress and impaired decision-making. Therefore, it is vital to establish clear processes and procedures for incident management and maintenance activities. Outages are inevitable. The goal should be to minimize their impact and prevent the same failure from happening again. Following a system failure, teams should focus on documenting events, timelines, and lessons learned rather than on assigning blame to team members. Automating repeatable tasks will improve efficiency and reduce manual errors.
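A post-incident record that captures events, timelines and lessons could be structured along these lines. This is a minimal sketch, not a prescribed format: the class names, field names and sample incident are hypothetical, and descriptions deliberately record what happened rather than who did it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class TimelineEvent:
    at: datetime
    description: str  # what happened, not who did it


@dataclass
class IncidentReview:
    title: str
    events: List[TimelineEvent] = field(default_factory=list)
    lessons: List[str] = field(default_factory=list)

    def record(self, at: datetime, description: str) -> None:
        """Add an event and keep the timeline in chronological order."""
        self.events.append(TimelineEvent(at, description))
        self.events.sort(key=lambda event: event.at)

    def duration(self) -> timedelta:
        """Time between the first and last recorded events."""
        if not self.events:
            return timedelta(0)
        return self.events[-1].at - self.events[0].at


if __name__ == "__main__":
    review = IncidentReview("Checkout latency spike (illustrative)")
    # Events can be recorded in any order; record() sorts them.
    review.record(datetime(2024, 1, 1, 10, 45), "Rollback completed, latency normal")
    review.record(datetime(2024, 1, 1, 10, 0), "Alert fired after configuration change")
    review.lessons.append("Add a canary stage before full rollout")
    print(review.duration())
```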

Review Culture

Implementing a “review” culture in the development and operational phases drives high-quality releases and minimizes service outages. In a collaborative setting, development teams should have clear guidelines and metrics in place to identify the completion and success of application updates. For example, an active code review team will identify potential problems early in the development process. The earlier a problem is detected, the sooner it can be fixed, which avoids significant integration issues later. Ensuring that SREs are part of these review conversations will make the teams more effective and lead to higher-quality code releases. Balanced feedback is also essential to establishing a productive review culture. Too often, quality assurance processes are laser-focused on finding mistakes but do a poor job of recognizing quality work. With balanced feedback, both the development and SRE teams continually strive to raise the bar.

Advocate Team Learning

Team members want to know they are part of something special. Establishing a mentorship culture allows seasoned team members to share their knowledge and develop less experienced staff while maintaining high-quality output. Teams should also empower senior staff to make some critical decisions without managerial oversight. This approach avoids micromanagement and allows managers to focus more on strategic initiatives. A distributed knowledge base will also significantly reduce “key person risk” and drive a consistent operating and development model.

As Director of VMware Advisory Transformation Services, Imran is responsible for the Advisory Transformation Services (ATS) team supporting Government, Education and Healthcare.
