In a recent post, What is Site Reliability Engineering?, we introduced the Site Reliability Engineering (SRE) model for managing infrastructure, covered some of its key aspects, and explained why businesses should consider adopting elements of this model to improve the reliability of their services – SRE is not just for large web companies. We also identified some technology capabilities that are complementary to managing infrastructure using the SRE methodology. This post dives a little deeper into those capabilities. The three technology capabilities discussed below can certainly be adopted in isolation for measurable benefit, but when combined as part of a broader SRE solution they form a cohesive, integrated system that drives toward greater business outcomes.
These technical capabilities are guided by two key themes introduced in the previous post:
- Development efforts ‘shifting to the right’ to bring value to I&O teams
- Balancing the desired reliability with the need for software deployment velocity
Platform Standardization and Infrastructure as Code (IaC)
The objective of an IaC initiative is to define all application platforms, and as much infrastructure as possible, as code – as similar as possible, ideally identical – and to deploy them using automated processes.
Standardizing on an IaC-first approach to all environments will help bring consistency. Standardizing on the platform will take this consistency even further. Using a platform like Kubernetes, for example, means that every environment from development through to test shares a standardized YAML-based configuration interface. This can also help abstract away differences in the underlying infrastructure.
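As a sketch of that standardized interface, a minimal Kubernetes Deployment manifest might look like the following (the application name, image, and port are hypothetical); the same file can be applied unchanged to a development, test, or production cluster:

```yaml
# Minimal Deployment manifest – the same declarative interface
# regardless of which cluster or underlying infrastructure it targets
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web          # hypothetical application name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:1.0   # hypothetical image
          ports:
            - containerPort: 8080
```

Because the desired state is declared rather than clicked into a GUI, applying this file to a cluster in a known state produces the same result every time.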
Code, as opposed to configuration changes made in a GUI, is immutable, so each time you apply a script or configuration file to a piece of infrastructure in a known state you can expect the same result – it is far less susceptible to fat-finger errors. This repeatability extends from development, through testing, and into production, since the same (or as near as possible to the same) scripts and configuration files can be used to deploy the application at each stage. This repeatable deployment consistency, and hence increased reliability, breeds additional confidence, which in turn leads to a greater number of total deployments – increased deployment velocity.
Once we begin to define infrastructure as code, we find that we accumulate a lot of scripts and configuration files. These files need to be stored somewhere, and they need to be shared and updated by multiple staff. This is where traditional development source control tools such as Git come in. Once this infrastructure code is stored in a Git repository, we can start to automate test infrastructure deployments triggered by new code check-ins.
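As one illustration of a check-in-triggered automation, a CI system watching the repository can validate every push. The sketch below uses GitHub Actions syntax, but any CI tool with a push trigger works the same way; the `manifests/` directory is an assumption about the repository layout:

```yaml
# Hypothetical workflow: validate Kubernetes manifests on every check-in
name: validate-infrastructure
on: [push]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # A client-side dry run catches syntax and schema errors
      # without touching a live cluster
      - run: kubectl apply --dry-run=client -f manifests/
```

The same trigger mechanism can later be extended to deploy the validated manifests into a test environment automatically.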
As development teams shift to the right, I&O teams shift to the left and embrace traditional development standards and methodologies, including the aforementioned version control tools and the automation tools we have yet to cover. Both teams are now standardizing on similar tools and speaking the same language, meaning that an I&O team member can join an application management team much earlier in the development process. Working side by side with application developers, I&O staff can assist with application design and architecture, bringing operational and reliability requirements to the forefront of the development process. With greater communication and a confluence of tools and processes, a new deployment, be it:
- A platform change
- An infrastructure change
- An application code change
can be completed in the same way, using the same tools and the same processes, resulting in faster and more successful deployments with less friction.
Continuous integration and continuous deployment, often abbreviated to CI/CD, has been referred to as the crown jewel of digital transformation, and with good reason. Now that we are defining our infrastructure as code, storing it in version control, and have standardized our application deployment platform, we have a great basis for automation.
One of the key benefits of automating the CI/CD process is speed. This speed is made possible by a number of key qualities:
- Infrastructure defined as code can be consistently and repeatably deployed requiring minimal or ideally no manual inputs
- Consistent application platform state across development environments through to production leads to more consistent application deployments
- With humans minimally involved, the process is simply faster
The additional speed of deployment, of both the application platform and the application, enables application development teams to schedule a larger number of smaller application changes.
In the paragraphs above we have discussed automated CI and CD. In some cases this automation can carry code right through testing and into production with no human intervention. In all likelihood, though, an enterprise deployment system will have at least one gate before production. That ‘gate’ is controlled by a release manager who at this point will be confident that:
- The change is small
- It has been developed on a platform that is consistent with the production target platform
- The deployment scripts have been tested in a staging environment that is consistent with the production platform
- It can be implemented quickly
Yet it still represents a change, and as such there is a risk that, even though the above points are true, the change introduces something unexpected. This is where canary releases can be useful.
A canary release involves a release manager introducing a single node running the new code – in the case of Kubernetes, usually a single container – so that it responds to a small percentage of production requests. While in production, this canary node is carefully tested and monitored to confirm it functions correctly. Once the canary release results are in, a decision can be made on whether to progress to a blue/green deployment.
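In Kubernetes, one way to sketch this is with two Deployments behind a single Service: because the Service selector matches both, traffic splits roughly in proportion to replica counts. The names, labels, and replica counts below are illustrative:

```yaml
# Stable version: nine replicas serve ~90% of requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-stable
spec:
  replicas: 9
  selector:
    matchLabels: {app: web, track: stable}
  template:
    metadata:
      labels: {app: web, track: stable}
    spec:
      containers:
        - name: web
          image: example/web:1.0   # current release (hypothetical)
---
# Canary: a single replica receives ~10% of requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-canary
spec:
  replicas: 1
  selector:
    matchLabels: {app: web, track: canary}
  template:
    metadata:
      labels: {app: web, track: canary}
    spec:
      containers:
        - name: web
          image: example/web:1.1   # candidate release (hypothetical)
```

The Service in front selects only `app: web`, so both tracks receive traffic; the `track` label lets monitoring distinguish canary results from the stable fleet.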
Blue/green deployments progressively introduce new nodes to accept production traffic. This is achieved by leaving existing nodes to complete in-progress requests while routing new traffic to the new nodes. Using this method, the release manager can be confident that new or updated nodes can be introduced without interruption to production traffic, achieving a clean cut-over to the updated application. This could happen over a matter of minutes or, for more conservative releases, over a period of days. The exact details of the cut-over are dependent on the application architecture.
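One common way to implement this on Kubernetes is to run the ‘blue’ and ‘green’ versions as separate Deployments and re-point the Service selector at cut-over time. The labels and ports below are illustrative:

```yaml
# Cut-over is a one-field change: the Service selector is switched
# from the old ("blue") Deployment to the new ("green") one
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    version: green   # was "version: blue" before the cut-over
  ports:
    - port: 80
      targetPort: 8080
```

New requests route to the green pods while existing connections to blue drain naturally, and reverting is a one-line rollback of the selector.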
Implementing canary releases and blue/green deployments as part of a release strategy provides a release manager with valuable input, greater flexibility, and more confidence during deployments of code updates. It is therefore important to select an infrastructure platform, such as Kubernetes, that easily enables this functionality.
While this post focuses on technology capabilities, the broader focus of this series of posts is SRE as a methodology and mindset. The technology is of course an important enabler of the overall outcome, but technology alone – e.g. just deploying Kubernetes – will not necessarily achieve a successful outcome. We highlight the capabilities of Kubernetes in this post, but it is by no means essential for success. When the SRE methodology is applied correctly, excellent results can be achieved with a myriad of different configuration management tools, automation systems, and application platforms.
James Wirth works in the Professional Service Engineering Team designing services solutions for VMware customers. He is a proven cloud computing and virtualization industry veteran with over 10 years’ experience leading customers in Asia-Pacific and North America through their cloud computing journey. @jameswwirth