Companies that are embracing cloud-native practices are also standing up Site Reliability Engineering (SRE) in their organizations. They are defining the right metrics to build resilient systems, and they are automating to eliminate toil.
Royal Bank of Canada (RBC), one of Canada’s largest banks serving 16 million clients worldwide, is a prime example of taking SRE seriously. In a session at SpringOne Platform, RBC spoke about its mission to provide next-generation platforms and services via a multi-cloud strategy.
“We are enabling the bank to transform legacy apps into cloud native and to adopt agile principles.” —John Keenleyside, Director of Cloud Platform Engineering at RBC
The company realized that traditional operations workflows, such as waiting for a user to log a help desk ticket and then working through a lengthy resolution process, were not the best model for supporting cloud-native applications. RBC needed to offer a platform that had high availability, supported low-latency transactions, and included proactive measures for monitoring and alerting. Enter Pivotal Platform running on VMware infrastructure.
RBC has a two-pizza platform team managing four Pivotal Platform foundations that host more than 7,200 application instances from more than 200 associated application teams. How does this small platform team do it all?
Making informed decisions with SRE
Key to the company’s strategy is the adoption of an SRE culture, which the team uses to make informed decisions about running Pivotal Platform as a product.
“During this journey, we set our minds to applying software engineering principles to solve operational challenges associated with our on premise cloud platform.” —Ron Cuffy, Cloud Platform Engineer at RBC
RBC has taken the best practices of SRE and interpreted them for the organization. The team started by educating themselves and working with Pivotal engineers. They then worked closely with one of their more advanced, agile development teams to define measures that would meet the needs of the applications and still be achievable by the platform team. They suggest approaching SRE as an ongoing journey that never ends: you will keep iterating and improving.
In their session, the RBC team covers in detail how they put Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to work. They also cover many best practices for measuring and alerting based on their experience.
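To make the SLI and SLO vocabulary concrete, here is a minimal sketch of how an availability SLI and its remaining error budget are commonly computed. It is not taken from RBC's session; the names `Slo`, `availability_sli`, and `error_budget_remaining`, and the sample numbers, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A hypothetical service level objective, e.g. 99% of requests succeed."""
    name: str
    target: float  # fraction between 0 and 1


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; treated as 1.0 when there is no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is blown)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers only: 1,000 requests, 5 failures, 99% objective.
slo = Slo(name="app-push-availability", target=0.99)
sli = availability_sli(995, 1000)
budget = error_budget_remaining(slo, 995, 1000)
print(f"SLI={sli:.3f} meets_slo={sli >= slo.target} budget_left={budget:.0%}")
```

An alert that fires when the remaining budget drops below a chosen threshold is one way to tie alerting to the objective rather than to individual failures.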
Reducing toil with automation
With a strong focus on reducing toil, the RBC cloud platform team also worked out how to automate their processes so that they meet audit and compliance policies while staying agile. Their underlying goal was to treat the platform as a product with its own backlog.
They found Concourse to be a great solution for getting teams onboarded, repaving BOSH, and keeping the platform up to date.
“Concourse is working hard so we don’t have to.” —John Keenleyside, Director of Cloud Platform Engineering at RBC
The platform team covers how they used Pivotal Platform Automation to keep their configurations in sync. They automated product configuration and upgrades by following infrastructure-as-code principles, storing configurations and versions in a Git repository. They now consider any manual change a bad practice.
Their Platform Automation pipeline validates that the staged configuration matches the configuration in the Git repo.
RBC Platform Automation Pipeline with Concourse
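The validation step described above amounts to drift detection: compare the configuration held in Git with what is staged on the platform and fail if they differ. The sketch below illustrates that idea on plain dictionaries. It is not RBC's pipeline or the Platform Automation API; the function `diff_config` and the sample product settings are hypothetical.

```python
def diff_config(desired: dict, staged: dict, path: str = "") -> list[str]:
    """Return one human-readable line per setting that differs between the two configs."""
    diffs = []
    for key in sorted(set(desired) | set(staged), key=str):
        here = f"{path}.{key}" if path else str(key)
        if key not in staged:
            diffs.append(f"{here}: missing from staged config")
        elif key not in desired:
            diffs.append(f"{here}: not present in Git")
        elif isinstance(desired[key], dict) and isinstance(staged[key], dict):
            # Recurse into nested sections so the report names the exact setting.
            diffs.extend(diff_config(desired[key], staged[key], here))
        elif desired[key] != staged[key]:
            diffs.append(f"{here}: Git={desired[key]!r} staged={staged[key]!r}")
    return diffs


# Hypothetical values: in a real pipeline these would be loaded from the Git
# repository and from the platform's staged configuration.
git_config = {"product": {"version": "1.2.1", "instances": 3}}
staged_config = {"product": {"version": "1.2.0", "instances": 3}}

drift = diff_config(git_config, staged_config)
for line in drift:
    print(line)
# A pipeline job would exit non-zero when drift is non-empty, blocking the change.
```

Reporting the full path of each differing setting, rather than a simple pass or fail, makes it easier to trace a manual change back to its source.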
Chaos testing is another cloud-native technique the team is introducing. They see it as a preventive measure that helps platform teams understand how their systems react to failure and where to improve. They have introduced game days, in which the team role-plays its response to chaotic scenarios. This helps the team understand Pivotal Platform better and build confidence.
Check out the session for all the details.