Improve Operational Efficiency with Site Reliability Engineering and Automation

As organizations deploy the latest infrastructure technologies, teams must also modernize their operations model (people and process) to accommodate technological changes and provide maximum value. Consider adding site reliability engineering and automation to your future plans to improve operational efficiency and, ultimately, value.

The value of site reliability engineering

Consider this example: Your infrastructure team decides to transform your existing infrastructure as a service (IaaS) into a VMware Cloud Foundation-based IaaS environment to manage a centralized, hyperconverged infrastructure. Your goal is to accelerate application provisioning, provide consistent automated lifecycle management, and secure infrastructure and operations across your environment.

To achieve this level of automated lifecycle management, a consistent runbook is a critical component of infrastructure operations. While it’s good to maintain and follow a runbook, today’s world of changing business needs and fast-paced technology advancement also requires effective scaling of your IaaS environment. Teams can scale with guidance from a site reliability engineer (SRE). An SRE runs reviews and checks, and verifies that the tools in your environment are at the correct versions across the VMware suite of products.

The SRE focuses on improving speed, reliability and quality of infrastructure services. The SRE applies software development and engineering principles and practices to infrastructure and cloud operations using an iterative and agile approach. SRE teams use software as a tool to manage systems, solve problems and automate operations tasks. They collaborate with engineers or ops teams on operational tasks to solve problems and manage production systems. 

A runbook should align with a dynamic responsibility assignment matrix, or RACI chart, and identify the “what,” “who,” “frequency,” and “possibility of automation” elements. This simple table should identify all operational tasks your IaaS requires to run effectively (the “what”) and tag each task with the responsible role (the “who”) and the frequency of execution (the “frequency”). The last column should record an initial evaluation of each task: whether it can be automated (the “possibility of automation”) and which toolset team members will use to accomplish it.
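As a minimal sketch, the runbook table described above could be kept as structured data so that it can be filtered and reviewed programmatically. The task names, roles, frequencies and toolsets below are hypothetical examples chosen for illustration; they are not taken from a VMware runbook.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RunbookTask:
    what: str                      # the operational task
    who: str                       # the responsible role
    frequency: str                 # how often the task is executed
    automatable: bool              # initial evaluation: can it be automated?
    toolset: Optional[str] = None  # tool the team would use, if known


# Hypothetical rows; replace with the tasks your IaaS actually requires.
RUNBOOK: List[RunbookTask] = [
    RunbookTask("Rotate service account passwords", "Security admin",
                "quarterly", True, "script"),
    RunbookTask("Review capacity dashboards", "SRE", "weekly", False),
    RunbookTask("Apply ESXi patches", "Infrastructure ops",
                "monthly", True, "lifecycle tooling"),
]


def automation_candidates(tasks: List[RunbookTask]) -> List[RunbookTask]:
    """Return the tasks whose initial evaluation says they can be automated."""
    return [task for task in tasks if task.automatable]


def by_role(tasks: List[RunbookTask]) -> Dict[str, List[str]]:
    """Group task names by responsible role (the "who" column)."""
    grouped: Dict[str, List[str]] = {}
    for task in tasks:
        grouped.setdefault(task.who, []).append(task.what)
    return grouped
```

Calling automation_candidates(RUNBOOK) gives a starting backlog for automation work, and by_role(RUNBOOK) shows how the remaining manual tasks are distributed across roles.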

Automation is the secret to operational efficiency

Teams can automate these common operational areas to improve operational efficiency:

  • Automation agility (contribute, execute anywhere, execute anytime): Run any script anywhere, anytime to troubleshoot or remediate issues. This provides flexibility of coding language, a distributed ownership model and an abstracted workflow. Service teams can contribute scripts in multiple languages and execute them on any component in the fleet quickly and with minimal investment.
  • Event-driven operations (event-driven automation, self-heal or auto triage): Event-driven operations are the way to scale the operational activities of the service. Events are generated from multiple sources, including infrastructure, scanning tools or even initiated events, such as those coming from support tools. An event-driven operational architecture should receive these events and be able to execute the operations that operations personnel would normally perform, reducing overall churn. The goal is not to replace operations personnel but to increase the component-to-operator ratio, so that the service can scale to manage larger numbers of infrastructure components and operations personnel can focus on solving the more nuanced issues that can’t be automated.
  • Operational view of tenant (operational data, single point of entry, troubleshooting, communication): With Optimus, the goal is to operate on a single-tenant SDDC in a simple way, either without having to access the SDDC directly or through a single-pane view of the SDDC.
  • Automation tooling convergence (VCF, VMware Skyline, hyperscaler): Aim to achieve consistent SRE operational tooling by onboarding users and processes. Here are some examples:
    • Create user in vCenter. Use the newly created user to set up a vCenter connection in vRA.
    • Create user in vROps. Use the newly created user to set up a vROps connection in vRA.
    • Create user in NSX-T. Use the newly created user to set up an NSX-T connection in vRA.
  • Upgrades and patching: Manually upgrading at enterprise scale is never a quick process, and upgrades should be combined with emergency management so that automation can more easily detect pitfalls. This area focuses on automated lifecycle management of the various cloud components, so that quick updates and patches improve the reliability and security of the infrastructure.
  • Security hardening: The following settings should be applied on each ESXi server that is part of the VCF instance:
    • Installation ESXi
    • General System Settings ESXi
    • Auditing and Logging ESXi
    • Console ESXi
    • Storage Settings
    • Network Settings
    • VM Settings
  • Autoscaling (vertical and horizontal): Prepare automation that can onboard an ESXi host for VCF usage and can expand an existing cluster with the newly onboarded ESXi host. Some SDDC manager steps include:
    • Create or select network pool
    • Validate host
    • Commission host
    • Expand cluster
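
The host onboarding steps listed above lend themselves to an ordered workflow that stops at the first failure. The sketch below shows only that control flow. Every function body is a hypothetical placeholder that updates a dictionary; none of them call SDDC Manager or any other VMware API, and the pool name and host name are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A step is a display name plus a function that returns True on success.
Step = Tuple[str, Callable[[Dict], bool]]


def select_network_pool(ctx: Dict) -> bool:
    # Placeholder: reuse a pool from the context or fall back to a made-up default.
    ctx.setdefault("network_pool", "default-pool")
    return True


def validate_host(ctx: Dict) -> bool:
    # Placeholder check: a real validation would inspect the host itself.
    return bool(ctx.get("host_fqdn"))


def commission_host(ctx: Dict) -> bool:
    # Placeholder: mark the host as commissioned.
    ctx["commissioned"] = True
    return True


def expand_cluster(ctx: Dict) -> bool:
    # Placeholder: record the host as a member of the cluster.
    ctx.setdefault("cluster_hosts", []).append(ctx["host_fqdn"])
    return True


STEPS: List[Step] = [
    ("Create or select network pool", select_network_pool),
    ("Validate host", validate_host),
    ("Commission host", commission_host),
    ("Expand cluster", expand_cluster),
]


def run_steps(ctx: Dict, steps: List[Step] = STEPS) -> List[str]:
    """Run the steps in order and return the names of those that succeeded.

    Execution stops at the first failing step, so later steps never run
    against a host that did not pass an earlier one.
    """
    completed: List[str] = []
    for name, step in steps:
        if not step(ctx):
            break
        completed.append(name)
    return completed
```

For example, run_steps({"host_fqdn": "esxi-05.example.internal"}) completes all four steps, while run_steps({}) stops after the network pool step because validation fails. In a real environment, each placeholder would be replaced by a call into whatever automation tooling the team has standardized on.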

Teaming up with an SRE can accelerate your organization’s path to a healthier and more efficient technology environment. Plus, you can harness the full power of your technology investment to improve operational efficiency and realize value.

Learn more about SRE practices from the experience highlighted in the article, “Modern SRE Practices for Incident Management.”

Contributing to this article were Priya Pandian, Senior Manager and Site Reliability Engineer in Cloud Operations, and Vladislav Vladimirov, Senior Manager in Solutions and Global Enablement.