By: Ryan Cartwright, VMware Senior Systems Engineer & Joseph Griffiths, VCDX & Solutions Architect
On October 5th, we will be hosting a Getting More Out of vRealize Automation webinar called Day 2 Operations.
Cloud Management Platforms (CMPs) are top of mind for almost every organization today. The drive toward CMPs becomes more pronounced as public and hybrid cloud adoption becomes mainstream. As organizations embrace environments outside their own private cloud, they quickly realize that they need tools to manage those clouds in order to realize cost savings. Each cloud environment has its own presentation layer for providing the required functions. Different formats, GUIs, APIs, and terms create pockets of knowledge inside your organization. These pockets of knowledge become costly to maintain, so organizations look to CMPs to orchestrate, automate, and standardize the user experience across clouds. User experience has become a key concern for IT; it reflects a realization that how the user feels about IT is just as important as how IT operates.
Orchestration vs Automation
There are many definitions of orchestration versus automation. For the purpose of this post, orchestration is defined as any operation that requires user input: for example, a server provisioning workflow that requires human approval before provisioning, or a workflow that migrates servers between clusters but requires inputs for the source and destination clusters. Automation is a self-contained operation that does not require human interaction. Common examples of automation are load balancing of resources between members of a cluster or automatic expansion of an operating system disk when it fills. Automation requires a higher level of validation and error checking in order to be autonomous; all of the human intelligence and possible variables have to be coded into the automation. Self-healing automated architecture represents the greatest cost savings, but it requires people and process to change. Orchestration represents a significant milestone toward automation.
Once a task has been orchestrated, it can be evaluated to determine whether automation is possible. Using orchestration as a stepping stone toward automation allows tasks to be broken into bite-sized chunks.
Quantifying business value
Most of the cloud management platform tools available today are focused on two moments in the service lifecycle: provisioning and de-provisioning. I believe this is because it is easy to measure the impact of automating these two moments. Almost every organization can identify how long it takes to provision a server. A common method for measuring cloud provisioning is to add up the following:
- Time to provision storage for the new server
- Time to deploy an operating system
- Time to deploy infrastructure services to the operating system
- Time to provision accounts on the operating system
- Time to deploy networking and firewall for base operation
- Time to load application server on to operating system
- Time to deploy code to application server
- Time to validate application
- Time to provide end users access to whole system
All of these metrics include potentially hundreds of steps, including the ITIL life cycle, and many full-time workers. On average, customers report that it takes them a month to complete all of these steps. To quantify the cost of provisioning you can use this formula: deployment time x number of deployments per year = cost of provisioning. De-provisioning can take just as long. Normally the process of de-provisioning is not as clean, and it represents much more risk. De-provisioning could include:
- Removal of machine
- Removal of machine from agents and authentication domains
- Removal of machine from firewall rules and networking
- Removal of machine from load balancers
- Removal of machine from CMDB
- Removal of machine from asset management
- Removal of machine from documentation
Almost all of these tasks represent risk because they require touching production environments to complete. Most of these steps are ignored because it is very hard to identify which changes are connected directly to the server being decommissioned. Firewall rules and network configuration tend to remain in place out of fear of what their removal might break. Consistency is notoriously bad with decommissioning: steps are often missed, leaving a growing pile of technical debt. Both of these processes are ripe for orchestration and automation, but they represent a small portion of the life of a server.
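The provisioning-cost formula above can be sketched in a few lines. This is an illustrative calculation only: the labor rate, hours, and deployment count are assumed example figures, not customer data.

```javascript
// Sketch of the cost-of-provisioning formula from the text:
// deployment time x number of deployments a year, converted to dollars
// with an assumed loaded hourly labor rate.
function annualProvisioningCost(deploymentHours, deploymentsPerYear, hourlyRate) {
  return deploymentHours * deploymentsPerYear * hourlyRate;
}

// Example: roughly a month of staff effort (160 hours) per deployment,
// 50 deployments a year, at an assumed $75/hour loaded rate.
var cost = annualProvisioningCost(160, 50, 75);
console.log(cost); // 600000
```

The same function works for de-provisioning; only the time and frequency inputs change.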
Diagram 1.1 illustrates the problem with focusing solely on day one, or provisioning, in your cloud management platform. The average server spends less than 1% of its total life in provisioning and de-provisioning; it spends years inside day two operations. During the provisioning and de-provisioning stages the service does not provide value to the business. It is only during the day two phase that it provides value, and this period represents the greatest rate of change and the greatest risk. Standardizing day two operations reduces risk and improves service availability. Day two operations are also the greatest cost to IT organizations in the form of staff hours, making them fertile ground for operating expense savings and optimization. To quantify the cost of day two operations you need an understanding of the following:
- How often the task is done
- How much time the task takes to complete
- The steps and people required to complete the task
- Complexity of the task
- Risk associated with failure to complete the task in a timely fashion
These metrics allow you to quantify the value of orchestrating or automating any particular task. Once you understand the cost of a task, justifying the time to automate it becomes easy. These data points are often mined from ITSM tools: tickets can track time to resolution and frequency of operations. It is often easier to rewrite a process than to duplicate it with orchestration, since years of caked-on process often make the original requirements unrecognizable. The business value of orchestration and automation falls into a few categories:
- A decrease in time to mission objectives
- An increase in productivity or effectiveness
- A reduction in operating expenses
- A reduction in risk
It is critical that you identify the value of each task in these terms, with associated cost savings. You can then quantify the value of each solution year over year to justify continued automation.
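Putting the frequency and duration metrics above together gives a simple annualized figure for any day two task. The inputs below (run frequency, minutes per run, labor rate, working days) are hypothetical examples used only to show the arithmetic.

```javascript
// Annualized cost of a recurring day-two task, from the metrics above:
// how often it is done, and how much staff time each run takes.
function annualTaskCost(timesPerDay, minutesPerRun, hourlyRate, workDaysPerYear) {
  var hoursPerYear = (timesPerDay * minutesPerRun * workDaysPerYear) / 60;
  return { hoursPerYear: hoursPerYear, cost: hoursPerYear * hourlyRate };
}

// A task run twice a day, 30 minutes per run, at an assumed $75/hour
// loaded rate over 260 working days:
var value = annualTaskCost(2, 30, 75, 260);
console.log(value.hoursPerYear, value.cost); // 260 hours, 19500 dollars
```

Comparing that figure against the effort to build the orchestration gives the year-over-year justification described above.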
Storage Expand example
The Getting More Out of vRealize Automation day two operations webinar provides a simple example of the power of vRealize Automation when combined with vRealize Orchestrator. Each product addresses different logical functions to form a solution for day two operations: vRealize Automation provides the service catalog and approval mechanisms, while vRealize Orchestrator provides the programmatic functionality and API extensibility.
The example is simple:
- The virtual machines are thin provisioned
- The datastores are sized to support the currently allocated storage (thin) plus 20% overhead
- The datastores are expanded on the array when free space is below 20%
- Failure to expand the datastore results in server outages if the datastore runs out of space (VMs are stunned and suspended until write space is available)
The process is done every morning and any time a datastore drops below 20% free space and raises an alarm.
The simplified process is provided in Diagram 1.2.
The total amount of staff time spent manually doing the task each morning is 60 minutes across two different full-time employees. That does not mean the task completes in 60 minutes; it can take hours depending on other issues facing either team. While you are awaiting completion of the task you are at risk of outages. This problem can resurface many times a day, sucking away staff time and efficiency. Staff become reactive, awaiting failures instead of proactively solving issues. In addition, incorrectly expanding a datastore that includes a raw device mapping can result in data loss on production systems.
The business value for this workflow can be defined as:
- 60 Minutes per day of savings in full time staff – reduction in operating expenses
- Cost of interruption of staff – An increase in productivity
- Risk of outages by failing to expand the datastore – A reduction in risk
- Risk of data loss and outages from incorrectly expanding a datastore – an improvement to productivity and a reduction in operating expenses
The whole process is perfect for orchestration and, eventually, automation. In order to orchestrate we need to understand each step, with its inputs and outputs. Identifying this full process can stall or completely stop automation efforts. Organizations that have been successful with orchestration have chosen to focus on a minimum viable solution and then iterate through the process until it is automated. For this example, we have broken it into three implementation steps:
- Calculate new size and open a ticket for storage team to expand
- Storage team expands the LUN
- Virtualization team expands the datastore to the new size of the LUN
Each of these workflows can later be combined into a single workflow to automate the whole process. Each step should include ITIL processes for change. For this example I skipped demonstrating the storage team expansion since it is vendor specific.
Calculate new size
The process to calculate the new size requires two variables:
- resize_percent – the percentage of free space we expect to have on a datastore at all times
- vmfs_to_expand – a composite array that holds the names and new sizes of the datastores that need to be expanded
These variables feed into our process:
- Scan all datastores
- If any datastore's used space exceeds (100 - resize_percent) percent, then
- Calculate the new size of the LUN, in GB, so that the datastore is 70% used
- Load the new size and name of the datastore into the vmfs_to_expand array
- Potentially add additional information, such as the WWID, into vmfs_to_expand
- Open a ticket for the storage team to expand the LUN
Notice how the first step of this orchestration assumes that the current process of manually expanding the LUN will continue; we are just going to automate submitting the request. This workflow alone saves us 20 minutes every day and avoids potential errors. The code is simple and commented to aid understanding:
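A minimal sketch of the scriptable task's sizing logic is shown below. In vRealize Orchestrator the datastore list would come from the vCenter plug-in (for example, Server.findAllForType("VC:Datastore")); here the datastores are passed in as plain objects so the logic stands on its own. The capacityGB/freeGB field names are assumptions for illustration; the 20% threshold and 70%-used target come from the process above.

```javascript
var resize_percent = 20; // free space we expect on every datastore

// Scan the datastores and build the vmfs_to_expand array for the
// storage team's ticket.
function calculateExpansions(datastores) {
  var vmfs_to_expand = [];
  for (var i = 0; i < datastores.length; i++) {
    var ds = datastores[i];
    var usedGB = ds.capacityGB - ds.freeGB;
    var usedPercent = (usedGB / ds.capacityGB) * 100;
    // Flag any datastore past the (100 - resize_percent) used threshold
    if (usedPercent > 100 - resize_percent) {
      // New size makes the current usage equal 70% of capacity, rounded up
      var newSizeGB = Math.ceil(usedGB / 0.70);
      vmfs_to_expand.push({ name: ds.name, newSizeGB: newSizeGB });
    }
  }
  return vmfs_to_expand;
}

// Example: one datastore at 90% used, one at 50% used
var result = calculateExpansions([
  { name: "ds-prod-01", capacityGB: 1000, freeGB: 100 },
  { name: "ds-prod-02", capacityGB: 1000, freeGB: 500 }
]);
console.log(result); // only ds-prod-01, new size 1286 GB
```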
Once combined with a method to open tickets in your ITSM solution, you have a 20-minute-per-day time saver. It is also modular, allowing you to reuse it and combine it with future workflows.
Expanding the datastore
Expanding the datastore represents the greatest risk in the process. When you expand a datastore, the wizard presents a list of all LUNs that do not currently have VMFS on them. This is not a problem unless you have raw device mappings, which also show up in the expand list. Using one of them for the expand has two effects:
- Data loss on the raw device mapping
- The datastore you are expanding now uses extents to span two LUNs, with potential performance and disaster recovery challenges
Neither of these effects is desirable, but human error makes both very possible. Using vRealize Automation, it is possible to provide operations staff a list of datastores and a workflow to expand them, removing this risk from the process and saving 20 minutes per day. Choosing to expand a datastore whose LUN has not been expanded on the storage array has no effect. The first step is to build a workflow that takes a datastore name and attempts to expand it.
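The guard that makes this workflow safe can be sketched as follows. In vRO the expand itself would go through the vSphere API (HostDatastoreSystem.queryVmfsDatastoreExpandOptions followed by expandVmfsDatastore); this sketch shows only the selection logic that keeps raw device mappings and blank LUNs out of the candidate list. The lun object fields and the rdmWwids input are assumptions for illustration.

```javascript
// Only the LUN that already backs the datastore's VMFS is a safe expand
// target; anything known to be a raw device mapping is excluded outright,
// as is any blank LUN that would otherwise become an unwanted extent.
function safeExpandCandidates(luns, rdmWwids) {
  return luns.filter(function (lun) {
    return lun.hasVmfs && rdmWwids.indexOf(lun.wwid) === -1;
  });
}

var candidates = safeExpandCandidates(
  [
    { wwid: "naa.6001", hasVmfs: true },  // the datastore's own LUN
    { wwid: "naa.6002", hasVmfs: false }, // blank LUN shown by the wizard
    { wwid: "naa.6003", hasVmfs: false }  // raw device mapping
  ],
  ["naa.6003"]
);
console.log(candidates.length); // 1
```

By never offering the RDM or a second LUN as a choice, the workflow makes the dangerous selections impossible rather than merely discouraged.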
To make this user friendly, we need to present a drop-down of datastores inside vRealize Automation for the user to select from. This is done with a vRealize Orchestrator action with the following code:
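A sketch of such an action is below. In vRO the action would declare a return type of Array/string and fetch the inventory with Server.findAllForType("VC:Datastore"); here the inventory is passed in as plain objects so the filtering logic can be read (and run) on its own. The type field and VMFS-only filter are assumptions based on the expand scenario above.

```javascript
// Build the list of names for the vRealize Automation drop-down.
// In vRO:  var datastores = Server.findAllForType("VC:Datastore");
function datastoreNames(inventory) {
  var names = [];
  for (var i = 0; i < inventory.length; i++) {
    // Only VMFS datastores are expandable; skip NFS and others
    if (inventory[i].type === "VMFS") {
      names.push(inventory[i].name);
    }
  }
  return names.sort(); // sorted list for a predictable drop-down
}

var dropdown = datastoreNames([
  { name: "ds-prod-02", type: "VMFS" },
  { name: "nfs-backup", type: "NFS" },
  { name: "ds-prod-01", type: "VMFS" }
]);
console.log(dropdown); // ["ds-prod-01", "ds-prod-02"]
```

Binding the action to the request form's input field is what turns it into the drop-down the operator sees.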
Using the action to provide the drop-down gives you a user-friendly, safe method for expanding a datastore. We have now saved 35 minutes a day between the two orchestrations. (Only 35, because you need 5 minutes to run the workflow.)
Beyond these two workflows
If we can build orchestration to expand the storage LUN, we then have every part of this task orchestrated. The workflows could be combined into an automation that runs every hour, expanding datastores as needed. In order to automate, a number of additional variables need to be considered:
- Does the storage array have enough disk space for the expand?
- Is there a maximum size for your datastores? If yes, what does the automation do in this situation?
- Is there some storage, such as NFS, that cannot be expanded? What should we do with that storage?
- Is there an organizational maximum number of virtual machines that we want on a single LUN? What should we do if that is exceeded?
- Do you have public cloud resources that have to be treated differently?
Each of these questions leads to additional orchestration, which in turn leads to automation. Testing each automation as an orchestration first allows us to vet these challenges. Error checking and input validation in scriptable tasks are critical to solid, well-running automation. Using these techniques, organizations can quickly gain operational cost savings from their cloud management platform across public and private clouds.
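The questions above translate naturally into a pre-flight validation step. The limits and field names below are hypothetical; a real implementation would read them from vRO configuration elements and add the public-cloud checks relevant to your environment.

```javascript
var MAX_DATASTORE_GB = 4096;    // assumed organizational ceiling
var MAX_VMS_PER_DATASTORE = 40; // assumed VM-per-LUN limit

// Collect every reason an automated expand should NOT proceed; an empty
// array means the expansion is safe to hand to the automation.
function validateExpansion(ds, newSizeGB, arrayFreeGB) {
  var errors = [];
  if (ds.type !== "VMFS") {
    errors.push(ds.name + ": not VMFS, cannot be expanded");
  }
  if (newSizeGB > MAX_DATASTORE_GB) {
    errors.push(ds.name + ": new size exceeds maximum datastore size");
  }
  if (newSizeGB - ds.capacityGB > arrayFreeGB) {
    errors.push(ds.name + ": array lacks free space for the expand");
  }
  if (ds.vmCount > MAX_VMS_PER_DATASTORE) {
    errors.push(ds.name + ": too many VMs; rebalance instead of expanding");
  }
  return errors;
}

var errs = validateExpansion(
  { name: "ds-prod-01", type: "VMFS", capacityGB: 1000, vmCount: 12 },
  1286,
  5000
);
console.log(errs.length); // 0, safe to proceed
```

Running these checks while the process is still an orchestration, with a human reviewing the errors, is exactly how the edge cases get vetted before the loop is closed into full automation.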
Call to Action
Hopefully this information gives you some ideas about what is possible with Day 2 Operations using vRealize Automation. Join us on October 5th for the presentation and the Q&A discussion afterward! Click here to register!
Visit www.vmware.com/go/getmore to view the entire Getting More Out of VMware webinar series.