
Tag Archives: automation

Aligned Incentives – and Cool, Meaningful New Jobs! – In the Cloud Era

By: Paul Chapman, VMware Vice President Global Infrastructure and Cloud Operations

Transforming IT service delivery in the cloud era means getting all your technical ducks in a row. But those ducks won’t ever fly if your employees do not have aligned incentives.

Incentives to transform have to be aligned from top to bottom – including service delivery strategy, operating model, organizational construct, and individual job functions. Otherwise, you’ll have people in your organization wanting to work against changes that are vital for success, and in some cases almost willing them to fail.

This can be a significant issue with what I call ‘human middleware.’ It’s that realm of work currently done by skilled employees that is both standard and repeatable: install a database; install an operating system; configure the database; upgrade the operating system; tune the operating system; and so on.

These roles are prime candidates for automation and/or digitization – allowing the same functions to be performed more efficiently, more predictably, and dramatically faster, while giving the IT organization the flexibility it needs to deliver IT as a Service.

Of course, automation also offers people in these roles the chance to move to more meaningful and interesting roles – but therein lies the aligned incentive problem. People who have built their expertise in a particular technology area over an extended period of time are less likely to be incentivized to give that up and transition to doing something ‘different.’

Shifting Roles – A VMware Example

Here’s one example from VMware IT – where building out a complete enterprise SDLC instance for a complex application environment once took 20 people 3-6 weeks.

We saw the opportunity to automate the build process in our private cloud and, indeed, with blueprints, scripting, and automation, what took 20 people 3-6 weeks, now takes 3 people less than 36 hours.

But shifting roles and aligning incentives was also very critical to making this happen.

Here was our perspective: the work of building these environments over and over again was not hugely engaging. Much of it involved coordinating efforts and requesting task work via ticketing systems, but people were also entrenched in their area of expertise and years of gained experience, so they were less inclined to automate their own role in the process. The irony was that in leveraging automation to significantly reduce the human effort and speed up service delivery, we could actually free people up to do more meaningful work – work that in turn would be much more challenging and rewarding for them.

In this case, employees went from doing standard, repeatable tasks to higher-order blueprinting, scripting, and managing and tuning the automation process. In many cases, though, these new roles required new skills that extended the ones they already had. So in order to help them be successful, we made a key decision: we would actively help (in a step-wise, non-threatening, change-management-focused way) the relevant employees grow their skills. And we’d free them up from their current roles to focus on the “future” skills that were going to be required.

Three New Roles

So there’s the bottom line incentive that can shift employees from undermining a transformation to supporting it: you can say, “yes, your role is changing, but we can help you grow into an even more meaningful role.”

And as automation frees people up and a number of formerly central tasks fall away, interesting new roles do emerge – here, for example, are three new jobs that we now have at VMware:

  •  Blueprint Designer – responsible for designing and architecting blueprints for building the next generation of automated or digitized services.
  •  Automation Engineer – responsible for engineering scripts that will automate or digitize business processes and/or IT services.
  •  Services Operations Manager – responsible for applications and tenant operation services in the new cloud-operating model.

The Cloud Era of Opportunity

The reality is that being an IT professional has always been highly dynamic. Of the dozen or so different IT positions that I’ve held in my career, the majority don’t exist anymore. Constant change is the steady state in IT.

Change can be uncomfortable, of course. But given its inevitability, we shouldn’t – and can’t – fight it. We should get in front of the change and engineer the transformation for success. And yet too frequently we don’t – often because we’re incented to want to keep things as they are. Indeed, misaligned incentives remain one of the biggest impediments to accelerating change in IT.

We can, as IT leaders, shift those incentives, and with them an organization’s cultural comfort with regular change. And given the positives that transformation can bring both the organization and its employees, it’s clear that we should do all we can to make that shift happen.

Major Takeaways:

  • Aligning incentives is a key part of any ITaaS transformation
  • Automation will eliminate some roles, but also create more meaningful roles and opportunities for IT professionals
  • Support, coaching, and communication about new opportunities will help accelerate change
  • Defining a change-management strategy that frees employees up and supports them through the transition is critical for success

Follow @VMwareCloudOps and @PaulChapmanVM on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

The Top 10 CloudOps Blogs of 2013

What a year it’s been for the CloudOps team! Since launching the CloudOps blog earlier this year, we’ve published 63 items and have seen a tremendous response from the larger IT and cloud operations community.

Looking back on 2013, we wanted to highlight some of the top performing content and topics from the CloudOps blog this past year:

1. “Workload Assessment for Cloud Migration Part 1: Identifying and Analyzing Your Workloads” by Andy Troup
2. “Automation – The Scripting, Orchestration, and Technology Love Triangle” by Andy Troup
3. “IT Automation Roles Depend on Service Delivery Strategy” by Kurt Milne
4. “Workload Assessment for Cloud Migration, Part 2: Service Portfolio Mapping” by Andy Troup
5. “Tips for Using KPIs to Filter Noise with vCenter Operations Manager” by Michael Steinberg and Pierre Moncassin
6. “Automated Deployment and Testing Big ‘Hairball’ Application Stacks” by Venkat Gopalakrishnan
7. “Rethinking IT for the Cloud, Pt. 1 – Calculating Your Cloud Service Costs” by Khalid Hakim
8. “The Illusion of Unlimited Capacity” by Andy Troup
9. “Transforming IT Services is More Effective with Org Changes” by Kevin Lees
10. “A VMware Perspective on IT as a Service, Part 1: The Journey” by Paul Chapman

As we look forward to 2014, we want to thank you, our readers, for taking the time to follow, share, comment, and react to all of our content. We’ve enjoyed reading your feedback and helping build the conversation around how today’s IT admins can take full advantage of cloud technologies.

From IT automation to patch management to IT-as-a-Service and beyond, we’re looking forward to bringing you even more insights from our VMware CloudOps pros in the New Year. Happy Holidays to all – we’ll see you in 2014!

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

Understanding Process Automation: Lean Manufacturing Lessons Applied to IT

by: Mike Szafranski

With task automation, it is pretty simple to calculate that it is worth taking 2 hours to automate a 10-minute task if you perform that task more than 12 times. Even considering the fixed and variable costs of the automation solution, the math is pretty straightforward.
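To make the break-even point concrete, here is a minimal sketch in Python with the post’s 2-hour / 10-minute figures plugged in (the helper function is purely illustrative, not a tool referenced in the post):

```python
# Break-even sketch for task automation: one-time automation cost vs. recurring savings.
def break_even_runs(automation_cost_min, manual_task_min, automated_task_min=0):
    """Number of task runs after which automating pays for itself."""
    saving_per_run = manual_task_min - automated_task_min
    if saving_per_run <= 0:
        return float("inf")  # automation never pays off
    return automation_cost_min / saving_per_run

# 2 hours (120 minutes) to automate a 10-minute task:
print(break_even_runs(automation_cost_min=120, manual_task_min=10))  # -> 12.0 runs
```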

But the justification for automating more complex processes composed of dozens of ‘10-minute tasks’ completed by different actors – including the inevitable scheduling and wait time between each task – is a bit more complex. Nonetheless, an approach exists.

You can find it laid out in Kim, Behr, and Spafford’s modern classic of business fiction, The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win [IT Revolution Press, 2013], in which the authors show how the principles of lean manufacturing are directly applicable to IT process automation.

So what lessons do we learn when building a case for process automation by applying lean manufacturing principles to IT Ops? Let’s take a look.

Simple Steps Build the Business Case

First, you need to break the process you’re interested in into its constituent parts.

Step 1 – Document Stages in the Process and Elapsed Time. Through interviews, identify major process stages and then document the clock time elapsed for each. Note, use hard data for elapsed time if possible. People involved in the process rarely have an accurate perception of how long things really take. Look at process artifacts such as emails, time stamps on saved documents, configuration files, provisioning, or testing tool log files to measure real elapsed time.

Step 2 – Document Tasks and Actors. Summarize what gets accomplished at each stage and, most importantly, detail all the tasks and record which teams perform them. If a task involves multiple actors working independently with a handoff, that task should be broken down into sub-tasks.

Step 3 – Document FTE Time. Record the work effort required for each task; we’ll call that Full-Time-Equivalent (FTE) time. This is the time it takes to do the actual task work, assuming no interruptions, irregularities, or rework.

Step 4 – Document Wait Time. Understanding wait time is critical to building a case for process automation. If actors are busy, or if there are handoffs between actors, then elapsed time is often multiple times longer than FTE time. This is because at each handoff, the task must sit in queue until a resource is ready to process the task.

After taking these steps, you can summarize your findings in a simple chart or table.
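As a hedged illustration of what that summary can look like (the stage names and per-stage hours below are invented, though the totals are chosen to match the roughly 5.5 FTE hours and ~15% hands-on ratio discussed later in this post), the data might be captured like this:

```python
# Hypothetical process summary: stage, actor team, FTE (hands-on) hours, wait hours.
# Per-stage values are illustrative; the original post presented real data in a chart.
stages = [
    # (stage,                team,       fte_h, wait_h)
    ("Request environment",  "Dev",       0.5,   4.0),
    ("Provision VMs",        "VM team",   1.0,   7.0),
    ("Configure network",    "Network",   1.0,  10.0),
    ("Validate firewall",    "Security",  1.5,   8.0),
    ("Deploy and test app",  "QA",        1.5,   2.2),
]

fte_total = sum(fte for _, _, fte, _ in stages)
elapsed_total = sum(fte + wait for _, _, fte, wait in stages)
print(f"FTE time: {fte_total:.1f} h of {elapsed_total:.1f} h elapsed "
      f"({fte_total / elapsed_total:.0%} hands-on)")
```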

In Lean Manufacturing, the concept of wait time or queue time has a mathematical formula [see chapter 23 of The Phoenix Project]: the wait time for a given resource is proportional to the ratio of the time that resource is busy to the time it is idle – wait time = % busy / % idle.

The formula, of course, offers hard proof of what you already knew – that the busier you are, the longer it takes to get new work done. With multiple actors on a task, each can contribute to wait time, with the amount they contribute depending on how busy they are.
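A quick, hedged illustration of how steeply that busy/idle ratio grows (the utilization values are illustrative only):

```python
# Wait time grows non-linearly with utilization: wait ~ %busy / %idle
# (the queue-time rule of thumb cited from The Phoenix Project).
for busy in (0.50, 0.80, 0.90, 0.95, 0.99):
    idle = 1.0 - busy
    print(f"{busy:.0%} busy -> wait roughly {busy / idle:.0f}x the task's touch time")
# 50% -> ~1x, 80% -> ~4x, 90% -> ~9x, 95% -> ~19x, 99% -> ~99x
```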

In the example below, there are five separate teams (security, network, dev, QA and VM) involved in the Validate Firewall step in the flow. Each team is also busy with other tasks. 

Figure 2. In a manually constructed environment, the network settings, firewall rules, and application ports need to be validated. More often than not, they need to be adjusted due to port conflicts or firewall rules. Wait times correlate strongly with % utilization.

As you can see, the FTE time is 5.5 hours, which is only around 15% of the clock time. Clearly, with complex tasks, FTE time is only a part of the story.

Step 5 – Account for Unplanned Work. Unplanned work occurs when errors are found, requiring a task from an earlier step in the process to be reworked or fixed.

In complex automation, unplanned work is another reality that complicates the process and increases FTE time. It also dramatically impacts clock time – in two ways. First, there’s the direct impact of additional time spent waiting for the handoff back upstream in the process. Second, and even more dramatic, is the opportunity cost. Planned work tasks need to stop while the process actor sets things aside and addresses the unplanned work. Unplanned work can thus have a multiplier effect, causing cascading delays up and down the process flow.

One aim of automation, of course, is to reduce unplanned work – and that reduction can also be calculated, further adding to the business case for process automation. Indeed, studies have shown that, currently, unplanned work consumes 17% of a typical IT budget.

Process Automation Can Offer More Than Cost Reduction

But there’s potentially even more to the story than a complete picture of IT work and detailed accounting of reduced work effort and timesavings. The full impact of process automation can include:

  • Improved throughput
  • Enabling rapid prototyping
  • Higher quality
  • Improved ability to respond to business needs

The cumulative impact of these can be substantial. Indeed, it can easily exceed the total impact of direct cost reductions.

Step 6 – Estimate total benefit to business functions. If calculating the value of reducing FTE time, wait times, and unplanned work is relatively straightforward, figuring the full business impact of reducing overall calendar time for a critical process (from 4 weeks to 36 hours, say) requires more than a direct cost reduction calculation. It’s worth doing, though, because the value derived from better quality, shorter development times, etc., can substantially exceed the value of FTE hours saved through automation (see figure 3).

Figure 3. The secondary impacts of automating processes and increasing agility and consistency can be much larger than the value of the FTE hours saved.

You do it by asking IT customers to detail the benefits they see when processes are improved. There are many IT KPIs that can help here, such as the number of help desk tickets received in a specific period, or the number and length of Severity 1 IT issues.
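As a rough, hypothetical sketch of how such an estimate might be rolled up (none of the figures or categories below come from the post; they are placeholders you would replace with your customers’ own KPI-based estimates):

```python
# Hypothetical business-case roll-up: direct automation savings plus
# customer-estimated benefits from improved KPIs. All figures are made up.
HOURLY_RATE = 75  # assumed blended cost per FTE hour

direct_savings = {
    "FTE hours saved per year":       1200 * HOURLY_RATE,
    "Unplanned rework hours avoided":  400 * HOURLY_RATE,
}
business_benefits = {  # estimated with IT customers from KPI improvements
    "Fewer Severity 1 incidents":     25_000,
    "Earlier feature delivery":       60_000,
}

total = sum(direct_savings.values()) + sum(business_benefits.values())
print(f"Direct savings:    ${sum(direct_savings.values()):,}")
print(f"Business benefits: ${sum(business_benefits.values()):,}")
print(f"Estimated total:   ${total:,}")
```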

We used this method at VMware when we automated dev/test provisioning and improved the efficiency of 600 developers by 20%. We achieved a direct cost reduction related to time and effort saved. But we found an even bigger impact, even if it was harder to quantify, in improved throughput, in always being able to say, “Yes” to business requests, and in enabling rapid prototyping.

Lessons Learned

With these steps, you can capture major process stages, tasks, actors, calendar time, work effort, and points of unplanned work, quantifying the business value of automating a process end-to-end – and making your case for end-to-end process automation all the stronger.

Key takeaways:

  • It’s possible to make a business case for automating end-to-end IT processes;
  • You can do this by applying concepts from lean manufacturing;
  • The concepts of wait time and unplanned work are central;
  • Efficiency-driven cost reduction is only part of the equation, however;
  • To quantify the full value of agility, work with IT customers to gauge improvements in KPIs that reflect improved business outcomes.

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

Task Automation Vs. Process Automation – Highlights from #CloudOpsChat

After a successful automation-themed #CloudOpsChat in September, we decided to take a deeper dive into automation for this month’s edition, discussing “Task Automation Vs. Process Automation.” Thanks to everyone who participated, and thank you especially to Rich Pleasants (@CloudOpsVoice), Business Solutions Architect and Operations Lead for Accelerate Advisory Services at VMware, for co-hosting!

To begin the chat, we asked: “What IT tasks or processes has your company successfully automated?”

@Andrea_Mauro jumped right in, asking how process automation compares to task automation. @kurtmilne offered VMware’s take, saying “VMware IT has fully automated provisioning of complex workloads on private cloud,” and clarified that the most complex workloads were “Oracle ERP with web portals, and over 80 blueprints.” @venkatgvm also elaborated on VMware’s automation story: “VMware instance provisioning had over 20 major steps, each of them were executed by siloed teams.”

Co-host @CloudOpsVoice took the question further, asking, “Are people automating day to day maintenance activities or actual steps in the process?”

@vHamburger gave his advice on where to begin with automation, saying “[day-to-day automation] is a good starting point. Nominate your top 10 time-consuming tasks for automation.” @Andrea_Mauro replied, suggesting that “task automation is more for repeatable operations and day by day [tasks].” He followed up by offering a definition of process automation: “Process automation could be more related to organization level and blueprint usage.” @kurtmilne also chimed in with business-related definitions of task and process automation: “Task automation math includes cost/time of single task vs. developing automation capability…Process automation math includes business benefit of overall improved agility, service quality – as well as cost.” @CloudOpsVoice broke his definition of automation into three parts: “day-to-day, build and run.”

@CloudOpsVoice next asked, “What technique do you use primarily for automation? Policy, orchestration or scripting? How do App blueprints impact it?”

@kurtmilne noted the value of blueprints and scripting: “Blueprints and scripting allow app provisioning automation – not just VM provisioning.” @thinkingvirtual also offered sound advice on how to select what to automate at your company: “Always make sure your automation efforts provide real value. Don’t automate for automation’s sake.” Elaborating on this, @kurtmilne discussed the value of automation, stating that automation’s “real value” is “ideally measured in business outcomes, and not IT efficiency.” @vHamburger also warned against bottlenecks preventing automation: “every enhancement after your bottleneck is not efficient – know your bottlenecks!”

@vHamburger went on to mention task workflow: “Clean task workflow with documented steps is always preferred over scripts,” he suggested, because it’s “easier and repeatable for new admins.” @Andrea_Mauro countered by saying that sometimes a “‘quick and dirty’ solution could be good enough,” to which @vHamburger replied, “In my experience ‘quick and dirty’ always leads to fire fighting ;).” @kurtmilne then vouched for “leaning out” a process: “‘Leaning out’ an IT process is good. But sometimes it’s better to use automation to eliminate tasks vs. automate tasks,” he wrote. @thinkingvirtual also noted how important communication is to successful automation: “Often forgotten: keep your business in the loop. Show back the value continuously to broaden the relationship.”

@AngeloLuciani kept things moving by asking, “Do you pick a tool to fit the process or a process to fit the tool?”

@JonathanFrappier enthusiastically went with the latter: “Process to fit the tool! Processes can change, tools have to live on until more budget is approved!” @kurtmilne added, “Tool/process construct doesn’t make sense with full automation. You can do things with automation you can’t do with manual tasks: For example, you don’t figure out manual horizontal scaling process in cloud – then look for tool to automate.”

#CloudOpsChat ended with one last great tip (and a nod to VMworld!) from @thinkingvirtual: “Automation skills are a huge career opportunity. Don’t avoid automation, defy convention.”

Thanks again to everybody who participated in this latest #CloudOpsChat, and stay tuned for details of our next meet up. If you have suggestions for future #CloudOpsChat topics, let us know in the comments.

For more resources on automation, browse the earlier automation posts on the CloudOps blog.

In the meantime, feel free to tweet us at @VMwareCloudOps with questions or feedback, and join the conversation by using the #CloudOps and #SDDC hashtags. For more from Rich Pleasants, head over to the VMware Accelerate blog.

The Critical Element of Service Delivery in the Cloud Era: Join our Webcast 11/14

As more companies aim to build the software-defined datacenter (SDDC), the importance of service definition continues to grow. Running a successful SDDC strategy means understanding service offerings, for sure. But it’s also about standardizing those offerings to achieve agility and efficiency. So where do you start? How do you know what services your company can best provide?

Join Product Manager Jason Holmberg and Business Solutions Architect Rohan Kalra on Thursday, November 14th  at 10am PT for their BrightTalk webcast: The Critical Element of Service Delivery in the Cloud Era. The webcast will take you through the four fundamental Service Catalog building blocks:

  • Automation
  • Governance & Policies
  • Provisioning and orchestration
  • Lifecycle management

Both Jason and Rohan have years of experience building and implementing service catalogs. In addition to defining these building blocks, Jason and Rohan will dive into the requirements for each component, making it easier for you to implement a service catalog and making sure that you’re delivering the best services to your users through your catalog. Don’t miss this webcast to learn how service definition will be the key to your SDDC.

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

Task Automation vs. Process Automation: Join Us For #CloudOpsChat 11/13!

Here at VMware, we’re always talking about automation: Venkat Gopalakrishnan detailed his success after automating the provisioning of business-critical application stacks, and Paul Chapman introduced VMware’s IT transformation story by highlighting the importance of automation and change management.

We saw some fantastic insight on automation from many of you during our last #CloudOpsChat and wanted to continue the conversation. For this month’s #CloudOpsChat, we’re specifically focusing on next steps in automation by asking the following questions: What has your company successfully automated? Do you focus on task automation, process automation, or both?

Join us on Wednesday, November 13th at 11am PT to discuss task vs. process automation with your CloudOps peers. Hosting the chat is CloudOps expert Rich Pleasants, Business Solutions Architect and Operations Lead for Accelerate Advisory Services at VMware.

During the chat, we’ll discuss:

  • What business tasks or processes has your company successfully automated?
  • When discussing automation, how do you determine whether you should automate a task vs. a process?
  • Do you have people in your company whose primary role is automation?
  • What technique do you use primarily for automation? Policy, orchestration or scripting?
  • How does the blueprint concept impact your automation workflow (scripting, orchestration)?

Here’s how to participate in #CloudOpsChat:

  • Follow the #CloudOpsChat hashtag (via Twubs.com, Tchat.io, TweetDeck, or another Twitter client) and watch the real-time stream.
  • On Wednesday, November 13th at 11am PST, @VMwareCloudOps will pose a few questions using the #CloudOpsChat hashtag to get the conversation rolling.
  • Tag your tweets with the #CloudOpsChat hashtag. @reply other participants and react to their questions, comments, thoughts via #CloudOpsChat. Engage with each other!
  • #CloudOpsChat should last about an hour.

In the meantime, RSVP to the event and feel free to tweet at us at @VMwareCloudOps with any questions you may have! For even more on automation, check out Rich Pleasants’ latest blog post for VMware Accelerate where he discusses “Intelligent Automation.”

We look forward to seeing you in the stream!

To Automate or Not to Automate? – Highlights from #CloudOpsChat

Last week, we held another successful #CloudOpsChat, this time asking: “To Automate or Not to Automate?” Thank you to everyone who participated in the lively conversation, and especially to our two co-hosts, Cloud Operations Architects Andy Troup (@HarrowAndy) and David Crane (@DaveJCrane)!

To start things off, we asked, “How do you define automation?”

Our co-hosts jumped in first, with @HarrowAndy stating, “automation = stop doing repeatable tasks,” and @DaveJCrane remarking on how he asked the same question during a group discussion at VMworld and received 50 different answers from 50 different people in the room! In addition, the notion that automation implies the removal of manual work was a prominent theme, with @Seemaj, @AngeloLuciani, @tcrawford and @KongYang agreeing that automation means less, or no, human intervention (see Pierre Moncassin’s take on that here).

Next, the conversation moved on to the importance of defining automation within the context of your business.

@DaveJCrane began by adding a layer to the definition, suggesting that it is “important to consider the definition of automation in context of the business environment, not just process focus.” @tcrawford agreed with David, specifying a difference between the what/why of automation, as well as the how/when. @HarrowAndy built upon @tcrawford’s response, adding that there must always be a benefit to what you’re automating, and that there is “no point automating something you only do infrequently.”

@Seemaj then brought up the cost of automation, agreeing with @tcrawford that: “There is a cost to automation, and the business drives those decisions.”

@AngeloLuciani stated that “automation drives business value,” and @tcrawford stirred the pot, replying “sometimes it can, not always.” @HarrowAndy then brought up the importance of weighing automation’s benefits with its costs, with @KongYang, @AngeloLuciani and @Seemaj adding that two of the biggest benefits to automation are limiting human mistakes and delivering services faster. @Gnowell1 emphasized automation’s goal of promoting reliable service delivery, saying “time consuming, complex tasks should also be considered for automation.”

After that, @KalraRohan asked, “What’s driving everyone to move towards automation?”

@VmwDavidH immediately offered VMware’s use case for automation: “For us, we have cut our dev environment provisioning time down from weeks to hours.” @Seemaj noted business agility as her main reason, with @AngeloLuciani saying that automation is a “building block” towards the software-defined datacenter (SDDC). @DaveJCrane agreed, adding that “[automation] is always good to implement as part of a larger ops transformation.”

@KalraRohan then asked, “What are the operational impacts of automation? What are best practices?”

@HarrowAndy, @VmwDavidH and @AngeloLuciani all agreed that a set of orchestration tools was essential in driving the success of automation. @GNowell1 suggested a key benefit that automation provides to a business: “SDDC automation promotes Ops standards. Administrators spend more time on higher level responsibilities.” And @DaveJCrane elaborated further on automation’s ability to shift ops’ focus: “Automating allows you to put more emphasis on the workflow/approval process.”

To close out this #CloudOpsChat, @HarrowAndy asked: “So what have you all automated? Is it just provisioning activities, or are there other things?”

@AngeloLuciani and @Gnowell1 had both started with provisioning and said they were looking for the next step in automation. @CloudOpsVoice stated that provisioning was a great start and great use-case for the ‘run’ side of automation, with @tcrawford adding “iterative automation is all about value.” He continued by saying that knowing what to automate next comes with “experience, and asking questions.” @Seemaj agreed, and emphasized that automation touches all aspects of a company: “Automation is not just about provisioning/tools/scripts…and it does not always have measurable outcomes. Sometimes benefits are soft benefits, e.g. improved user experience.”

Our #CloudOpsChat wrapped up with a positive outlook on the future of automation, with @AngeloLuciani tweeting “automation will be a major skill for next gen IT staff.” As automation progresses, companies will experience “less firefighting in operations and more time spent on working with the business,” suggested co-host @HarrowAndy.

Thanks again to everybody who participated in this latest #CloudOpsChat, and stay tuned for details of our next #CloudOpsChat!

In the meantime, feel free to tweet us at @VMwareCloudOps with questions or feedback, and join the conversation by using the #CloudOps and #SDDC hashtags.

To Automate, or Not to Automate? Join Us For #CloudOpsChat 9/18!

We talk about automation regularly here on the CloudOps blog – Kurt Milne looked into the economics of task and service automation, Andy Troup broke down the automation Scripting, Orchestration and Technology Love Triangle, and, more recently, Pierre Moncassin discussed how automation doesn’t always mean removing the human element from your workflows. Automation, of course,  also continues to be a hot topic for our customers.

For our next #CloudOpsChat on Wednesday, September 18th at 11am PST, we’d like to invite our CloudOps audience to keep the conversation going by discussing “To Automate, or Not to Automate?” – considerations and best practices for IT admins looking to take advantage of the benefits of automation in a smart, effective way.

Co-hosting the chat will be our very own CloudOps bloggers Andy Troup and David Crane, both Cloud Operations Architects at VMware.

During the chat, we’ll discuss:

  • What approach have you taken to identify suitable processes to automate?
  • What process areas have you started automating?
  • What technique is primarily used for your automation? Policy, orchestration or scripting?
  • What makes a process a good candidate for automation?
  • What challenges has your organization faced when approaching automation?
  • What business processes have you successfully automated in your organization?

Here’s how to participate in #CloudOpsChat:

  • Follow the #CloudOpsChat hashtag (via Twubs.com, Tchat.io, TweetDeck, or another Twitter client) and watch the real-time stream.
  • On Wednesday, September 18th at 11am PST, @VMwareCloudOps will pose a few questions using the #CloudOpsChat hashtag to get the conversation rolling.
  • Tag your tweets with the #CloudOpsChat hashtag. @reply other participants and react to their questions, comments, thoughts via #CloudOpsChat. Engage with each other!
  • #CloudOpsChat should last about an hour.

In the meantime, RSVP to our #CloudOpsChat and feel free to tweet at us at @VMwareCloudOps with any questions you may have. We look forward to seeing you in the stream!

The Paradox of Re-startable Workflows: A More Efficient, Automated Process Does Not Always Mean Removing the Human Element

By: Pierre Moncassin

A chance conversation with a retired airline captain first brought home to me the paradox of automation. It goes something like this: Never assume that complete automation means removing the human element.

The veteran pilot was adamant that a commercial aircraft could be landed safely with the autopilot – but, he explained, contrary to what some people believe, that does not mean the human pilot can just push a button and sleep through the landing. Instead, it means that the autopilot handles the predictable, routine elements of the landing while the pilot plays the vital role of supervising the maneuver and reacting to any unforeseen situations.

We’ve seen a similar paradox at play in workflow automation situations faced by some of our enterprise customers. Here’s a typical scenario: A customer has deployed an automated provisioning workflow using VCO along with vCD. They have relied on VCO scripting to automate the provisioning steps so that end users can provision infrastructure just by “pushing a button.” As with the aircraft autopilot (though hopefully less life-threatening), the automated workflows work well until an unexpected situation occurs – there’s an error in the infrastructure, a component with a key dependency changes, or the key dependency itself changes.

This often means a failed workflow, and sometimes an error message that the end user struggles to interpret. After a couple of “failed workflow” experiences, the end user is quickly discouraged, user satisfaction plummets and…  need I say more?

Well, this is not what automation is supposed to be all about – we want maximum user satisfaction. The missing element here is an error recovery mechanism, one that very often involves human intervention. So how does that work?

One approach, in terms of VCO workflows, is to build error handling into the workflows themselves. It is not possible to predict all error situations, of course, but it is possible to detect them and issue an error message to an administrator; this at least enables the interception of the condition, which may be simple to fix.
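As a tool-agnostic sketch of that pattern (written in Python for readability rather than in VCO’s workflow scripting, with notify_admin as a hypothetical helper), the idea is simply to catch the failure, capture its context, and hand it to a human:

```python
# Minimal error-interception pattern for a provisioning step (illustrative only).
import logging

def notify_admin(step_name, error):
    # Placeholder: in practice this would raise a ticket or email the admin on duty.
    logging.error("Step '%s' failed and needs attention: %s", step_name, error)

def run_step(step_name, action):
    try:
        return action()
    except Exception as exc:          # we cannot predict every failure mode...
        notify_admin(step_name, exc)  # ...but we can intercept it for a human to fix
        raise
```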

A second and more advanced part of the solution is to build modular scripts – that way you are fixing the problems once only and, of course, making your scripts more robust and repeatable over time.

The third part of the solution is to build re-startable workflows. This essentially means giving an administrator or process owner the ability to undo steps at any point in the flow. In the case of a straightforward VM provisioning workflow, the solution might be as simple as removing the VM and automatically restarting the workflow from the beginning.

Or, it could be more complex – perhaps your resources have run out (maybe additional storage needs provisioning), or an issue arises with network settings. In these cases, you may need to troubleshoot before the workflow can re-start. But the point remains the same: A re-startable workflow gives your end users the best chance to complete their original request, rather than stare at an error message.

With error detection, you can roll back to the initial state and flag the error. Once the error is resolved, the administrator can either resume or restart from that known point with a known configuration – or, at worst, with no less knowledge than you had before.
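Here is a minimal sketch of what such a re-startable flow could look like (again generic Python rather than VCO scripting, with hypothetical do/undo functions per step): each completed step is checkpointed, a failure rolls the current step back to a known state, and an administrator can resume from the checkpoint once the underlying issue is fixed.

```python
# Re-startable workflow sketch: checkpoint completed steps, roll back the failed
# step, and let an administrator resume from the last good checkpoint.
def run_workflow(steps, checkpoint=0):
    """steps: list of (name, do_fn, undo_fn) tuples. Returns the next checkpoint."""
    for i, (name, do_fn, undo_fn) in enumerate(steps):
        if i < checkpoint:
            continue                 # already completed on a previous run
        try:
            do_fn()
        except Exception as exc:
            undo_fn()                # return this step to a known state
            print(f"Step '{name}' failed and was rolled back: {exc}")
            return i                 # the admin can resume from here once it is fixed
    return len(steps)                # workflow completed
```

An administrator who resolves the underlying problem (an exhausted storage pool, say) can then re-run the workflow with the returned checkpoint to pick the request up where it left off, without the end user ever seeing the failure.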

Crucially, all the error and exception handling is hidden from the user. That allows the request to complete (or to at least have a better chance of completing) – making for a much better experience for the end user.

It is up to the script designers to decide how much of the error they want to share with the end users – a decision that should be made with the administrator responsible for overseeing the process and responding to exceptions. The goal, though, is to keep end users happy and blissfully unaware of error situations as long as their request is satisfied!

To reiterate my original point: Despite the apparent automaticity of these resolutions, they will have been the result of human intervention along the way.

Finally, as a further step towards optimum organization, I recommend looking at the broader picture of governance around the cloud-related processes. How does the resolution team interact with the Service Desk, for example? Are there policies about when to re-provision instead of repair? Is there a specific organization to manage the cloud-based services? See our whitepaper “Organizing for the Cloud” for an introduction to optimizing the whole IT organization to leverage a cloud infrastructure.  But I digress…

In summary – if you are worried that workflow failures may impact your end users:

  • Build resilience in your VCO workflows and related scripts
  • Build in mechanisms to facilitate human resolution for unpredictable situations
  • Create re-startable VCO workflows
  • Identify a process owner who has responsibility and accountability for managing exceptions and errors

Thank you to my colleague David Burgess, who helped me formulate several of the key ideas in this post.

For more, browse our blog for some of our previous posts on automation, and join our upcoming automation #CloudOpsChat on 9/18 with Andy Troup and David Crane!

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

Clouds are Like Babies

By: Kurt Milne

While preparing for the Datamation Google+ hangout about hybrid cloud with Andi Mann and David Linthicum that took place last week, I referred to Seema Jethani’s great presentation from DevOps Days in Mountain View.

Her presentation theme, “Clouds are Like Babies,” was brilliant: Each cloud is a little different, does things its own way, speaks its own language and of course, brings joy. Sometimes, however, clouds can also be hard to work with.

Her great examples got me thinking about where we’re at as an industry with respect to adopting hybrid cloud, and the challenges related to interoperability and multi-cloud environments.

My guess is that we will work through security concerns, and that customers with checkbooks will force vendors to address technical interoperability issues. But then we will realize that there are operational interoperability challenges as well. Even where cloud service providers adopt the AWS API set, tactical nuances make it difficult to maintain a single runbook for cloud tasks across platforms.

From her presentation:

  • Cloudsigma requires the server to be stopped before making an image
  • Terremark requires the server to be stopped for a volume to be attached
  • CloudCentral requires the volume attached to the server in order to make a snapshot

The availability of various functions common in a standard virtualized environment varies widely across cloud service providers – such as pausing a server, creating a snapshot, or creating a load balancer.

We don’t even have a common lexicon: what AWS calls a “Machine Image,” VMware calls a “Template vApp,” OpenStack calls an “Image,” and CloudStack calls a “Template.”
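As a tiny, hypothetical illustration of the point, any multi-cloud runbook or tool ends up carrying a translation table just to name the same artifact consistently:

```python
# What one platform calls a "machine image" goes by a different name elsewhere.
IMAGE_TERM = {
    "AWS":        "Machine Image",
    "VMware":     "Template vApp",
    "OpenStack":  "Image",
    "CloudStack": "Template",
}

def image_term(platform):
    return IMAGE_TERM.get(platform, "unknown - check the provider's docs")

print(image_term("OpenStack"))  # -> "Image"
```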

So in an Ops meeting, if you use an OpenStack-based public cloud and a private cloud based on CloudStack, and you say “we provision using templates, not images,” and someone from another team agrees that they do that too, how do you know if they know that you are talking about different things? It confuses me even writing the sentence.

I led a panel discussion on “automated provisioning” at DevOps Days. Due to templates/images/blueprint terminology confusion, we ended up using the terms “baked” (as in baked bread) to refer to provisioning from a single monolithic instance, and “fried” (as in stir-fried vegetables) to refer to building a release from multiple smaller components, assembled before provisioning – just to discuss automation!

Bottom line: Why not avoid all the multi-cloud hybrid-cloud interoperability and ops mishmash and use the vCloud Hybrid Service for your public cloud extension of VMware implementation?

Don’t miss my sessions at VMworld this year:

  • “Moving Enterprise Application Dev/Test to VMware’s internal Private Cloud” with Venkat Gopalakrishnan
  • “VMware Customer Journey – Where Are We with ITaaS and Ops Transformation in the Cloud Era?” with Mike Hulme

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps, #SDDC, and #VMworld hashtags on Twitter.