
Monthly Archives: December 2013

The Top 10 CloudOps Blogs of 2013

What a year it’s been for the CloudOps team! Since launching the CloudOps blog earlier this year, we’ve published 63 posts and seen a tremendous response from the broader IT and cloud operations community.

Looking back on 2013, we wanted to highlight some of the top-performing content and topics from the CloudOps blog this past year:

1. “Workload Assessment for Cloud Migration Part 1: Identifying and Analyzing Your Workloads” by Andy Troup
2. “Automation – The Scripting, Orchestration, and Technology Love Triangle” by Andy Troup
3. “IT Automation Roles Depend on Service Delivery Strategy” by Kurt Milne
4. “Workload Assessment for Cloud Migration, Part 2: Service Portfolio Mapping” by Andy Troup
5. “Tips for Using KPIs to Filter Noise with vCenter Operations Manager” by Michael Steinberg and Pierre Moncassin
6. “Automated Deployment and Testing Big ‘Hairball’ Application Stacks” by Venkat Gopalakrishnan
7. “Rethinking IT for the Cloud, Pt. 1 – Calculating Your Cloud Service Costs” by Khalid Hakim
8. “The Illusion of Unlimited Capacity” by Andy Troup
9. “Transforming IT Services is More Effective with Org Changes” by Kevin Lees
10. “A VMware Perspective on IT as a Service, Part 1: The Journey” by Paul Chapman

As we look forward to 2014, we want to thank you, our readers, for taking the time to follow, share, comment, and react to all of our content. We’ve enjoyed reading your feedback and helping build the conversation around how today’s IT admins can take full advantage of cloud technologies.

From IT automation to patch management to IT-as-a-Service and beyond, we’re looking forward to bringing you even more insights from our VMware CloudOps pros in the New Year. Happy Holidays to all – we’ll see you in 2014!

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

The Case for Upstream Remediation: The Third Pillar of Effective Patch Management for Cloud Computing

By: Pierre Moncassin

Patch Management fulfills an essential function in IT operations: it keeps your multiple software layers up to date, as free of vulnerabilities as possible, and consistent with vendor guidelines.

But scale that to an ever-changing environment like a VMware-based cloud infrastructure, and you have an extra challenge on your hands. Not only do the patches keep coming, but end users keep provisioning and amending their configurations. So how do you keep track of all these layers of software?

In my experience, there are three pillars that need to come together to support effective patch management in the Cloud. The first two, policy and automation, are fairly well established. But I want to make a case for a third: upstream remediation.

As a starting point, you need a solid patching policy. This may sound obvious, but the devil is in the details. Such a policy needs to be defined and agreed upon across a broad spectrum of stakeholders, starting with the security team. It is typically more of a technical document than a high-level security policy, and it’s far more detailed than, say, a simple rule of thumb (e.g. ‘you must apply the latest patch within X days’).

A well-written policy must account for details such as exceptions (e.g. how to remedy non-compliant configurations); security tiers (which may have different patching requirements); reporting; scheduling of patch deployment; and more.
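To make this concrete, here is a minimal sketch (in Python) of how such a policy might be captured as structured data that automation tooling can consume. Every field name, tier, and deadline below is an illustrative assumption, not a VMware-prescribed schema.

```python
# Hypothetical patching policy as structured data; all names and
# values are illustrative, not a VMware schema.
PATCH_POLICY = {
    "security_tiers": {
        # Stricter tiers get shorter remediation deadlines.
        "dmz":      {"max_days_to_patch": 7,  "window": "Sat 02:00-06:00"},
        "internal": {"max_days_to_patch": 30, "window": "Sun 01:00-05:00"},
    },
    "exceptions": {
        # Non-compliant configurations need a waiver with an expiry.
        "waiver_approver": "security-team",
        "max_waiver_days": 90,
    },
    "reporting": {"frequency": "weekly", "audience": ["security", "ops"]},
}

def days_allowed(tier: str) -> int:
    """Patching deadline (in days) for a given security tier."""
    return PATCH_POLICY["security_tiers"][tier]["max_days_to_patch"]

print(days_allowed("dmz"))  # -> 7
```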

The second pillar is Automation for Patch Management. The need for patching is clearly not specific to cloud infrastructure, but the importance of automating it is magnified in an environment where configurations evolve rapidly. And such automation would make little sense without a well-defined policy behind it. For this, you can use a tool like VMware’s vCenter Configuration Manager (VCM).

VCM handles three key aspects of patching automation:

  1. Reporting – i.e. verifying patch levels on selected groups of machines
  2. Checking for bulletin updates on vendor sites (e.g. Microsoft)
  3. Applying patches via automated installation

In a nutshell, VCM will automate both the detection and remediation of most patching issues.

However, one other key step is easily overlooked: upstream remediation. In a cloud infrastructure, we want to remediate not just the ‘live’ configurations, but also the templates used for provisioning. This ensures that future configurations are provisioned in a compliant state. Before the ‘cloud’ era, administrators who identified a patching issue might make a note to update their standard builds in the near future, but there was rarely any critical urgency. In cloud environments, where new machines might be provisioned, say, every few seconds, such updates need to happen much faster.

As part of completing any remediation, be sure to initiate a procedure that updates your blueprints as well as your live workloads (see the simplified process view above).

Remember, though, that remediating the images depends on different criteria than remediating the ‘live’ workloads and, depending on the risk, may require a change request and related approval. You need to update the images, test that the updates work, and then close out the change request.
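To make the flow concrete, here is a hedged sketch in Python. The data model, patch levels, and helpers are invented stand-ins for whatever your tooling provides; none of this is VCM or vSphere API.

```python
# Hedged sketch of downstream + upstream remediation. The Image model
# and patch levels are invented, not VCM or vSphere APIs.
from dataclasses import dataclass

REQUIRED_LEVEL = 42  # illustrative target patch level

@dataclass
class Image:
    name: str
    patch_level: int
    is_template: bool  # True for provisioning blueprints

def remediate(inventory):
    for image in inventory:
        if image.patch_level >= REQUIRED_LEVEL:
            continue  # already compliant
        if image.is_template:
            # Upstream: templates feed all future provisioning, so gate
            # the update behind a change request, test, then close it.
            print(f"change request raised for template {image.name}")
        image.patch_level = REQUIRED_LEVEL
        print(f"{image.name} patched to level {REQUIRED_LEVEL}")

remediate([
    Image("web-vm-01", 40, is_template=False),    # live workload
    Image("web-blueprint", 38, is_template=True), # provisioning template
])
```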

In sum, this approach reflects a consistent theme across Cloud Operations processes: the focus of activity shifts upstream, towards the demand side. Patch Management is no exception: remediation needs to extend upstream to the provisioning blueprints (i.e. the images).

Key takeaways:

  • Policy and automation are two well-understood pillars of patch management;
  • A less well-recognized third pillar is upstream remediation;
  • Upstream remediation addresses the compliance and quality of future configurations;
  • This reflects a common theme in Cloud Ops processes: that focus shifts to the demand side.

Follow @VMwareCloudOps and @Moncassin on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

Understanding Process Automation: Lean Manufacturing Lessons Applied to IT

By: Mike Szafranski

With task automation, it is pretty simple to calculate that it is worth taking 2 hours to automate a 10-minute task if you perform that task more than 12 times. Even considering the fixed and variable costs of the automation solution, the math is pretty straightforward.
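As a quick sanity check of that arithmetic, here is the break-even calculation in a few lines of Python (the two-hour and ten-minute figures come straight from the example above):

```python
# Break-even point for task automation: the build cost is recovered
# once runs * minutes_saved_per_run exceeds the automation effort.
automation_cost_min = 2 * 60   # 2 hours to build the automation
task_cost_min = 10             # 10 minutes per manual run

break_even_runs = automation_cost_min / task_cost_min
print(break_even_runs)  # -> 12.0, i.e. worthwhile beyond 12 executions
```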

But the justification for automating more complex processes composed of dozens of ‘10-minute tasks’ completed by different actors, including the inevitable scheduling and wait time between each task, is a bit more complex. Nonetheless, an approach exists.

You can find it laid out in Kim, Behr, and Spafford’s modern classic of business fiction, The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win [IT Revolution Press, 2013], in which the authors show how the principles of lean manufacturing are directly applicable to IT process automation.

So what lessons do we learn when building a case for process automation by applying lean manufacturing principles to IT Ops? Let’s take a look.

Simple Steps Build the Business Case

First, you need to break the process you’re interested in into its constituent parts.

Step 1 – Document Stages in the Process and Elapsed Time. Through interviews, identify the major process stages and then document the clock time elapsed for each. Note: use hard data for elapsed time wherever possible. People involved in a process rarely have an accurate perception of how long things really take. Look at process artifacts such as emails, time stamps on saved documents, configuration files, and provisioning or testing tool log files to measure real elapsed time.

Step 2 – Document Tasks and Actors. Summarize what gets accomplished at each stage and, most importantly, detail all the tasks and record which teams perform them. If a task involves multiple actors working independently with a handoff, that task should be broken down into sub-tasks.

Step 3 – Document FTE Time. Record the work effort required for each task; we’ll call that the full-time equivalent (FTE) time. This is the time it takes to do the actual task work, assuming no interruptions, irregularities, or rework.

Step 4 – Document Wait Time. Understanding wait time is critical to building a case for process automation. If actors are busy, or if there are handoffs between actors, then elapsed time is often multiple times longer than FTE time. This is because at each handoff, the task must sit in queue until a resource is ready to process the task.

After taking these steps, you can summarize the results in a chart similar to the one sketched below.
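For illustration, such a summary might look like the following sketch in code form. All stage names and numbers are invented, chosen only to echo the 5.5-hour, roughly 15% example discussed below.

```python
# Invented example per Steps 1-4: hands-on (FTE) time vs. elapsed
# clock time for each stage of a provisioning process.
stages = [
    # (stage,              actor,      fte_hours, elapsed_hours)
    ("Request VM",         "dev",      0.5,        3.5),
    ("Validate firewall",  "security", 1.0,       12.0),
    ("Configure network",  "network",  2.0,       13.0),
    ("Deploy and test",    "QA",       2.0,        8.0),
]

fte = sum(s[2] for s in stages)
elapsed = sum(s[3] for s in stages)
print(f"FTE {fte}h vs. elapsed {elapsed}h ({fte / elapsed:.0%} hands-on)")
# -> FTE 5.5h vs. elapsed 36.5h (15% hands-on)
```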

In Lean Manufacturing, the concept of wait time or queue time has a mathematical formula [see chapter 23 of The Phoenix Project]. The definition is:

wait time = % of time busy / % of time idle

The formula, of course, offers hard proof of what you already knew – that the busier you are, the longer it takes to get new work done. With multiple actors on a task, each can contribute to wait time, with the amount they contribute depending on how busy they are.
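To see how quickly waits compound, here is a small illustration of that busy/idle ratio (the utilization figures are made up):

```python
# Wait time per The Phoenix Project (ch. 23): relative queue wait
# grows with the busy/idle ratio. Utilizations below are made up.
def wait_factor(utilization):
    """Relative wait for a resource busy this fraction of the time."""
    return utilization / (1.0 - utilization)

for u in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{u:.0%} busy -> wait factor {wait_factor(u):5.1f}")
# 50% busy gives 1.0, but 95% busy gives 19.0; with five sequential
# handoffs, each busy team's queue adds its own multiple of the wait.
```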

In the example below, there are five separate teams (security, network, dev, QA and VM) involved in the Validate Firewall step in the flow. Each team is also busy with other tasks. 

Figure 2. In a manually constructed environment, the network settings, firewall rules, and application ports need to be validated. More often than not, they need to be adjusted due to port conflicts or mismatched firewall rules. Wait times correlate strongly with % utilization.

As you can see, FTE time totals 5.5 hours, which is only around 15% of the clock time. Clearly, with complex processes, FTE time is only part of the story.

Step 5 – Account for Unplanned Work. Unplanned work occurs when errors are found, requiring a task from an earlier step in the process to be reworked or fixed.

In complex automation, unplanned work is another reality that complicates the process and increases FTE time. It also dramatically impacts clock time – in two ways. First, there’s the direct impact of additional time spent waiting for the handoff back upstream in the process. Second, and even more dramatic, is the opportunity cost. Planned work tasks need to stop while the process actor sets things aside and addresses the unplanned work. Unplanned work can thus have a multiplier effect, causing cascading delays up and down the process flow.

One aim of automation, of course, is to reduce unplanned work, and that reduction can also be calculated, further strengthening the business case for process automation. Indeed, studies have shown that unplanned work currently consumes 17% of a typical IT budget.

Process Automation Can Offer More Than Cost Reduction

But there’s potentially even more to the story than a complete picture of IT work and a detailed accounting of reduced work effort and time savings. The full impact of process automation can include:

  • Improved throughput
  • Enabling rapid prototyping
  • Higher quality
  • Improved ability to respond to business needs

The cumulative impact of these can be substantial. Indeed, it can easily exceed the total impact of direct cost reductions.

Step 6 – Estimate Total Benefit to Business Functions. If calculating the value of reducing FTE time, wait times, and unplanned work is relatively straightforward, figuring the full business impact of reducing overall calendar time for a critical process (from 4 weeks to 36 hours, say) requires more than a direct cost reduction calculation. It’s worth doing, though, because the value derived from better quality, shorter development times, and the like can substantially exceed the value of FTE hours saved through automation (see figure 3).

Figure 3. The secondary impacts of automating processes and increasing agility and consistency can be much larger than the value of the FTE hours saved.

You do this by asking IT customers to detail the benefits they see when processes are improved. Many IT KPIs can help here, such as the number of help desk tickets received in a given period, or the number and duration of Severity 1 issues.

We used this method at VMware when we automated dev/test provisioning and improved the efficiency of 600 developers by 20%. We achieved a direct cost reduction related to time and effort saved. But we found an even bigger impact, even if it was harder to quantify, in improved throughput, in always being able to say, “Yes” to business requests, and in enabling rapid prototyping.

Lessons Learned

With these steps, you can capture major process stages, tasks, actors, calendar time, work effort, and points of unplanned work, quantifying the business value of automating a process end-to-end – and making your case for end-to-end process automation all the stronger.

Key takeaways:

  • It’s possible to make a business case for automating end-to-end IT processes;
  • You can do this by applying concepts from lean manufacturing;
  • The concepts of wait time and unplanned work are central;
  • Efficiency driven cost reduction is only part of the equation, however;
  • To quantify the full value of agility, work with IT customers to gauge improvements in KPIs that reflect improved business outcomes.

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

5 Key Steps to Optimizing Service Quality for Cloud-Based Services

By: Pierre Moncassin

Freebies can be hard to come by on budget airlines – but I recently received one in the form of a free lesson about designing service quality.

It was a hot day and I was on one of those ‘no-frills’ regional flights. This was obviously a well-run airline. But my overall perception of the service quickly changed after I asked the attendant, who appeared to be serving refreshments generously to everyone on the flight, for a glass of water. The attendant asked for my ticket and declared very publicly that I had the ‘wrong category’ of airfare: no extras allowed, not even a plastic cup of plain water.

Looking past the clichés about the headaches of no-frills air travel, the episode offered a real lesson in service quality. The staff probably met all of their operational metrics, but that wasn’t enough to ensure a perception of even minimally acceptable quality. That perception was shaped by how the service had been designed in the first place.

The same paradox applies directly to cloud services. When discussing newly established cloud services with customers, I often hear that quality is one of their top three concerns. However, quality of service is often equated with meeting specific service levels – what I would call the delivery ‘effort’. I want to argue, though, that you can make all the effort you like and still be perceived as offering poor service, if you don’t design the service right.

Traditional Service – Effort Trumps Architecture

Both budget airlines and cloud-based services are based on a high level of standardization and economies of scale, and consumers are generally very sensitive to price/quality ratios. But if you offer customers a ‘cheap’ product that they regret buying, all of your efforts at driving efficiencies can be wasted. Design, in other words, impacts perception.

So how do you build quality into a cloud service without jacking up the price at the same time? The traditional approach might be to add ‘effort’: more stringent SLAs, more operational staff, higher-capacity hardware resources. All of those will help, but they ‘gold-plate’ the service rather than optimize its design, the equivalent of offering champagne to every passenger on the budget flight.

A Better Way

There is a more efficient approach – one that’s in line with the principles of VMware’s Cloud Operations: build quality upstream, when the service is defined and designed.

Here, then, are five recommendations that can help you Design First for Service Quality:

  1. From the outset, design the service end-to-end. In designing a service, we’re often tempted to focus on a narrow set of immediately important metrics (which might also be the easiest to measure) and ignore the broader perspective. But in the eyes of a consumer, quality hardly ever rests on a single metric. As you plan your initial design, combine ‘hard’ metrics (e.g. availability) with ‘soft’ metrics (e.g. customer surveys) that are likely to impact customer satisfaction down the line.
  2. Map your service dependencies. One common challenge with building quality in cloud services is that cloud infrastructure teams typically lack visibility into which part of the infrastructure delivers which part of the end user service. You can address this with mapping tools like VMware’s vCenter Infrastructure Navigator (part of the vCenter Operations Management Suite).
  3. Leverage key business-focused roles in your Cloud Center of Excellence. Designing a quality service requires close cooperation between a number of roles, including the Customer Relationship Manager, Service Owner, Service Portfolio Manager, and Service Architect (more on those roles here). In my view, Service Architects are especially key to building quality into the newly designed services, thanks to their ‘hybrid’ position between the business requirements and the technology. They’re uniquely able to evaluate the trade-offs between costs (i.e. infrastructure side) and perceived quality (business side). To go back to my airline, a good Service Architect might have decided at the design stage that a free glass of tap water is very much worth offering to ‘economy’ passengers (while Champagne, alas, is probably not).
  4. Plan for exceptions. As services are increasingly standardized and offered directly to consumers (for example, via VMware vCAC for self-provisioning), you’ll face an increasing need to handle exceptions. Perception of quality can be dramatically changed by how such user exceptions are handled. Exception handling can be built into the design, for example, via automated workflows (see this earlier blog about re-startable workflows); but also via automated interfaces with the service desk.
  5. Foster a true service culture. One major reason to set up a Cloud Center of Excellence, as recommended by VMware Cloud Operations, is to build a team totally dedicated to delivering high-quality services to the business. For many organizations, that requires a cultural change: moving to a truly consumer-centric perspective. From a practical point of view, this cultural change is primarily a mission for the Cloud Leader, who might, for example, set up frequent exchanges between the other Tenant Operations roles and lines of business.

In conclusion, designing quality in cloud services relies on a precise alignment between people (organization), processes, and technologies – and on ensuring that alignment from the very start.

Of course, that’s exactly the ethos of Cloud Operations, which shifts emphasis from effort at run time (less significant, because of automation) to effort at design time (only needs to be done once). But that shift, it’s important to remember, is only possible with a cultural change.

Key Takeaways:

  • Service quality is impacted by your initial design;
  • Greater delivery effort might make up for design issues, but this is an expensive way to ‘fix’ a service after the fact;
  • A Cloud Ops approach lets you design first for service quality;
  • Follow our recommended steps for optimizing service quality;
  • Never under-estimate the cultural change required to make the transition.

Follow @VMwareCloudOps and @Moncassin on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.


Outside-In Thinking: A Simple, But Powerful Part of Delivering IT as a Service

By: Paul Chapman, VMware Vice President Global Infrastructure and Cloud Operations

Moving to deliver IT as a Service can seem like a complex and challenging undertaking. Some aspects of the move do require changing the organization and adopting a radically different mindset. But, based on my experience helping lead VMware IT through the IT as a Service transition, there are also straightforward actions you can take that are simple and provide lasting and significant benefits.

Using outside-in thinking as a guiding principle is one of them.

Thinking Outside-In Versus Inside-Out

Here’s just one example that shows how outside-in thinking led us to a very different outcome than we otherwise would have achieved.

Until fairly recently, there was no way for a VMware ERP application user to self-serve a password reset. Raising a service request or calling the helpdesk were the only ways to do it. Like most organizations, we have a lot of transient and irregular users who forget their passwords, and this in turn created an average of 500+ password reset requests a month.

Each ticket or call, once received, took an elapsed time of about 15 minutes to resolve. That equated to one and a half people on our team tied up every day doing nothing but resolving ERP login issues and, even more importantly, to unhappy users placed in a holding pattern, waiting to log in and perform a function.

As the VMware employee base grew, so did the number of reset requests.

The traditional, brute force IT approach to this problem would have been to add more people (volume-based hiring) to handle the growing volume of requests. Another, more nuanced, approach would be to use task automation techniques to reduce the 15 minutes down to something much faster. In fact, the initial IT team response was an approach that leveraged task automation to reduce the resolution time from 15 minutes to 5. From an inside-out perspective, that was a 66% reduction in process time. By any measure, a big improvement.

However, from the user – or outside-in – perspective, elapsed time for password reset includes the time and trouble to make the request, the time the request spends in the service desk work queue, plus the resolution time. Seen that way, process improvement yielded a shift from hours plus 15 minutes, to hours plus 5 minutes. From an outside-in perspective, then, reducing reset task time from 15 minutes to 5 minutes was basically irrelevant.
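A toy calculation makes the point. The task times come from the example above, while the four-hour queue time is purely an assumption for illustration:

```python
# Outside-in view: the user's elapsed time is queue time + task time,
# so shrinking the task barely moves the total. The 4-hour queue is
# an assumed figure.
queue_min = 4 * 60        # time the ticket sits in the work queue
before = queue_min + 15   # manual reset: 255 minutes end to end
after = queue_min + 5     # automated reset: 245 minutes end to end

print(f"inside-out gain: {(15 - 5) / 15:.1%} of task time")  # -> 66.7%
print(f"outside-in gain: {(before - after) / before:.1%}")   # -> 3.9%
```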

Moving to Single Sign-On

Adopting that outside-in perspective, we realized that we were users of this system too and that eliminating the need for the task altogether was a far better approach than automating the task.

In this case, we moved our ERP application to our single sign-on portal, where VMware employees log on to dozens of business applications with a single set of credentials.

With single sign-on, those 500 plus IT support requests per month have gone away. IT has claimed back the time of 1.5 staff, and, more importantly, we’ve eliminated wait time and IT friction points for our users.

It’s a very simple example – but it illustrates how changing thinking can be a powerful part of delivering IT as a Service. Even before you reach anything like full game-changing digitization of IT service delivery, a shift in perspective can let you gain and build on relatively easy quick-wins.

Key Takeaways:

  • You can make big gains with small and simple steps en route to IT as a Service;
  • Take an outside-in perspective to IT;
  • Drive for new levels of self-service (a ‘zero touch,’ customer-centric world);
  • Think about operating in a “ticket-less” world where the “help-desk phone” should never ring;
  • Measure levels of agility and responsiveness in seconds/minutes not hours/days;
  • Adopt the mindset of a service-oriented and change-responsive organization;
  • And understand that the transition is evolutionary; make step-wise changes to get there.

To learn more about outside-in thinking for IT, view this webcast with Paul Chapman and Ian Clayton.

Follow @VMwareCloudOps and @PaulChapmanVM on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

How to Manage Your Cloud: Lessons and Best Practices Direct from CloudOps Experts

Rich Benoit, a Consulting Architect at VMware, and Kurt Milne, VMware’s Director of CloudOps Marketing, are experts when it comes to managing cloud infrastructures. But they didn’t acquire their expertise overnight. When it comes to cloud management, the process of transitioning can take time and leave even seasoned IT pros scratching their heads, asking, “What should I do first? How do I get started?”

Join Rich and Kurt this Thursday, December 12 at 10am PT as they share the fruits of their experience as cloud managers. This webinar will dive into tangible changes that organizations need to make to be cloud-ready, including how to:

  • Introduce new, specialized roles into the equation
  • Improve event, incident, and problem management processes
  • Establish analytics to provide visibility into the cloud

Wondering what to do and how to get started with your cloud infrastructure? Register now to save your spot!

We’ll also be live-tweeting the event via @VMwareCloudOps – follow us for updates. Also join the conversation by using the #CloudOps and #SDDC hashtags. We look forward to seeing you there!