Author Archives: CloudOps Team

Using vCloud Suite to Streamline DevOps

By: Jennifer Galvin

A few weeks ago I was discussing mobile app development and deployment with a friend. This particular friend works for a company that develops mobile applications for all platforms on a contract-by-contract basis. It’s a good business. But one of the key challenges they have is the time and effort required to install a client’s development and test environment so that they can start development. Multiple platforms need to be provisioned. And development and testing tools that may be unique to each platform must be installed. This results often in needing to maintain large teams with specialized skills and having to maintain a broad range of dev/test environments.

I have always been aware that VMware’s vCloud Suite can speed up deployment of applications, (even complex application stacks), but I didn’t know if long setup times were common in the mobile application business. So I started to ask around:

“What was the shortest time possible it would take for your development teams to make a minor change to a mobile application, on ALL mobile platforms – Android, iPhone, Windows, Blackberry, etc?”

comic part 1 The answers ranged between “months” and “never”.

Sometime later, after presenting VMware’s Software Defined Datacenter vision to a tech meetup in Washington, D.C. a gentleman approached me to discuss the question posed. While he liked the SDDC vision, he wondered if I knew of a way to use vCloud Suite and software controlled everything to speed up mobile development. So I decided to sketch out how the blueprints and automated provisioning capabilities of the vCloud Suite could help speed up application development on multiple platforms.

First, let’s figure out why this is so hard in the first place – after all, mobile development SDK’s are frameworks, and while it takes a developer to write an app, the SDK is still doing a lot of the heavy lifting. So why is this still taking so long? As it turns out, there are some major obstacles to deal with:

  • Mobile applications always need a server-side application to test against: mobile applications interact with server-side applications, and unless your server-side application is already a stable, multi-tenant application that can deal with the extra performance drain of 32 developers running amok (and you don’t mind upsetting your existing customers), you’re going to need to point them at a completely separate environment.
  • The server-side application is complex and lengthy to deploy: A 3-tier web application with infrastructure (including networking and storage), scrubbed production database data to provide some working data, and front-end load balancing is the same kind of deployment you did when the application initially went into production. You’re not going to start development on your application any time soon unless this process speeds up (and gets more automated).

Let’s solve these problems by getting a copy of the application (and a copy of production-scrubbed data) out into a new Testing area so the developers can get access to it, fast. vCloud Suite provides a framework for the server-side application developers to express its deployment as a blueprint, capable of deploying not just the code, but all the properties to automate the deployment, and consumes capacity from on-premises resources as well as those from the public cloud. That means that when it comes time to deploy a new copy (with the database refreshed and available), it’s as easy as a single click of a button.

comic part 2comic part 3Since the underlying infrastructure is virtualized, compute resources are used or migrated to make room for the new server-side application. Other testing environments can even be briefly powered down so that this testing (which is our top priority) can occur.

Anyone can deploy the application, and what used to take hours and teams of engineers can now be done by one person. However, we are still aiming to deploy this on all mobile platforms. In order to put all of our developers on this challenge, we first need to ensure they have the right tools and configurations. In the mobile world, that means more than just installing a few software packages and adjusting some settings. In some cases, that could mean you need new desktops, with entirely different operating systems.

Not every mobile vendor offers an SDK on all operating systems, and in fact, there isn’t one operating system that’s common to the top selling mobile phones today.

For example, you may only develop iOS applications using xCode, which runs only on Mac OSX. Both Windows and Android rely on an SDK compatible with Windows, and each has dependencies on external libraries to function (especially Android). Many developers typically favor MacBooks running VMware Fusion to accommodate for all of these different environments, but what if you decide that to re-write the application quickly, you require some temporary contractors? Those contractors are going to need those development environments with the right SDKs and testing.

This is also where vCloud Suite shines. It provides Desktop as a Service to those new contractors. The same platform that allowed us to provision the entire server-side application allows us to provision any client-side resources they might need.

By provisioning all of the infrastructure at once, we are now ready to redevelop our mobile app. We can spend developer time development and testing, making it the best app possible, instead of wasting resources for work environment deployment.

comic part 4

Now, let’s think back to that challenge I laid out earlier. Once you start deploying your applications using VMware’s vCloud Suite, how long will it take to improve your mobile applications across all platforms? I bet we’re not measuring that time in months any longer. Instead, mobile applications are improved in just a week or two.

Your call to action is clear:

  • Implement vCloud Suite on top of your existing infrastructure and public cloud deployments.
  • Streamline your application development process by using vCloud suite to deploy both server and client-side applications, dev and test environments, dev and test tools, and sample databases – for all platforms – at the click of a button.

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

A Critical Balance of Roles Must Be in Place in the Cloud Center of Excellence

By: Pierre Moncassin

There is a rather subtle balance required to make a cloud organization effective – and, as I was reminded recently, it is easy to overlook it.

One key requirement to run a private cloud infrastructure is to establish a dedicated team i.e. a Cloud Center of Excellence. As a whole this group will act as an internal service provider in charge of all technical and functional aspects of the cloud, but also deal with the user-facing aspects of the service.

But there is an important dividing line within that group: the Center of Excellence itself is separated between Tenant Operations and Infrastructure Operations. Striking a balance between these teams is critical to a well-functioning cloud. If that balance is missing, you may encounter significant inefficiencies. Let me show you how that happened to two IT organizations I talked with recently.

First, where is that balance exactly?

If we look back at that Cloud Operating Model (described in detail in ‘Organizing for the Cloud‘), we have not one, but two teams working together: Tenant Operations and Infrastructure Operations.

In a nutshell, Tenant Operations own the ‘customer-facing’ role. They want to work closely with end-users. They want to innovate and add value. They are the ‘public face’ of the Cloud Center of Excellence.

On the other side, Infrastructure Ops only have to deal one customer – Tenant Operations. In addition to this, they also have to handle hardware, vendor relationships and generally, the ‘nuts and bolts’ of the private cloud infrastructure.

Cloud Operating Model
But why do we need a balance between two separate teams? Let’s see what can happen when that balance is missing with two real-life IT organization I met a little while back. For simplicity I will call them A and B – both large corporate entities.

When I met Organization A, it had only a ‘shell’ Tenant Operations function. In other words, their cloud team was almost exclusively focused on infrastructure. The result? Unsurprisingly, they scored fairly high on standardization and technical service levels. End users either accepted a highly standardized offering, or had to go through loops to negotiate obtained exceptions – neither option was quite satisfactory. Overall, Organization A struggled to add recognizable value to their end-users: “we are seen as a commodity”. They lacked a well-developed Tenant Organization.

Organization B had the opposite challenge. They belonged to a global technology group that majors on large-scale software development. Application development leaders could practically set the rules about what infrastructure could be provisioned. Because each consumer group yielded so much influence, there was practically a separate Tenant Operation team for each software unit.

In contrast, there was no distinguishable Infrastructure Ops function. Each Tenant Operations team could dictate separate requirements. The overall intrastructure architecture lacked standardization – which risked defeating the purpose of a cloud approach in the first place. With a balance tilted towards Tenant Operations, Organization B probably scored highest on customer satisfaction – but only as long as customers did not have to bear the full cost of non-standard infrastructure.

******

In sum having two functionally distinct teams (Tenants and Infrastructure) is not just a convenient arrangement, but a necessity to operate a private cloud effectively. There should be ongoing discussions and even negotiation between the two teams and their leaders.

In order to foster this dual structure, I would recommend:

  1. Define a chart for both teams that clearly outlines their relative roles and ‘rules of engagement.’
  2. Make clear that each team’s overall objectives are aligned, although the roles are different. That could be reflected through management objectives for the leaders of each team. However, this also requires some governance in place to give them the means to resolve their discussions.
  3. To help customers realize the benefits of standardization, consider introducing service costing (if not already in place) – so that the consumer may realize the cost of customization.

Follow @VMwareCloudOps and @Moncassin on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

Assembling Your Cloud’s Dream Team

By: Pierre Moncassin

Putting together a Cloud Center of Excellence (COE) is not about recruiting ‘super-heroes’ – but a matter of balancing skills and exploiting learning opportunities. 

On several occasions, I’ve heard customers who are embarking on the journey to the cloud ask: “How exactly do you go about putting together a ‘dream team’ capable of launching and delivering cloud services in my organization?” VMware Cloud Operation’s advice is fairly straightforward: put together a core team known as the Cloud Center of Excellence as early as possible. However, these team members will need a broad range of skills across cloud management tools, virtualization, networking and storage, as well as solid processes and organizational knowledge. Not to mention, sound business acumen as they will be expected to work closely with the business lines (far more so than traditional IT silos).

This is why at first sight, the list of skills can seem daunting. It need not be. The good news is that there is no need to try to recruit ‘super-heroes’ with impossibly long resumes. The secret is to balance skills, and taking advantage of several important opportunities to build skills.

***

First let’s have a closer look at the skills profiles for the Cloud COE as described in our whitepaper  ‘Organizing for the Cloud’. I won’t go into the specifics of each role, but as a starter, here are some core technical skills required (list not inclusive):

  • Virtualization technologies: vSphere
  • Provisioning with a combination of vCAC or VCD
  • Workflows Automation: VCO
  • Configuration and Compliance: VCM
  • Monitoring/Event Management: VC OPS
  • Applications, storage, virtual networks, applications.

But the team also needs members with broad understanding of processes and systems engineering principles, customer facing service development skills, and a leader with  sound knowledge of financial and business principles.

Few organizations will have individuals with all these skills ready from day one – but fortunately they do not need to. A cloud COE is a structure that will grow over time, and the same applies to the skills base. For example, vCO scripting skills might be required at some stage in order to develop advanced automation scripts – but that level of automation might not be required until the second or third phase of the cloud implementation, after the workflows are established. However, we need some planning to have the skills available when needed.

Make the most out of on-site experts:

Organizations usually start their cloud journey with project team as a transitional structure. They generally have consultants from VMware or a consultancy partner on-site working alongside them.  This offers an excellent opportunity to both accelerate the cloud project, and to allow internal hires to absorb critical skills from those experts. However – and this is an important caveat – the knowledge transfer needs to be intentional. Organizations can’t expect a transfer of knowledge and skills to happen entirely unprompted. Internal teams may not always have the availability or training to absorb cloud-related skills ‘spontaneously’ during the project. Ad hoc team members often have emergencies from their ‘day job’ (i.e. their business-as-usual responsibilities) that interrupt their work with the on-site experts.  So I advise to plan knowledge exchanges early in the project. That will ensure that external vendors and consultants train the internal staff, and in turn, project team members can then transfer their knowledge to the permanent cloud team.

Get formal training:

Along with informal on-site knowledge transfer, it can be a good idea to plan formal classroom-based VMware training and certifications. Compared to a project-based knowledge exchange, formal training generally provides a deeper understanding of the fundamentals, and is also valuable to employees from a personal development point of view. Team members may have additional motivation to attend formal courses that are recognized in the industry, especially if it leads to a recognized qualification such as VMware Certified Professional (VCP).

Build skills during the pilot project:

Many cloud projects begin with a pilot phase where a pilot (i.e. a prototype installation) is deployed with a cut-down functionality. This is a great opportunity to build skills in a ‘safe’ environment. Core team members get the chance to familiarize themselves both with the new technology and stakeholders. For example, a Service Catalog becomes far more real once potential users and administrators can see and touch the provisioning functions with a tool like vCenter Automation Center. For technical specialists, the pilot can be a chance to learn new technologies and overcome any fear of change. Building a prototype early in the cloud project can also give teams the opportunity to play around with tools and explore their capabilities.

A summary of how your IT organization can structure its Cloud Center of Excellence and prepare it for success:

  1. Plan to build up skills over time. All technical skills are not required in-depth from day one. Rather, look at a blend of technical skills that will grow and evolve as the cloud organization matures. The Cloud team is a ‘learning organization’.
  2. Plan ahead. Schedule formal and informal knowledge transfers – including formal training – between internal staff and external vendors and consultants.
  3. Make the most out of a pilot project. Create a safe learning environment where team members and stakeholders can acquire skills at their own pace.

Follow @VMwareCloudOps and @Moncassin on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

From “Routine” Patch Management to Compliance-as-a-Service: How Cloud Transforms a Costly Maintenance Function into an Innovative Value-Adding Service

By: Pierre Moncassin

You might not think of Patch Management as the coolest part of running an IT operation (cloud or non-cloud). It is a challenging, often time-consuming part of keeping infrastructure secure. And in many industries, it is a critical dependency for meeting regulatory requirements. As a result, IT managers can be forgiven for thinking of it as hard work that just needs to get done.

But recent discussions with global vCloud customers have given me a new perspective on the Patch Management function. These customers’ service owners have quickly grasped some new possibilities that come with an automated cloud infrastructure, one of which involves exploring how to offer patch management services to their end users. While this is not a new concept for managed providers, it’s a step change for an internal IT department – and one that turns Patch Management into a pretty exciting role.

Let us look at the key elements of that change:

To start with – we now see the patching toolset as a key component of end-user service. There is no point in offering patching services in a cloud environment without reliable automation to deliver them – the core toolset for patch reporting is of course VMware’s vCenter Configuration Manager, complemented by vCenter Orchestration for automation of specific remediation workflows.

But beyond the automated patch level checking and deployment, VCM also opens the possibility of adapting compliance reports for each user. Without going into too much detail here, it’s relatively straight forward to configure VCM in a ‘service-aware’ structure. VCM allows administrators to define virtual machine groups that match, for example, the virtual servers assigned to a specific division. Filters can be further set to extract only the patch information relating to that end user. Then your VCM reports can be exported into customer-facing reporting tools to provide customized compliance reports.

Make that switch to a service mindset. Of course, bringing the idea of ‘service’ into the patching activity means a mindset shift. It implies service levels and well-defined expectations on both sides. And it invites a financial perspective, too: the cost of delivering the services can be evaluated and potentially communicated back to the consumer. As the cloud organization matures, it may consider charging for those specific services.

Moving to a service approach for patching, then, can be a stepping-stone towards delivering further value-added services to the end user: not just routine patching, but other compliance services that can become more and more visible. Again, patching has moved well beyond its traditional role.

Next, integrate patching with Self-Service capabilities. As with any on-demand service, patching-as-a-service will need to be published in the Service Catalog. In all likelihood, patch management would be offered as an option complementing another service (e.g. server or application provisioning). There are many ways to publish such a service – on the service portal, for example, (if there is such a dedicated portal), or directly within vCenter Automation Center (vCAC). In vCAC, a patch management service could be made available either at provisioning time, or, potentially, at runtime when a machine is already running (vCAC for example can issue a ticket to the service desk to make the request).

Beyond the Service Catalog, there is also interesting integration potential if patching requirements are ‘pre-built’ into the vCAC blueprint. In a nutshell, vCAC can be configured to select the patching option that will be applied by vCenter Configuration Manager at runtime. In my view, that type of integration has considerable potential – potential I’ll explore in a separate blog in the near future.

Lastly, communicate. It’s an obvious part of a service mindset, but also one that’s easy to overlook: if you are adopting this new way of looking at patch management, you need to ensure that two-way communication takes place with your end users, whether to define the service, to publish it, or during its delivery. This is an extended, if not new function for the service owners.

In sum:

Patch Management – often associated with ‘hard work’ done in the background – is transformed with Cloud Operations:

  • ‘Hard work’ becomes ‘service design work’ – a common theme across VMware Cloud Operations
  • Team focus shifts from ‘keeping the servers running’ to a more creative activity closely engaged with consumers – offering a new, value-adding service
  • Patching services set the foundation for more comprehensive services under the umbrella of Compliance-as-a-service
  • Technical integrations between self-service provisioning and patch management can be leveraged to open new avenues for self-service automation

Follow @VMwareCloudOps and @Moncassin on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

Aligned Incentives – and Cool, Meaningful New Jobs! – In the Cloud Era

By: Paul Chapman, VMware Vice President Global Infrastructure and Cloud Operations

Transforming IT service delivery in the cloud era means getting all your technical ducks in a row. But those ducks won’t ever fly if your employees do not have aligned incentives.

Incentives to transform have to be aligned from top to bottom – including service delivery strategy, operating model, organizational construct, and individual job functions. Otherwise, you’ll have people in your organization wanting to work against changes that are vital for success, and in some cases almost willing for them to fail.

This can be a significant issue with what I call ‘human middleware.’ It’s that realm of work currently done by skilled employees that is both standard and repeatable at the same time: install a database; install an operating system; configure the database; upgrade the operating system; tune the operating system, etc..

These roles are prime for automation and/or digitization – allowing the same functions to be performed more efficiently, more predictably, game-changingly faster, and giving the IT organization the flexibility it needs to deliver IT as a Service.

Of course, automation also offers people in these roles the chance to move to more meaningful and interesting roles – but therein lies the aligned incentive problem. People who have built their expertise in a particular technology area over an extended period of time are less likely to be incentivized to give that up and transition to doing something ‘different.’

Shifting Roles – A VMware Example

Here’s one example from VMware IT – where building out a complete enterprise SDLC instance for a complex application environment once took 20 people 3-6 weeks.

We saw the opportunity to automate the build process in our private cloud and, indeed, with blueprints, scripting, and automation, what took 20 people 3-6 weeks, now takes 3 people less than 36 hours.

But shifting roles and aligning incentives was also very critical to making this happen.

Here was our perspective: the work of building these environments over and over again was not hugely engaging. Much of it involved coordinating efforts and requesting task work via ticketing systems, but people were also entrenched in their area of expertise and years of gained experience, so they were less inclined to automate their own role in the process. The irony was that in leveraging automation to significantly reduce the human effort and speed up service delivery, we could actually free people up to do more meaningful work – work that in turn would be much more challenging and rewarding for them.

In this case, employees went from doing standard repeatable tasks to high order blueprinting, scripting, and managing and tuning the automation process. In many cases, though, these new roles required new but extensible skills. So in order to help them be successful, we made a key decision: we would actively help (in a step-wise, non-threatening, change-management-focused way) the relevant employees grow their skills. And we’d free them up from their current roles to focus on the “future” skills that were going to be required.

Three New Roles

So there’s the bottom line incentive that can shift employees from undermining a transformation to supporting it: you can say, “yes, your role is changing, but we can help you grow into an even more meaningful role.”

And as automation frees people up and a number of formerly central tasks fall away, interesting new roles do emerge – here, for example, are three new jobs that we now have at VMware:

  •  Blueprint Designer – responsible for designing and architecting blueprints for building the next generation of automated or digitized services.
  •  Automation Engineer – responsible for engineering scripts that will automate or digitize business process and or IT services.
  •  Services Operations Manager – responsible for applications and tenant operation services in the new cloud-operating model.

The Cloud Era of Opportunity

The reality is that being an IT professional has always been highly dynamic. Of the dozen or so different IT positions that I’ve held in my career, the majority don’t exist anymore. Constant change is the steady state in IT.

Change can be uncomfortable, of course. But given its inevitability, we shouldn’t – and can’t – fight it. We should get in front of the change and engineer the transformation for success. And yet too frequently we don’t – often because we’re incented to want to keep things as they are. Indeed, misaligned incentives remain one the biggest impediments to accelerating change in IT.

We can, as IT leaders, shift those incentives, and with them an organization’s cultural comfort with regular change. And given the positives that transformation can bring both the organization and its employees, it’s clear that we should do all we can to make that shift happen.

Major Takeaways:

  • Aligning incentives is a key part of any ITaaS transformation
  • Automation will eliminate some roles, but also create more meaningful roles and opportunities for IT professionals
  • Support, coaching, and communication about new opportunities will help accelerate change
  • Defining a change-management strategy for employee freedom and support for their transition are critical for success

Follow @VMwareCloudOps and @PaulChapmanVM on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

The Top 10 CloudOps Blogs of 2013

What a year it’s been for the CloudOps team! Since launching the CloudOps blog earlier this year, we’ve published 63 items and have seen a tremendous response from the larger IT and cloud operations community.

Looking back on 2013, we wanted to highlight some of the top performing content and topics from the CloudOps blog this past year:

1. “Workload Assessment for Cloud Migration Part 1: Identifying and Analyzing Your Workloads” by Andy Troup
2. “Automation – The Scripting, Orchestration, and Technology Love Triangle” by Andy Troup
3. “IT Automation Roles Depend on Service Delivery Strategy” by Kurt Milne
4. “Workload Assessment for Cloud Migration, Part 2: Service Portfolio Mapping” by Andy Troup
5. “Tips for Using KPIs to Filter Noise with vCenter Operations Manager” by Michael Steinberg and Pierre Moncassin
6. “Automated Deployment and Testing Big ‘Hairball’ Application Stacks” by Venkat Gopalakrishnan
7. “Rethinking IT for the Cloud, Pt. 1 – Calculating Your Cloud Service Costs” by Khalid Hakim
8. “The Illusion of Unlimited Capacity” by Andy Troup
9. “Transforming IT Services is More Effective with Org Changes” by Kevin Lees
10. “A VMware Perspective on IT as a Service, Part 1: The Journey” by Paul Chapman

As we look forward to 2014, we want to thank you, our readers, for taking the time to follow, share, comment, and react to all of our content. We’ve enjoyed reading your feedback and helping build the conversation around how today’s IT admins can take full advantage of cloud technologies.

From IT automation to patch management to IT-as-a-Service and beyond, we’re looking forward to bringing you even more insights from our VMware CloudOps pros in the New Year. Happy Holidays to all – we’ll see you in 2014!

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

The Case for Upstream Remediation: The Third Pillar of Effective Patch Management for Cloud Computing

By: Pierre Moncassin

Patch Management fulfills an essential function in IT operations: it keeps your multiple software layers up to date, as free of vulnerabilities as possible, and consistent with vendor guidelines.

But scale that to an ever-dynamic environment like a VMware-based cloud infrastructure, and you have an extra challenge on your hands. Not only do the patches keep coming, but end users keep provisioning and amending their configuration. So how to keep track of all these layers of software?

In my experience there are three pillars that need to come together to support effective patch management in the Cloud. The first two, policy and automation, are fairly well established. But I want to make a case for a third: upstream remediation.

As a starting point, you need a solid patching policy. This may sound obvious, but the devil is in the details. Such a policy needs to be defined and agreed across a broad spectrum of stakeholders, starting with the security team. This is typically more of a technical document than a high-level security policy, and it’s far more detailed than, say, a simple rule of thumb (e.g. ‘you must apply the latest patch within X days’).

A well-written policy must account for details such as exceptions (e.g. how to remedy non-compliant configurations); security tiers (which may have different patching requirements); reporting; scheduling of patch deployment, and more.

The second pillar is Automation for Patch Management. While the need for a patching policy is clearly not specific to Cloud Infrastructure, its importance is magnified in an environment where configurations evolve rapidly and automation is pervasive. And such automation would obviously make little sense without a well-defined policy. For this, you can use a tool like VMware’s vCenter Configuration Manager (VCM).

VCM handles three key aspects of patching automation:

  1. Reporting – i.e. verifying patch levels on selected groups of machines
  2. Checking for bulleting updates on vendor sites (e.g. Microsoft)
  3. Applying patches via automated installation

In a nutshell, VCM will automate both the detection and remediation of most patching issues.

However, one other key step is easily overlooked – upstream remediation. In a cloud infrastructure, we want to remediate not just the ‘live’ configurations, but also the templates used for provisioning. This will ensure that the future configurations being provisioned are also compliant. Before the ‘cloud’ era, administrators who identified a patching issue might make a note to update their standard builds in the near future – but there would rarely be a critical urgency. In cloud environments where new machines might be provisioned say, every few seconds, this sort of updates need to happen much faster.

As part of completing any remediation, you also need to be sure to initiate a procedure to carry out updates to your blueprints, as well as to your live workloads (see the simplified process view above).

You need to remember, though, that remediating the images will depend on different criteria from the ‘live’ workload and, depending on the risk, may require a change request and related approval. You need to update the images, test that the updates are working, and then close out the change request.

In sum, this approach reflects a consistent theme across Cloud Operations processes: that the focus of activity is shifted upstream towards the demand side. This also applies to Patch Management: remediation needs to be extended to apply upstream to the provisioning blueprints (i.e. images).

Key takeaways:

  • Policy and automation are two well-understood pillars of patch management;
  • A less well-recognized third pillar is upstream remediation;
  • Upstream remediation addresses the compliance and quality of future configurations;
  • This reflects a common theme in Cloud Ops processes: that focus shifts to the demand side.

Follow @VMwareCloudOps and @Moncassin on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

Understanding Process Automation: Lean Manufacturing Lessons Applied to IT

by: Mike Szafranski

With task automation, it is pretty simple to calculate that it is worth taking 2 hours to automate a 10-minute task if you perform that task more than 12 times. Even considering the fixed and variable costs of the automation solution, the math is pretty straightforward.

But the justification for automating more complex processes composed of dozens of ‘10 minute tasks’ completed by different actors – including the inevitable scheduling and wait time between each task – is a bit more complex. Nonetheless, an approach exists.

You can find it laid out in Kim, Behr, and Spafford’s modern classic of business fiction, The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win [IT Revolution Press, 2013], in which the authors show how the principals of lean manufacturing are directly applicable to IT process automation.

So what lessons do we learn when building a case for process automation by applying lean manufacturing principles to IT Ops? Let’s take a look.

Simple Steps Build the Business Case

First, you need to break the process you’re interested in into its constituent parts.

Step 1 – Document Stages in the Process and Elapsed Time. Through interviews, identify major process stages and then document the clock time elapsed for each. Note, use hard data for elapsed time if possible. People involved in the process rarely have an accurate perception of how long things really take. Look at process artifacts such as emails, time stamps on saved documents, configuration files, provisioning, or testing tool log files to measure real elapsed time.

Step 2 – Document Tasks and Actors. Summarize what gets accomplished at each stage and, most importantly, detail all the tasks and record which teams perform them. If a task involves multiple actors working independently with a handoff, that task should be broken down into sub-tasks.

Step 3 – Document FTE Time. Record the work effort required for each task. We’ll call that the Full Time Equivalent (FTE). This is the time it takes to do the actual task work, assuming no interruptions, irregularities, or rework.

Step 4 – Document Wait Time. Understanding wait time is critical to building a case for process automation. If actors are busy, or if there are handoffs between actors, then elapsed time is often multiple times longer than FTE time. This is because at each handoff, the task must sit in queue until a resource is ready to process the task.

After taking these steps, you can summarize in a chart similar to this.

In Lean Manufacturing, the concept of wait time or queue time has a mathematical formula [see chapter 23 of The Phoenix Project]. The definition is:

The formula, of course, offers hard proof of what you already knew – that the busier you are, the longer it takes to get new work done. With multiple actors on a task, each can contribute to wait time, with the amount they contribute depending on how busy they are.

In the example below, there are five separate teams (security, network, dev, QA and VM) involved in the Validate Firewall step in the flow. Each team is also busy with other tasks. 

Figure 2. In a manually constructed environment, the network settings, firewall rules, and application ports need to be validated. More often than not, they need to be adjusted due to port conflicts or firewall rules. Wait times correlate strongly with % ultilization.

As you can see, the time spent by FTEs is 5.5 hours, which is only around 15% of the clock time. Clearly, with complex tasks, FTE is only a part of the story.

Step 5 – Account for Unplanned Work. Unplanned work occurs when errors are found, requiring a task from an earlier step in the process to be reworked or fixed.

In complex automation, unplanned work is another reality that complicates the process and increases FTE time. It also dramatically impacts clock time – in two ways. First, there’s the direct impact of additional time spent waiting for the handoff back upstream in the process. Second, and even more dramatic, is the opportunity cost. Planned work tasks need to stop while the process actor sets things aside and addresses the unplanned work. Unplanned work can thus have a multiplier effect, causing cascading delays up and down the process flow.

One aim of automation, of course, is to reduce unplanned work – and that reduction that can also be calculated, further adding to the business case for process automation. Indeed, studies have shown that, currently, unplanned work consumes 17% of a typical IT budget.

Process Automation Can Offer More Than Cost Reduction

But there’s potentially even more to the story than a complete picture of IT work and detailed accounting of reduced work effort and timesavings. The full impact of process automation can include:

  • Improved throughput
  • Enabling rapid prototyping
  • Higher quality
  • Improved ability to respond to business needs

The cumulative impact of these can be substantial. Indeed, it can easily exceed the total impact of direct cost reductions.

Step 6 – Estimate total benefit to business functions. If calculating the value of reducing FTE, wait times, and unplanned work is relatively straight forward, figuring the full business impact of reducing overall calendar time for a critical processes (from 4 weeks to 36 hours, say) requires more than a direct cost reduction calculation. It’s worth doing, though, because the value derived from better quality, shorter development times, etc., can substantially exceed the value of FTE hours saved through automation (see figure 3). 

Figure 3. The secondary impacts of automating processes and increasing agility and consistency can be much larger than the value of the FTE hours saved.

You do it by asking IT customers to detail the benefits they see when processes are improved. There are many IT KPIs that can help here, such as the number of help desk tickets received in a specific period, or the number and length of Severity 1 IT issues.

We used this method at VMware when we automated dev/test provisioning and improved the efficiency of 600 developers by 20%. We achieved a direct cost reduction related to time and effort saved. But we found an even bigger impact, even if it was harder to quantify, in improved throughput, in always being able to say, “Yes” to business requests, and in enabling rapid prototyping.

Lessons Learned

With these steps, you can capture major process stages, tasks, actors, calendar time, work effort, and points of unplanned work, quantifying the business value of automating a process end-to-end – and making your case for end-to-end process automation all the stronger.

Key takeaways:

  • It’s possible to make a business case for automating end-to-end IT processes;
  • You can do this by applying concepts from lean manufacturing;
  • The concepts of wait time and unplanned work are central;
  • Efficiency driven cost reduction is only part of the equation, however;
  • To quantify the full value of agility, work with IT customers to gauge improvements in KPIs that reflect improved business outcomes.

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

5 Key Steps to Optimizing Service Quality for Cloud-Based Services

By: Pierre Moncassin

Freebies can be hard to come by on budget airlines – but I recently received one in the form of a free lesson about designing service quality.

It was a hot day and I was on one of these ‘no-frills’ regional flights. This was obviously a well-run airline. But my overall perception of the service quickly changed after I asked for a glass of water from the attendant – who appeared to be serving refreshments generously to everyone on the flight. The attendant asked for my ticket and declared very publicly that I had the ‘wrong category’ of airfare: no extras allowed – not even a plastic cup filled with plain water.

Looking past the clichés about the headaches of no-frills airline travel, it did offer me a real lesson in service quality. The staff probably met all of their operational metrics – but that wasn’t enough to ensure an overall perception of a minimal acceptable quality. That impression was being impacted by how the service had been designed in the first place.

The same paradox applies directly to cloud services. When discussing newly established cloud services with customers, I often hear that quality is one of their top three concerns. However, quality of service is often equated with meeting specific service levels – what I would call the delivery ‘effort’. I want to argue, though, that you can make all the effort you like and still be perceived as offering poor service, if you don’t design the service right.

Traditional Service – Effort Trumps Architecture

Both budget airlines and cloud-based services are based on a high level of standardization and economies of scale, and consumers are generally very sensitive to price/quality ratios. But if you offer customers a ‘cheap’ product that they regret buying, all of your efforts at driving efficiencies can be wasted. Design, in other words, impacts perception.

So how do you build quality into a cloud service without jacking up the price at the same time? The traditional approach might be to add ‘effort’ – more stringent SLA’s, more operational staff, higher-capacity hardware resources. All of those will help, but they will also ‘gold-plate’ the service more than optimize its design – the equivalent of offering champagne to every passenger on the budget flight.

A Better Way

There is a more efficient approach – one that’s in line with the principles of VMware’s Cloud Operations: build quality upstream, when the service is defined and designed.

Here, then, are five recommendations that can help you Design First for Service Quality:

  1. From the outset, design the service end-to-end. In designing a service, we’re often tempted to focus on a narrow set of immediately important metrics (which might also be the easiest to measure) and ignore the broader perspective. But in the eyes of a consumer, quality hardly ever rests on a single metric. As you plan your initial design, combine ‘hard’ metrics (e.g. availability) with ‘soft’ metrics (e.g. customer surveys) that are likely to impact customer satisfaction down the line.
  2. Map your service dependencies. One common challenge with building quality in cloud services is that cloud infrastructure teams typically lack visibility into which part of the infrastructure delivers which part of the end user service. You can address this with mapping tools like VMware’s vCenter Infrastructure Navigator (part of the vCenter Operations Management Suite).
  3. Leverage key business-focused roles in your Cloud Center of Excellence. Designing a quality service requires close cooperation between a number of roles, including the Customer Relationship Manager, Service Owner, Service Portfolio Manager, and Service Architect (more on those roles here). In my view, Service Architects are especially key to building quality into the newly designed services, thanks to their ‘hybrid’ position between the business requirements and the technology. They’re uniquely able to evaluate the trade-offs between costs (i.e. infrastructure side) and perceived quality (business side). To go back to my airline, a good Service Architect might have decided at the design stage that a free glass of tap water is very much worth offering to ‘economy’ passengers (while Champagne, alas, is probably not).
  4. Plan for exceptions. As services are increasingly standardized and offered directly to consumers (for example, via VMware vCAC for self-provisioning), you’ll face an increasing need to handle exceptions. Perception of quality can be dramatically changed by how such user exceptions are handled. Exception handling can be built into the design, for example, via automated workflows (see this earlier blog about re-startable workflows); but also via automated interfaces with the service desk.
  5. Foster a true service culture. One major reason to setup a Cloud Center of Excellence as recommended by VMware Cloud Operations is to build a team totally dedicated to delivering high-quality services to the business. For many organizations, that requires a cultural change – moving to a truly consumer-centric perspective. From a practical point of view, the cultural change is primarily a mission for the Cloud Leader who might, for example, want to set up frequent exchanges between the other Tenant Operations roles and lines of business.

In conclusion, designing quality in cloud services relies on a precise alignment between people (organization), processes, and technologies – and on ensuring that alignment from the very start.

Of course, that’s exactly the ethos of Cloud Operations, which shifts emphasis from effort at run time (less significant, because of automation) to effort at design time (only needs to be done once). But that shift, it’s important to remember, is only possible with a cultural change.

Key Takeaways:

  • Service quality is impacted by your initial design;
  • Greater delivery effort might make up for design issues, but this is an expensive way to ‘fix’ a service after the fact;
  • A Cloud Ops approach lets you design first for service quality;
  • Follow our recommended steps for optimizing service quality;
  • Never under-estimate the cultural change required to make the transition.

Follow @VMwareCloudOps and @Moncassin on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

 

Outside-In Thinking: A Simple, But Powerful Part of Delivering IT as a Service

By: Paul Chapman, VMware Vice President Global Infrastructure and Cloud Operations

Moving to deliver IT as a Service can seem like a complex and challenging undertaking. Some aspects of the move do require changing the organization and adopting a radically different mindset. But, based on my experience helping lead VMware IT through the IT as a Service transition, there are also straightforward actions you can take that are simple and provide lasting and significant benefits.

Using outside-in thinking as a guiding principle is one of them.

Thinking Outside-In Versus Inside-Out

Here’s just one example that shows how outside-in thinking led us to a very different outcome than we otherwise would have achieved.

Until fairly recently, there was no way for a VMware ERP application user to self-serve a password reset. Raising a service request or calling the helpdesk were the only ways to do it. Like most organizations, we have a lot of transient and irregular users who would forget their passwords, and this in turn created an average of 500+ password reset requests a month.

Each ticket or call, once received, took an elapsed time of about 15 minutes to resolve. That equated to one and a half people on our team tied up every day doing nothing but resolving ERP log in issues, and, even more importantly, to unhappy users being placed in a holding pattern waiting to log in and perform a function.

As the VMware employee base grew, so did the number of reset requests.

The traditional, brute force IT approach to this problem would have been to add more people (volume-based hiring) to handle the growing volume of requests. Another, more nuanced, approach would be to use task automation techniques to reduce the 15 minutes down to something much faster. In fact, the initial IT team response was an approach that leveraged task automation to reduce the resolution time from 15 minutes to 5. From an inside-out perspective, that was a 66% reduction in process time. By any measure, a big improvement.

However, from the user – or outside-in – perspective, elapsed time for password reset includes the time and trouble to make the request, the time the request spends in the service desk work queue, plus the resolution time. Seen that way, process improvement yielded a shift from hours plus 15 minutes, to hours plus 5 minutes. From an outside-in perspective, then, reducing reset task time from 15 minutes to 5 minutes was basically irrelevant.

Moving to Single Sign-On

Adopting that outside-in perspective, we realized that we were users of this system too and that eliminating the need for the task altogether was a far better approach than automating the task.

In this case, we moved our ERP application to our single sign-on portal, where VMware employees log on to dozens of business applications with a single set of credentials.

With single sign-on, those 500 plus IT support requests per month have gone away. IT has claimed back the time of 1.5 staff, and, more importantly, we’ve eliminated wait time and IT friction points for our users.

It’s a very simple example – but it illustrates how changing thinking can be a powerful part of delivering IT as a Service. Even before you reach anything like full game-changing digitization of IT service delivery, a shift in perspective can let you gain and build on relatively easy quick-wins.

Key Takeaways:

  •  You can make big gains with small and simple steps en route to IT as a Service;
  • Take an outside-in perspective to IT;
  • Drive for new levels of self-service (a ‘zero touch,’ customer-centric world);
  • Think about operating in a “ticket-less” world where the “help-desk phone” should never ring;
  • Measure levels of agility and responsiveness in seconds/minutes not hours/days;
  • Adopt the mindset of a service-oriented and change-responsive organization;
  • And understand that transition is evolutionary and make step-wise, evolutionary changes to get there.

To learn more about outside-in thinking for IT, view this webcast with Paul Chapman and Ian Clayton.

Follow @VMwareCloudOps and @PaulChapmanVM on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.