
Tag Archives: cloud management

6 Processes You Should Automate to Provide IT-as-a-Service

By Kai Holthaus

IT-as-a-Service (ITaaS) is one of the current paradigm shifts in managing IT organizations and service delivery. It represents an “always-on” approach to services, in which IT services are available to customers and users almost instantly, giving the business unprecedented flexibility in using IT services to enable its processes.

This brave new world requires a higher degree of automation and orchestration than is common in today’s IT organizations. This blog post describes some of the new areas of automation IT managers need to think about.

1&2) Event Management and Incident Management

This is the area where automation and orchestration got their start – automated tools and workflows that monitor whether servers, networks, storage—even applications—are still available and performing the way they should. Events, when detected, should be analyzed to determine whether they can be handled in an automated fashion, ideally before the condition causes an actual incident.

If an incident has already occurred, incident models can be defined and automated, implementing self-healing techniques to resolve it. In this case, an incident record must be created and updated as part of executing the incident model. It may also be advisable to review the number of incident models executed within a given time period, to determine whether a problem investigation should be started.

It is important to note that when a workflow makes these kinds of changes in an automatic fashion, at the very least the configuration management system must be updated per the organization’s policies.
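As a minimal sketch, an automated incident model can be thought of as a lookup from event type to a remediation action, with the incident record and the configuration management update folded into the same workflow. All names, events, and actions below are hypothetical, and the in-memory lists stand in for real ITSM systems:

```python
# Hypothetical sketch: automated event handling via incident models.
# Event names, actions, and the in-memory "systems" are illustrative only.

INCIDENT_MODELS = {
    "disk_full": lambda host: f"purged temp files on {host}",
    "service_down": lambda host: f"restarted service on {host}",
}

incident_records = []  # stands in for the incident management system
cmdb_updates = []      # stands in for the configuration management system

def handle_event(event_type, host):
    """Run the matching incident model; record the incident and the CMS change."""
    model = INCIDENT_MODELS.get(event_type)
    if model is None:
        return None  # no automated model: hand off to a human analyst
    action = model(host)
    incident_records.append({"event": event_type, "host": host, "action": action})
    cmdb_updates.append(host)  # update the CMS per the organization's policies
    return action
```

Reviewing how often each model fires over a period (here, simply the length of `incident_records`) is what feeds the decision to open a problem investigation.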

3) Request Fulfillment

Automation and orchestration tools are removing the manual element from request fulfillment. Examples include:

  • Requests for new virtual machines, databases, additional storage space or other infrastructure
  • Requests for end-user devices and accessories
  • Requests for end-user software
  • Requests for access to a virtual desktop image (VDI) or delivery of an application to a VDI

Fulfillment workflows can be automated to minimize human interaction. Such human interaction can often be reduced to the approval step, as required.

Again, it is important that the configuration management system is updated as part of these workflows, per the organization’s policies.
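A fulfillment workflow reduced to its approval touchpoint might look like the following sketch. The request fields and step names are assumptions for illustration, not any particular tool’s API:

```python
# Illustrative sketch: a fulfillment workflow where the only human
# touchpoint is the approval step. Field and step names are assumptions.

def fulfill_request(request, approved=False):
    """Fulfill a service request automatically once any required approval is given."""
    if request.get("needs_approval") and not approved:
        return {"status": "pending_approval"}
    steps = [
        f"provisioned {request['item']}",
        "updated configuration management system",  # per the organization's policies
    ]
    return {"status": "fulfilled", "steps": steps}
```

Everything except the approval decision runs without human interaction, which is the point of automating request fulfillment.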

4&5) Change and Configuration Management

Technology today already allows the automation of IT processes that traditionally require change requests, as well as approvals, implementation plans, and change reviews. For instance, virtual machine hypervisors and management software—such as vSphere—can automatically move virtual machines from one physical host to another in a way that is completely transparent to the user.

Besides automating change, the configuration management system should be automatically updated so that support personnel always have accurate information available when incidents need to be resolved.
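The point can be reduced to a toy model: the automated change and the configuration update happen as one unit, so support personnel never see stale data. The data structures below are illustrative, not a vSphere API:

```python
# Toy model: an automated change that updates the configuration
# management system in the same step. Names are illustrative only.

cmdb = {"vm-42": {"host": "esx-01"}}  # a hypothetical configuration item

def migrate_vm(vm_id, target_host):
    """Move a VM between hosts; the configuration record is updated as part
    of the change, so it never drifts from reality."""
    # The move itself is transparent to the user...
    cmdb[vm_id]["host"] = target_host
    # ...and support staff querying the CMDB immediately see the new host.
    return cmdb[vm_id]
```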

6) Continuous Deployment

The examples provided so far for automating activities in an IT organization were operations-focused. However, automation should also be considered in other areas, such as DevOps.

Automation and orchestration tools can define, manage, and automate existing release processes, configuring the workflow tasks and governance policies used to build, test, and deploy software at each stage of the delivery process. The automation can also model existing gating rules between the different stages of the process. In addition, automation ensures the correct version of the software is deployed to the correct environments. This includes integrating with existing code management systems—such as version control, testing, or bug tracking solutions—as well as change management and configuration management procedures.
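Gating rules between stages can be expressed as data, as in this sketch. The stage names and gate criteria are assumptions for illustration:

```python
# Sketch of gating rules between release stages (names and criteria assumed).

GATES = {
    "test":       lambda rel: rel["build_ok"],
    "staging":    lambda rel: rel["build_ok"] and rel["tests_passed"],
    "production": lambda rel: rel["tests_passed"] and rel["change_approved"],
}

def promote(release, target_stage):
    """Promote a release to the next stage only if its gate is satisfied."""
    gate = GATES.get(target_stage)
    if gate and not gate(release):
        return f"blocked: gate to {target_stage} not satisfied"
    release["stage"] = target_stage
    return f"{release['version']} deployed to {target_stage}"
```

Because the gate for each stage checks the release record itself, the pipeline also guarantees that only the correct, approved version reaches each environment.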

In an ITaaS model, automation is no longer optional. To fulfill the promise of an always-on IT service provider—and remain the preferred service provider of your customers—consider automating these and other processes.


Kai Holthaus is a delivery manager with VMware Operations Transformation Services and is based in Oregon.

Green vs. Grey — Rethinking Your IT Operations

By Neil Mitchell

Can you really create a new greenfield IT organization with no legacy constraints?

In this short video, operations architect Neil Mitchell explains that while anything is theoretically possible, most IT execs need to face the reality of impact on legacy IT operations.

====
Neil Mitchell is an operations architect with the VMware Operations Transformation global practice and is based in the UK.

The Business Case for Cloud Automation

Automating in the Cloud Pays Off

How to Take Charge of Incident Ticket Ping Pong

By Pierre Moncassin

When incident tickets are repeatedly passed from one support team to another, I like to describe it as a “ping pong” situation. Most often this is not a lack of accountability or skills within individual teams. Each team genuinely fails to see the incident as relevant to its technical silo. Each feels perfectly legitimate in assigning the ticket to another team, or even assigning it back to the team it took it from.

And the ping pong game continues.

Unfortunately for the end user, the incident is not resolved whilst the reassignments continue. The situation can easily escalate into SLA breaches, financial penalties, and certainly disgruntled end users.

How can you prevent such situations? IT service management (ITSM) has been around for a long while, and there are known mitigations to handle these situations. Good ITSM practice would dictate some type of built-in mechanisms to prevent incidents being passed back and forth. For example:

  • Define end-to-end SLAs for incident resolution (not just KPIs for each resolution team), and make each team aware of these SLAs.
  • Configure the service desk tool to escalate automatically (and issue alerts) after a number of reassignments, so that management becomes quickly aware of the situation.
  • Include cross-functional resolution teams as part of the resolution process (as is often done for major incident situations).
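The second mitigation above (automatic escalation after repeated reassignments) is simple to express in code. The threshold here is an assumption, and the ticket fields are illustrative:

```python
# Sketch of an auto-escalation rule for bounced tickets.
MAX_REASSIGNMENTS = 3  # assumed threshold before management is alerted

def reassign(ticket, new_team):
    """Reassign a ticket; escalate automatically once it has bounced too often."""
    ticket["team"] = new_team
    ticket["reassignments"] = ticket.get("reassignments", 0) + 1
    if ticket["reassignments"] >= MAX_REASSIGNMENTS:
        ticket["escalated"] = True  # issue an alert so management becomes aware
    return ticket
```

The rule does not stop the ping pong by itself, but it guarantees the game cannot continue invisibly.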

In my opinion there is a drawback to these approaches—they take time and effort to put in place; incidents may still fall through the cracks. But with a cloud management platform like VMware vRealize Suite, you can take prevention to another level.

A core reason for ping pong situations often lies in the team’s inability to pinpoint the root cause of the incident. VMware vRealize Operations Manager (formerly known as vCenter Operations Manager) provides increased visibility into the root cause, through root cause analysis capabilities. Going one step further, vRealize Operations Manager gives advance warning on impending incidents—thanks to its analytical capabilities. In the most efficient scenario, support teams are warned of the impending incident and its cause, well ahead of the incident being raised. Most of the time, the incident ping pong game should never start.

Takeaways:

  • Build a solid foundation with the classic ITSM approaches based on SLAs and assignment rules.
  • Leverage proactive resolution, and take advantage of enhanced root cause analysis that vRealize Operations Manager offers via automation to reduce time wasted on incident resolution.


Pierre Moncassin is an operations architect with the VMware Operations Transformation global practice and is based in Taipei. Follow @VMwareCloudOps on Twitter for future updates.

 

VMware #1 in IDC Worldwide Datacenter Automation Software Vendor Shares

Today’s VMware Company Blog announces that market research firm IDC has named VMware the leading datacenter automation software vendor based on 2013 software revenues.(1)

IDC’s report, “Worldwide Datacenter Automation Software 2013 Vendor Shares,” determined that VMware’s 2013 revenues jumped 65.6 percent over 2012, and that its market share now stands at 24.1 percent, more than 10 percentage points above the second-place vendor. Overall, the worldwide market for datacenter automation grew by 22.1 percent to $1.8 billion in 2013. Download the full IDC report here.

(1)   IDC, “Worldwide Datacenter Automation Software 2013 Vendor Shares,” by Mary Johnston Turner, May 2014

The Case for Upstream Remediation: The Third Pillar of Effective Patch Management for Cloud Computing

By: Pierre Moncassin

Patch Management fulfills an essential function in IT operations: it keeps your multiple software layers up to date, as free of vulnerabilities as possible, and consistent with vendor guidelines.

But scale that to an ever-changing environment like a VMware-based cloud infrastructure, and you have an extra challenge on your hands. Not only do the patches keep coming, but end users keep provisioning and amending their configurations. So how do you keep track of all these layers of software?

In my experience there are three pillars that need to come together to support effective patch management in the Cloud. The first two, policy and automation, are fairly well established. But I want to make a case for a third: upstream remediation.

As a starting point, you need a solid patching policy. This may sound obvious, but the devil is in the details. Such a policy needs to be defined and agreed across a broad spectrum of stakeholders, starting with the security team. This is typically more of a technical document than a high-level security policy, and it’s far more detailed than, say, a simple rule of thumb (e.g. ‘you must apply the latest patch within X days’).

A well-written policy must account for details such as exceptions (e.g. how to remedy non-compliant configurations); security tiers (which may have different patching requirements); reporting; scheduling of patch deployment, and more.
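One way to keep such a policy precise is to express it as data rather than prose, as in this sketch. The tier names, patching windows, and exception rule are assumptions for illustration, not recommendations:

```python
# Illustrative patching policy expressed as data. Tier names, windows,
# and the exception rule are assumptions, not recommendations.

PATCH_POLICY = {
    "tiers": {
        "internet_facing": {"max_days": 7,  "window": "weekly"},
        "internal":        {"max_days": 30, "window": "monthly"},
    },
    "default_max_days": 30,
    "exceptions_require": "security-team sign-off",
}

def patch_deadline_days(tier):
    """Look up how quickly a patch must be applied for a given security tier."""
    tier_policy = PATCH_POLICY["tiers"].get(tier, {})
    return tier_policy.get("max_days", PATCH_POLICY["default_max_days"])
```

A policy in this form can be reviewed by the security team like a document, yet consumed directly by the automation that enforces it.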

The second pillar is Automation for Patch Management. While the need for a patching policy is clearly not specific to Cloud Infrastructure, its importance is magnified in an environment where configurations evolve rapidly and automation is pervasive. And such automation would obviously make little sense without a well-defined policy. For this, you can use a tool like VMware’s vCenter Configuration Manager (VCM).

VCM handles three key aspects of patching automation:

  1. Reporting – i.e. verifying patch levels on selected groups of machines
  2. Checking for bulletin updates on vendor sites (e.g. Microsoft)
  3. Applying patches via automated installation

In a nutshell, VCM will automate both the detection and remediation of most patching issues.
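In generic terms, the three aspects reduce to a report / check / apply loop. The sketch below is a plain-Python illustration of that loop, not the VCM API:

```python
# Generic report / check / apply loop. This is an illustration of the
# concept, not the VCM API; data shapes are assumptions.

def patch_cycle(machines, vendor_bulletins):
    """machines: {name: set of installed patch IDs}.
    vendor_bulletins stands in for step 2, checking vendor sites.
    Returns the pre-remediation report of missing patches per machine."""
    # 1. Reporting: verify patch levels on the selected machines.
    report = {name: sorted(vendor_bulletins - installed)
              for name, installed in machines.items()}
    # 3. Applying: install the missing patches via automation.
    for name, missing in report.items():
        machines[name].update(missing)
    return report
```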

However, one other key step is easily overlooked – upstream remediation. In a cloud infrastructure, we want to remediate not just the ‘live’ configurations, but also the templates used for provisioning. This ensures that future configurations being provisioned are also compliant. Before the ‘cloud’ era, administrators who identified a patching issue might make a note to update their standard builds in the near future – but there was rarely critical urgency. In cloud environments where new machines might be provisioned, say, every few seconds, these updates need to happen much faster.

As part of completing any remediation, you also need to be sure to initiate a procedure to carry out updates to your blueprints, as well as to your live workloads (see the simplified process view above).

You need to remember, though, that remediating the images will depend on different criteria from the ‘live’ workload and, depending on the risk, may require a change request and related approval. You need to update the images, test that the updates are working, and then close out the change request.

In sum, this approach reflects a consistent theme across Cloud Operations processes: that the focus of activity is shifted upstream towards the demand side. This also applies to Patch Management: remediation needs to be extended to apply upstream to the provisioning blueprints (i.e. images).
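The whole idea fits in a few lines: remediation is applied both downstream to the live workloads and upstream to the provisioning templates. The data shapes here are illustrative:

```python
# Sketch of upstream remediation: patch the running workloads AND the
# provisioning templates (blueprints). Data shapes are illustrative.

def remediate(patch_id, live_vms, templates):
    """Apply a patch downstream (running workloads) and upstream (blueprints)."""
    for vm in live_vms:
        vm["patches"].add(patch_id)        # live configurations
    for template in templates:
        template["patches"].add(patch_id)  # future configurations stay compliant
```

Without the second loop, every machine provisioned after remediation would be born non-compliant, which is exactly the gap upstream remediation closes.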

Key takeaways:

  • Policy and automation are two well-understood pillars of patch management;
  • A less well-recognized third pillar is upstream remediation;
  • Upstream remediation addresses the compliance and quality of future configurations;
  • This reflects a common theme in Cloud Ops processes: that focus shifts to the demand side.

Follow @VMwareCloudOps and @Moncassin on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.

How to Manage Your Cloud: Lessons and Best Practices Direct from CloudOps Experts

Rich Benoit, a Consulting Architect at VMware, and Kurt Milne, VMware’s Director of CloudOps Marketing, are experts when it comes to managing cloud infrastructures. But they didn’t acquire their expertise overnight. When it comes to cloud management, the process of transitioning can take time and leave even seasoned IT pros scratching their heads, asking, “What should I do first? How do I get started?”

Join Rich and Kurt this Thursday, December 12 at 10am PT as they share the fruits of their experience as cloud managers. This webinar will dive into tangible changes that organizations need to make to be cloud-ready, including how to:

  • Introduce new, specialized roles into the equation
  • Improve event, incident, and problem management processes
  • Establish analytics to provide visibility into the cloud

Wondering what to do and how to get started with your cloud infrastructure? Register now to save your spot!

We’ll also be live-tweeting the event via @VMwareCloudOps – follow us for updates. Also join the conversation by using the #CloudOps and #SDDC hashtags. We look forward to seeing you there!

The Paradox of Re-startable Workflows: A More Efficient, Automated Process Does Not Always Mean Removing the Human Element

By: Pierre Moncassin

A chance conversation with a retired airline captain first brought home to me the paradox of automation. It goes something like this: Never assume that complete automation means removing the human element.

The veteran pilot was adamant that a commercial aircraft could be landed safely with the autopilot – but, he explained, contrary to what some people believe, that does not mean the human pilot can just push a button and sleep through the landing. Instead, it means that the autopilot handles the predictable, routine elements of the landing while the pilot plays the vital role of supervising the maneuver and reacting to any unforeseen situations.

We’ve seen a similar paradox at play in workflow automation situations faced by some of our enterprise customers. Here’s a typical scenario: A customer has deployed an automated provisioning workflow using VCO along with vCD. They have relied on VCO scripting to automate the provisioning steps so that end users can provision infrastructure just by “pushing a button.” As with the aircraft autopilot (though hopefully less life-threatening), the automated workflows work well until an unexpected situation occurs – there’s an error in the infrastructure, a component with a key dependency changes, or the key dependency itself changes.

This often means a failed workflow, and sometimes an error message that the end user struggles to interpret. After a couple of “failed workflow” experiences, the end user is quickly discouraged, user satisfaction plummets and…  need I say more?

Well, this is not what automation is supposed to be about – we want maximum user satisfaction. The missing element here is an error-recovery mechanism, one that very often involves human intervention. So how does that work?

One approach, in terms of VCO workflows, is to build error handling into the workflows. It is not possible to predict all error situations, of course, but it is possible to detect them and issue an error message to an administrator; this at least enables the interception of the condition, which may be simple to fix.
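A minimal version of that pattern, sketched here in Python rather than VCO scripting: wrap each step so that unexpected failures are intercepted and surfaced to an administrator instead of being dumped on the end user. The function and alert channel are hypothetical:

```python
# Minimal error-interception pattern (Python sketch, not VCO scripting).

def run_step(step, alerts):
    """Run one workflow step; on failure, alert an administrator and stop."""
    try:
        return step()
    except Exception as exc:
        alerts.append(f"workflow step failed: {exc}")  # for the admin, not the user
        raise  # stop the workflow at a known point
```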

A second and more advanced part of the solution is to build modular scripts – that way you fix each problem only once and, of course, make your scripts more robust and repeatable over time.

The third part of the solution is to build re-startable workflows. This essentially means giving an administrator or process owner the ability to undo steps at any point in the flow. In the case of a straightforward VM provisioning workflow, the solution might be as simple as removing the VM and automatically restarting the workflow from the beginning.

Or, it could be more complex – perhaps your resources have run out (maybe additional storage needs provisioning), or an issue arises with network settings. In these cases, you may need to troubleshoot before the workflow can re-start. But the point remains the same: A re-startable workflow gives your end users the best chance to complete their original request, rather than stare at an error message.

With error detection, you can roll back to the initial state and flag the error. Once the error is resolved, the administrator can either resume or restart from that known point with a known configuration – and with at least as much knowledge as before.
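The mechanics can be sketched as steps paired with “undo” handlers, so an administrator can roll back to a known state and later resume. This is a simplified model of the idea, not VCO:

```python
# Simplified model of a re-startable workflow: each step carries an undo
# handler, so the flow can be rolled back and resumed from a known state.

class RestartableWorkflow:
    def __init__(self, steps):
        self.steps = steps       # list of (do, undo) callable pairs
        self.undo_stack = []     # undo handlers for completed steps

    def run(self):
        # Resume from wherever the last run stopped.
        for do, undo in self.steps[len(self.undo_stack):]:
            do()                 # may raise: the flow halts at a known point
            self.undo_stack.append(undo)

    def rollback(self):
        # Undo completed steps in reverse order, returning to the initial state.
        while self.undo_stack:
            self.undo_stack.pop()()
```

For a simple VM provisioning flow, the undo handler of the “create VM” step is exactly the “remove the VM” action described above.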

Crucially, all the error and exception handling is hidden from the user. That allows the request to complete (or to at least have a better chance of completing) – making for a much better experience for the end user.

It is up to the script designers to decide how much of the error they want to share with the end users – a decision that should be made with the administrator responsible for overseeing the process and responding to exceptions. The goal, though, is to keep end users happy and blissfully unaware of error situations as long as their request is satisfied!

To reiterate my original point: however automatic these resolutions appear, they will have been the result of human intervention along the way.

Finally, as a further step towards optimum organization, I recommend looking at the broader picture of governance around the cloud-related processes. How does the resolution team interact with the Service Desk, for example? Are there policies about when to re-provision instead of repair? Is there a specific organization to manage the cloud-based services? See our whitepaper “Organizing for the Cloud” for an introduction to optimizing the whole IT organization to leverage a cloud infrastructure.  But I digress…

In summary – if you are worried that workflow failures may impact your end users:

  • Build resilience in your VCO workflows and related scripts
  • Build in mechanisms to facilitate human resolution for unpredictable situations
  • Create re-startable VCO workflows
  • Identify a process owner who has responsibility and accountability for managing exceptions and errors

Thank you to my colleague David Burgess, who helped me formulate several of the key ideas in this post.

For more, browse our blog for some of our previous posts on automation, and join our upcoming automation #CloudOpsChat on 9/18 with Andy Troup and David Crane!

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.