Everyone in operations has received the call. An app owner phones the ops team and says the same thing, again. “My app seems slow, prove to me it’s not your fault.”

Responding to what feels like a simple question, from the app owner perspective, is a massive distraction and morale killer in every IT organization. Proving innocence is not the mission of IT, but it sure takes up a lot of time. In my own unscientific survey over the last three years, backed by my own support experience, the average IT group spends 20-30% of its time answering the call.

Part of the trouble is that app owners often believe that if their application were in the cloud, they could simply look at an online dashboard and answer the question for themselves.

Seeing is believing, and visibility builds trust.

How Has Operations Changed?

Let’s take a look at how operations have changed, and at some of the reasons why trust in IT has eroded over time. Ten years ago, best practices for supporting your workloads were: Design in Day 0 (Plan) and Provision in Day 1 (Build). Before switching into operations mode, multiple practice areas applied highly governed processes, policies, and controls.

Only after the server, security, network, OS, and app teams had applied their best practices in the plan and build stages would production operations start on Day 2 (Run). After provisioning, if you were lucky, you would reclaim resources or, at least, optimize performance, availability, and cost over time at some future Day N.

Plan, Build, Run has been the basis of operational and organizational workflow and reporting structures in IT since the transition from mainframes.

Coming back to today, IT and cloud organizations are often faced with operational responsibilities that start abruptly in Day 2 – the operationalize and run stage – especially if they have decided to move to or build their applications in the public cloud. IT is being sidestepped at the design and provisioning phases, and it is inevitably also faced with responsibility for critical workloads that were born in the cloud. Public cloud risks and operational challenges magnify any existing issues, such as low visibility, increased spend, overprovisioning, scale, and performance inconsistencies, in addition to security and compliance concerns.

When thinking through your cloud delivery strategy, you have to ask yourself: Are you looking to continue with the Day Two model — where workloads go directly to the cloud and bypass classical IT controls and best practices — and then enter operations?

Can you embrace the direct to Day Two (Run) model and maintain control of performance, security, or even the cost of running the application? Or are you on another path where you need to restore Day 0 (Plan) and Day 1 (Build) and reintroduce policy governance, cost and performance management, as well as the design, build, and security best practices?


The Cloud Operating Model and Visibility

Consider what has made the public cloud providers successful. Operational teams and infrastructure planners have access to an intelligent platform that provides reliable insight and decision support on future issues. This includes hardware procurement, application and infrastructure performance, reliability, compliance, and root-cause analysis. It is baked into the platform.

Let’s take the LoB app owner’s side for a moment. Why do they keep making the call? If we are fair, we have to recognize that their cloud and mobile app experience has trained them to expect that they should be able to see for themselves what is going on. The LoB has also learned, through experience, that the app might be slow because of the infrastructure, and if they ring all the bells early, they can get the operations team working ahead of the problem.

What teams often don’t know is that the ‘how to’ of resolving the ‘prove innocence’ dilemma has been solved.

Five years ago, it took an expert user of vSphere and vRealize Operations to respond to the call. If an IT professional went to the training, used these tools regularly, and was a proficient sys admin generalist, they could handle the call easily. There were not many of these rock stars, and they were often the precious escalation resource. Not the first person that you would get on the phone.

Today the platform will do most of the heavy lifting when resolving the ‘prove innocence’ problem.

Here’s a quick run-down:

  1. vRealize Operations will automatically discover the key applications already running in your environment and create an easily understandable view of these apps.
  2. Everything essential to your application that is connected in the vSphere environment is identified and mapped. The critical path of infrastructure that uniquely runs your Java, .NET, JBoss, Oracle, SAP, or Spring app is instantly visible. This includes servers, operating systems, blades, storage, hosts, Fibre Channel, network, etc. The list of discoverable apps is far longer than what I have put here.
  3. vRealize Operations creates a dashboard specific to each App Owner’s favorite application, that can easily be customized, and it highlights the underlying critical path of infrastructure. Using drag and drop widgets, you can build a meaningful view of the health and performance of the application and send it directly to the app owners. High-level dashboards can interact with lower level, more detailed views. 
  4. Troubleshooting, if necessary, is radically different from even two years ago. Everyone who has ever called support has dreaded the situation where you finally get the senior resource on the phone who diligently works the problem, and then they’re gone. Vacation, shift change, or just the weekend interferes. vRealize Operations has introduced a Troubleshooting Workbench to accelerate resolving issues. Workbench spaces that suggest resolutions and track root cause exploration can easily be shared between administrators.

Smoke chasing, fire suppression, and firefighting are the bane of both IT and the application owner. Starting from scratch, yet again, when responding to the call, without advance visibility into critical performance and health indicators, will severely undermine trust in the operations team. The LoB wants to trust IT, and IT wants to trust its platform. VMware’s vRealize Operations radically increases app owner confidence by giving them a public cloud experience, whether the workload is in the data center or in multiple public clouds.

vRealize Operations builds on this new level of visibility by leveraging comprehensive platform automation and machine learning to prevent issues before they happen. This includes built-in root cause analytics, performance tuning, compliance remediation, issue alerting, and intelligent capacity management. This is what we are referring to with the phrase Self-Driving Operations.

Embracing ‘Day Two’ – “We are ‘Cloud First,’ but nothing’s up there.”

The App Owner who has been trained by their public cloud experience to regularly make the call may also look to a ‘Cloud First’ initiative to solve their problems. But some ‘Cloud First’ mandates turn out to be impossible. It is one thing to consolidate legacy data centers with existing high-speed connections. It is another thing to up and move your existing applications to the cloud. It is the ultimate Day 2 challenge. Many CIOs are faced with corporate mandates to get to the cloud or get out of the data center, or just a little bit of both, with running production applications.

These efforts often stall when the Chief Information Security Officer (CISO) steps in and halts the project. No CISO will approve the changes required to move sensitive applications and data if the security implications are not well understood.

One of the game-changing offerings in the VMware portfolio is vRealize Network Insight (vRNI). It is astounding how difficult it can be to securely migrate applications. There is a simple reason. Network and security teams struggle with accurately assessing the current configuration of their environments. For instance, firewalls can have thousands of rule base entries describing source, destination, port, ACLs, and other security attributes. The most honest answer I’ve ever heard as to why some organizations have so many (often redundant) rule base entries is that no one knows what they all do.

Add to that all of the configuration details bound up in physical routers, switches, load balancers, and their virtual equivalents—no wonder the CISO doesn’t want to change anything. The risk is too high.
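To make the redundant rule base problem concrete, here is a minimal Python sketch of how overlapping entries could be flagged. It is an illustration only, not how vRNI or any firewall product works: the `Rule` class, the rule names, and the addresses are all hypothetical, and real rule bases also carry protocols, port ranges, deny rules, and ordering semantics that this ignores.

```python
from __future__ import annotations

from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Rule:
    """A simplified, hypothetical allow rule (single port, CIDR endpoints)."""
    name: str
    source: str        # CIDR block, e.g. "10.1.0.0/16"
    destination: str   # CIDR block
    port: int


def covers(broad: Rule, narrow: Rule) -> bool:
    """Return True if `broad` already permits everything `narrow` permits."""
    return (
        broad.port == narrow.port
        and ip_network(narrow.source).subnet_of(ip_network(broad.source))
        and ip_network(narrow.destination).subnet_of(ip_network(broad.destination))
    )


def find_redundant(rules: list[Rule]) -> list[tuple[str, str]]:
    """List (redundant_rule, covering_rule) pairs, comparing against earlier rules."""
    redundant = []
    for i, later in enumerate(rules):
        for earlier in rules[:i]:
            if covers(earlier, later):
                redundant.append((later.name, earlier.name))
                break  # one covering rule is enough to flag this entry
    return redundant


# Hypothetical rule base with one narrower duplicate and one exact copy.
rules = [
    Rule("allow-app-to-db", "10.1.0.0/16", "10.2.0.0/24", 1521),
    Rule("allow-web01-to-db", "10.1.5.0/24", "10.2.0.0/24", 1521),
    Rule("allow-app-to-web", "10.1.0.0/16", "10.3.0.0/24", 443),
    Rule("allow-app-to-db-copy", "10.1.0.0/16", "10.2.0.0/24", 1521),
]

for name, covered_by in find_redundant(rules):
    print(f"{name} is already covered by {covered_by}")
```

Even in this toy case, two of the four entries add nothing. Scale that to thousands of entries across several devices and it becomes clear why nobody wants to be the one to delete a rule by hand.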

What you may know about vRealize Network Insight (vRNI) is that it automatically discovers how an application behaves over the network, allowing teams to introduce advanced security functionality like micro-segmentation. What is game-changing is how it supports the Cloud Operating Model. vRNI identifies the policy elements across the full range of physical and virtual network components that the application is actually using. The platform will then recommend improvements to these configurations to increase performance and security.
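The idea of going from observed application behavior to segmentation policy can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not vRNI's algorithm: the tier names, the flow records, and the `min_count` threshold are hypothetical.

```python
from collections import Counter

# Hypothetical observed flows: (source tier, destination tier, destination port).
observed_flows = [
    ("web", "app", 8080),
    ("web", "app", 8080),
    ("app", "db", 1521),
    ("web", "app", 8080),
    ("app", "db", 1521),
    ("admin", "db", 22),
]


def candidate_allow_rules(flows, min_count=2):
    """Return flows seen at least `min_count` times as candidate allow rules."""
    counts = Counter(flows)
    return sorted(flow for flow, seen in counts.items() if seen >= min_count)


# Flows seen repeatedly become candidate allow rules.
allow = candidate_allow_rules(observed_flows)
# Everything else is held back for a human to review before it is allowed or blocked.
review = sorted(set(observed_flows) - set(allow))

print("candidate allow rules:", allow)
print("needs review:", review)
```

The point of the sketch is the direction of travel: policy is derived from what the application is actually doing on the network, rather than from what someone remembers configuring years ago.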

With this level of visibility, Network Insight creates the confidence necessary across multiple teams to implement the required policy changes to move applications as part of data center consolidation or cloud migration. It is a ‘what if’ engine for change. This is how the CIO and the CISO can beat the ultimate Day 2 challenge.

Embracing ‘Day Two’ – “But I’m already in the cloud, now what?”

For LoBs or developers who have already deployed, and continue to deploy and run, directly in the public cloud, VMware’s CloudHealth offering can identify pre-existing workloads that began in Day 2 mode. These workloads can easily be discovered, managed, and optimized. CloudHealth’s market-leading analytics provide recommendations on which cloud workloads are good candidates to migrate to your private cloud or to other providers, services, or contracts based on cost savings, performance, security, or compliance criteria.

What about Day 0 (Plan) and Day 1 (Build)?

Not that there was anything wrong with Day 0 and Day 1. Design, policy and cost best practices are still critical.

In ‘Cloud Is’ – Part Three – Control, we’ll look at how the Cloud Operating Model restores the Day 0 and 1 steps, without slowing down the developers or impeding the outcomes needed by the LoB.


The full ‘Cloud Is’ series is available here: https://blogs.vmware.com/management/author/rquerin


