I’m fortunate. I get a lot of opportunities to meet with customers to discuss all of the new technologies and products that can help them build a more dynamic and flexible software-defined data center (SDDC). We go around the table talking about how these innovations can make their lives better, and without fail, at some point someone asks, “How am I going to manage this whole thing?” That can take the air out of the conversation, and people start looking around the table at each other, wondering how to move forward. This is where I grab a marker, head to the whiteboard, and start mapping out intelligent operations management for them.

While it’s true that “operations management” means something different to each person, the capabilities needed to properly monitor and intelligently manage an SDDC are not really a trade secret. SDDC operations management can be summarized into three areas of focus: Topology Scope, Data Collection, and Management Functions.

Topology Scope

The first thing to realize is that managing the SDDC is not just about managing virtual CPU and memory anymore. In the early days of virtualization, all you needed to worry about was virtual compute, and that was already difficult enough to manage. In today’s SDDC, that view is shortsighted. You need coverage for each topology layer and a way to understand the relationships between them in order to do proper correlation and resolution (see the dependency-graph sketch after the list below). If you only look at one part of the infrastructure, or don’t have a way to correlate problems across the different topology layers, you will miss issues when they arise and be notified of problems only when users call in to complain, leaving you in non-stop fire-drill mode. SDDC operations management needs a solution with more complete coverage, including:

  • Physical Compute (Dell, HP, etc.)
  • Virtual Compute (ESXi, Hyper-V, etc.)
  • Physical Network Devices (Cisco, Juniper, etc.)
  • Virtual Networks (VDS and NSX)
  • Physical Storage Devices (EMC, NetApp, etc.)
  • Virtual Storage (Datastores and vSAN)
  • OS and Applications
  • Public Workloads running in AWS, Azure, vCloud Air, etc.
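
To make that correlation point concrete, here is a minimal sketch of modeling the stack as a dependency graph so a fault at one layer can be traced to everything above it. The component names and edges are purely illustrative (my own made-up example, not from any particular product):

    # Each component maps to the component it runs on, one edge per layer.
    from collections import defaultdict

    runs_on = {
        "app-db":      "vm-db01",       # application on its VM
        "vm-db01":     "esxi-02",       # VM on a virtual compute host
        "esxi-02":     "chassis-7",     # host on a physical server
        "datastore-3": "array-emc-1",   # virtual storage on a physical array
    }

    # Invert the edges so we can walk upward from an infrastructure fault.
    supports = defaultdict(list)
    for child, parent in runs_on.items():
        supports[parent].append(child)

    def affected_by(component):
        """Everything that directly or indirectly depends on `component`."""
        impacted, stack = set(), [component]
        while stack:
            for child in supports[stack.pop()]:
                if child not in impacted:
                    impacted.add(child)
                    stack.append(child)
        return impacted

    # A fault on the physical server correlates up to the VM and its app:
    print(affected_by("chassis-7"))   # -> {'esxi-02', 'vm-db01', 'app-db'}

With relationships like these in hand, an alert on “chassis-7” can be reported as “vm-db01 and app-db are at risk” instead of as a disconnected hardware event.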

Data Collection

Next, we need to gather the data that will help us make intelligent operations management decisions. There are a lot of different types of data that can be collected, but most of it falls into these five categories:

  • Metric or “structured” data collection is obtaining data from devices and element managers on a specific time cycle (e.g., every 5 minutes, collect the many data points for CPU usage of all ESXi hosts; see the polling sketch after this list). This data is the bread and butter of management.
  • Logs or “unstructured” data collection is VERY important, as numerous problems and issues will appear only in device or application log files.
  • Faults are critical issues that can be collected in many forms. These are mostly associated with hardware problems but can apply to other components as well (e.g., vSphere fault events). Some faults come in logs (covered above), but others come from connecting to element management systems or even arrive as SNMP traps.
  • Properties or “configuration setting” collection is necessary to understand the current state of things. This includes not just simple configurations, like knowing the current Limit setting on a particular VM, but also security settings that can affect the overall safety of the environment, like leaving the DCUI service running.
  • Change awareness is the collection of changes in the environment, as these changes are often the root cause of an issue. Changes will usually surface through monitored configuration settings or log entries.
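
To make the structured-data bullet concrete, here is a minimal polling sketch built around the five-minute CPU example above. The get_cpu_usage() call is a hypothetical stand-in (a real collector would query the vSphere APIs or an element manager), so I fake the value to keep the sketch self-contained:

    import random
    import time

    POLL_INTERVAL = 300  # seconds: the five-minute cycle from the example

    def get_cpu_usage(host):
        # Hypothetical stand-in for a real collector call; a fabricated
        # value keeps this sketch runnable on its own.
        return random.uniform(10.0, 95.0)

    def poll_once(hosts):
        """One collection cycle: a timestamped metric record per host."""
        now = time.time()
        return [{"ts": now, "resource": h,
                 "metric": "cpu.usage.percent", "value": get_cpu_usage(h)}
                for h in hosts]

    if __name__ == "__main__":
        for record in poll_once(["esxi-01", "esxi-02"]):
            print(record)
        # A real collector would repeat this every POLL_INTERVAL seconds.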

Management Functions

Now that we have covered two of the three areas of focus for operations management, we can move on to the MEAT: the functions needed to ensure things are running smoothly. I talk about this one last because it builds upon and leverages the topology coverage and data collection we discussed previously.

Not all problems in the SDDC can be solved by simply “adding more resources” or by “moving a VM.” That’s a very shortsighted approach. You need visibility into the numerous other issues and a way to troubleshoot problems when they occur. In many cases, moving VMs around ignores or masks the underlying root cause, which then goes unresolved, leading to a hornet’s nest of VMs moving back and forth and wasting valuable resources. I will have a follow-up blog that provides a deeper dive into this topic, as it catches many customers, but for now let’s look at the management functions:

  • Automation is key! The ability to FIND a problem is great, but it’s 100 times better if the management solution can just fix it for you by invoking an action.  It’s faster.  It’s more efficient.  It’s less stress on you.
  • Basic performance monitoring is probably the most commonly known function, as most operations management solutions have some form of it. “Run this action when CPU goes above 85%” is a useful, albeit somewhat limited, way of managing.
  • Identifying abnormalities is an advanced form of performance monitoring. Here you are not using hard thresholds to determine performance but intelligent analytics that learn behaviors and let you know when something is acting abnormally (a simple version is sketched after this list). Nothing is going to make your phone ring quicker with a complaining user than when something “isn’t acting right.” That’s why “Run this action when the CPU is abnormally high” is a much better way of managing.
  • Capacity management in the SDDC is a basic but broad management function. The ability to know when a component is out of resources and automatically add more is important, but it’s pretty basic. You also need to understand capacity trends to predict when capacity may become a problem and avoid the upcoming stress (see the trend sketch after this list). The opposite is also true: reclaiming resources when they are not needed will help you avoid contention in the long run.
  • Change management is the ability to be cognizant of changes in the environment and make intelligent decisions about them as needed. As noted above, changes usually surface through monitored configuration settings or log entries. Two things you need to know about an SDDC: 1) it’s common knowledge that a high percentage (a majority) of the problems in any environment are caused by change, and 2) an SDDC is dynamic; it’s all about constant change. So, bottom line: change management is key.
  • Log management may seem like it was covered above, but the ability to collect many TBs of data is very different from making that data useful to you. You need a tool that can find the important messages that need to be escalated (the needle in the haystack), look for log trends, and display logs in a way that helps you understand them (a tiny pattern-matching example follows this list).
  • Compliance is a VERY big word, and I don’t want to overcomplicate things with it here. In this case, compliance means the ability to see whether the SDDC conforms not only to industry best practices but also to your desired configuration and security posture (a small posture-check sketch follows this list). If something is unsafe or harmful, you need to be aware of it and resolve it fast, before it causes something major.
  • Troubleshooting and firefighting will still exist in even the most automated SDDC environments. Yes, the goal is to eliminate as much of it as possible, but not every issue will be associated with a simple alert or solved by automating an action (e.g., moving the VM). There are times when YOU WILL need to get your hands dirty. That’s why you need a solution that lets you quickly search log entries, view current and historical metric values, bubble up HW faults, see recent changes, understand capacity, etc., all in one place. You know, troubleshooting an SDDC is actually a good topic for a follow-up blog – for another day!
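
Here is what the difference between those first three bullets looks like in practice: a hard threshold is one if statement, while a learned baseline needs some history. This is a deliberately minimal sketch; the one-day window and three-sigma band are my own illustrative assumptions, not how any particular analytics engine works:

    from collections import deque
    from statistics import mean, stdev

    class DynamicThreshold:
        """Learns a per-metric baseline and flags abnormal samples.
        Window size and the 3-sigma band are illustrative choices."""
        def __init__(self, window=288, sigmas=3.0):
            # 288 samples at 5-minute intervals is roughly one day
            self.history = deque(maxlen=window)
            self.sigmas = sigmas

        def is_abnormal(self, value):
            abnormal = False
            if len(self.history) >= 30:  # wait for a minimal baseline
                mu, sd = mean(self.history), stdev(self.history)
                abnormal = abs(value - mu) > self.sigmas * max(sd, 1e-9)
            self.history.append(value)
            return abnormal

    # Static rule:  "run this action when CPU goes above 85%"
    # Learned rule: "run this action when CPU is abnormal for THIS workload"
    cpu = DynamicThreshold()
    for sample in [42, 40, 44, 41, 43] * 10 + [60]:
        if cpu.is_abnormal(sample):
            print(f"CPU abnormal at {sample}% -> invoke remediation action")

Note that the final 60% sample would never trip an 85% rule, yet for a workload that normally sits near 42% it is exactly the “isn’t acting right” moment you want flagged.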
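
The trend half of capacity management is, at its simplest, a projection. Here is a minimal least-squares sketch that estimates days of headroom left on a datastore; real capacity engines model demand far more richly, so treat the numbers purely as an illustration:

    def days_until_full(daily_usage_gb, capacity_gb):
        """Fit a straight line through daily usage samples and project
        when the trend crosses capacity. Returns None if not growing."""
        n = len(daily_usage_gb)
        x_bar = (n - 1) / 2.0
        y_bar = sum(daily_usage_gb) / n
        num = sum((x - x_bar) * (y - y_bar)
                  for x, y in enumerate(daily_usage_gb))
        den = sum((x - x_bar) ** 2 for x in range(n))
        slope = num / den  # GB of growth per day
        if slope <= 0:
            return None  # flat or shrinking: a reclamation candidate instead
        return (capacity_gb - daily_usage_gb[-1]) / slope

    # A datastore growing ~5 GB/day toward a 2 TB (2048 GB) limit:
    usage = [1500 + 5 * d for d in range(30)]
    print(f"~{days_until_full(usage, 2048):.0f} days of headroom left")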
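
For the needle-in-the-haystack point, even the crudest version of “find the important messages” is pattern matching over severity. Here is a tiny sketch; the patterns and log lines are made up for illustration, and real log tools go far beyond keyword matching:

    import re

    # Escalation-worthy patterns; real tools rank and learn far more.
    NEEDLES = re.compile(r"\b(error|fatal|panic|timed?\s*out|lost access)\b",
                         re.IGNORECASE)

    sample_lines = [
        "2015-06-02T10:01:12 vmkernel: nfs: lost access to volume ds-3",
        "2015-06-02T10:01:13 hostd: info: task completed successfully",
        "2015-06-02T10:01:15 vmkernel: WARNING: scsi: command timed out",
    ]

    for lineno, line in enumerate(sample_lines, 1):
        if NEEDLES.search(line):
            print(f"escalate line {lineno}: {line}")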
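
And to close the loop on compliance, here is what checking a desired posture against observed settings looks like at its simplest, reusing the DCUI-service example from the properties section. The rule set is my own illustrative assumption, not an official hardening guide:

    # Desired security posture vs. observed host properties (illustrative).
    desired = {"dcui.service.running": False, "ssh.service.running": False}

    observed_hosts = {
        "esxi-01": {"dcui.service.running": False,
                    "ssh.service.running": False},
        "esxi-02": {"dcui.service.running": True,
                    "ssh.service.running": False},
    }

    def compliance_report(observed, desired):
        """Return {host: [settings that are off-posture]}."""
        return {host: [k for k, want in desired.items()
                       if props.get(k) != want]
                for host, props in observed.items()}

    for host, violations in compliance_report(observed_hosts, desired).items():
        print(f"{host}: {'compliant' if not violations else violations}")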

Here is a link to our video pages where you can learn more about Intelligent Operations Management for the SDDC.

Or visit the product pages for VMware’s vRealize Suite.