In my last blog article, “What is Intelligent Operations Management in the SDDC”, I discussed how intelligent operations management of the Software Defined Data Center (SDDC) can be broken down into three (3) areas of focus: Topology Scope, Data Collection and Management Functions. We also talked about what each of these areas brings to the table and why they are important. Today I want to talk about what you would be missing if you tried to shortcut your SDDC management solution by not looking at all three areas.
While many vendors are trying to simplify SDDC management, some niche virtual machine (VM) management tools today are in fact OVERSIMPLIFYING it! They believe that by avoiding bottlenecks and optimizing the datacenter, there will never be a need for troubleshooting or root cause analysis, and that customers should think of their datacenter as a black box and never attempt to understand why something happened or what steps can be taken to keep it from happening again. This is a very shortsighted approach…like sticking your head in the sand and hoping that nothing will break, or thinking what you don’t know won’t hurt you!
You see, not all problems in the SDDC can be solved by simply “adding more resources” or by “moving VMs.” Moving VMs to another server may sometimes seem like a good idea. However, if the problem lies within the network, security, or storage areas, for example, it will still plague the environment. And while it may initially look like the problem has been corrected, over time the same symptoms will reappear. In fact, sometimes moving VMs can actually exacerbate the problem and cause an even greater service outage and impact.
Let’s look at some common problems where “adding more resources” or “moving VMs” would be of very limited help.
- Resource “Black Holes” – This is a runaway job that could be anything from a process that is out of control to a poorly written query with too many nested loops. The problem is that the more resources you throw at the issue, the more it’s going to use. It’s a “black hole” of resource usage. Let’s take a deeper look using a third example: the implications of a poorly indexed database table.
A poorly indexed database table can be performant when the overall size of the data is small. However, as the writes add up over time, a once quick and seemingly simple query can reach a level of computational complexity that no amount of additional CPU and memory can help. The entire database could be in-memory, and it would not matter. The CPU would be pegged, and any added CPU would be consumed. The only way to address such a problem is to look at the core design of the database and its corresponding queries.
In the example above, vRealize Operations would notice that your memory and CPU usage were growing abnormally, and it would warn you that something is wrong well before usage hit a traditional threshold. That’s great, but it doesn’t lead you to the problem with the index itself. The Management Pack for MSSQL, however, will! It would alert you to a “Missing Beneficial Index”. The management pack can also warn you when an index is mildly or highly fragmented, never accessed, or unused but maintained. In this case vRealize Operations outperforms many other management vendors by solving for the root cause of the issue rather than simply moving resources around.
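The indexing problem described above can be reproduced in miniature. The following is only a sketch, and it uses Python's built-in SQLite rather than MSSQL; the `orders` table, its columns and the index name are made up for illustration. It shows the same query switching from a full table scan to an index search once a suitable index exists, which is a fix that no amount of extra CPU or memory would have provided.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the plan text

# Hypothetical schema and data, used only to illustrate the effect of an index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(50_000)],
)

lookup = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index on customer_id, every row has to be examined.
plan_before = query_plan(conn, lookup, (42,))

# With the index, the engine can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, lookup, (42,))

matching_rows = conn.execute(lookup, (42,)).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
print("rows returned:", len(matching_rows))
```

The exact wording of the plan text varies between SQLite versions, so treat the printed plans as indicative rather than as a fixed format.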
- vSphere/vCenter Problems – Most issues within the virtual environment cannot be solved by simply adding resources. A really simple example would be vCenter authentication failures, which would appear in the log files (e.g. why vMotions fail). In another example, one customer was able to identify a situation where a password had been changed, but the critical automated scripts had not been updated to use the new password. Had it gone unnoticed, this would have caused the customer a lot of trouble.
To solve such issues, a centralized system for vSphere log, fault and event collection across multiple vCenters is a must, and vRealize Operations Manager and Log Insight provide exactly that. It gives you the ability to alert administrators to problems based on observed patterns of logs and events, and to provide recommended fixes so the problems can quickly be resolved.
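To make the idea of alerting on log patterns concrete, here is a minimal sketch of the kind of check that would catch the stale-password scenario above: counting repeated login failures per account across several vCenters. The log lines, host names, account names and the `min_failures` threshold are all hypothetical; real vCenter and ESXi log formats differ, and this is not how Log Insight is implemented.

```python
import re
from collections import Counter

# Hypothetical log lines; real vCenter or ESXi log formats differ.
SAMPLE_LOGS = [
    "2015-03-02T10:00:01Z vc01 auth: login failed for user 'svc-backup'",
    "2015-03-02T10:00:05Z vc01 auth: login succeeded for user 'admin'",
    "2015-03-02T10:05:01Z vc01 auth: login failed for user 'svc-backup'",
    "2015-03-02T10:10:01Z vc02 auth: login failed for user 'svc-backup'",
    "2015-03-02T10:11:30Z vc02 auth: login failed for user 'jsmith'",
]

FAILED_LOGIN = re.compile(r"login failed for user '(?P<user>[^']+)'")

def repeated_login_failures(lines, min_failures=3):
    """Return {user: count} for users with at least min_failures failed logins."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group("user")] += 1
    return {user: n for user, n in counts.items() if n >= min_failures}

# A service account failing repeatedly on a schedule is a typical sign of a
# script that still uses an old password.
suspects = repeated_login_failures(SAMPLE_LOGS)
print(suspects)
```

A single mistyped password stays below the threshold, while the automated script that keeps retrying stands out.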
- Storage Issues – In many larger organizations, virtualization admins are separate from the storage admins and may not have the necessary visibility into the storage layer. When storage problems arise, like high IO latency, the first thought might be moving VMs to different storage. However, without a complete picture of the SAN environment you may move VMs and cause a bigger problem. The key to making the right decision with respect to storage is understanding what path those IOPS take to get to the backend storage. Perhaps the fibre channel switch port is overloaded because all hosts are using the same paths, or there are only a few SATA disks configured on the backend. There are also a lot of OTHER issues that won’t be fixed by simply moving VMs and that need to be resolved quickly before larger outages occur, including:
- NFS Connectivity Issues
- HBA Link Failures
- Fabric problems and path issues (e.g. Some or all paths down on mounts because HBAs are zoned out)
- Physical drive failure (e.g. NetApp disk failures)
- Magnetic Disk has too many bad sectors
- SSD has critical media wear out issues
- Thin provisioned storage LUN capacity reaching critical levels based on thin provisioning comfort levels (not a physical capacity problem…yet)
The solution here is to implement the vRealize Operations Management Pack for Storage Devices which provides visibility into your storage environment and allows you to follow the path from a VM to the VSAN. With it you can quickly identify any problem that may exist along that path, gain global visibility across Virtual SAN clusters for monitoring and proactive alerts/notifications, get proactively notified on failures, performance and compliance issues and review recommendations for a remediation strategy.
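The thin provisioning item in the list above is worth a small worked example, because it is a capacity warning that fires before any physical space has run out. The sketch below is an illustration only: the `ThinLun` fields, the 2.0 comfort ratio and the 85% physical warning level are assumptions chosen for the example, not values taken from the management pack.

```python
from dataclasses import dataclass

@dataclass
class ThinLun:
    name: str
    physical_gb: float     # capacity actually backing the LUN
    provisioned_gb: float  # total size of thin disks promised to VMs
    used_gb: float         # space actually written so far

def overcommit_ratio(lun):
    """How many times over the physical capacity has been promised."""
    return lun.provisioned_gb / lun.physical_gb

def thin_provisioning_status(lun, comfort_ratio=2.0, physical_warn=0.85):
    """Classify a LUN; both thresholds are hypothetical comfort levels."""
    if lun.used_gb / lun.physical_gb >= physical_warn:
        return "physical capacity critical"
    if overcommit_ratio(lun) >= comfort_ratio:
        return "overcommit beyond comfort level"
    return "ok"

luns = [
    ThinLun("lun-a", physical_gb=1000, provisioned_gb=2500, used_gb=400),
    ThinLun("lun-b", physical_gb=1000, provisioned_gb=1200, used_gb=900),
    ThinLun("lun-c", physical_gb=1000, provisioned_gb=1500, used_gb=300),
]

for lun in luns:
    print(lun.name, round(overcommit_ratio(lun), 2), thin_provisioning_status(lun))
```

Here `lun-a` is only 40% full, yet it is flagged because 2.5 times its physical capacity has been promised to VMs. Moving VMs around would not change that exposure.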
- Network Issues – Network issues can come in many forms and flavors. Of course, just like with storage, having a solution that can peer into the network plane and help you make intelligent decisions is a must. But that’s just the tip of the iceberg! What about hardware problems, configuration issues and capacity shortages that have nothing to do with CPU, memory or disk? For example:
- One or more PSUs are down on spine or leaf switches
- IP Header errors
- NTP configuration issues (e.g. systems pointing to the wrong NTP server)
- MTU Mismatch – MTU of one or more interfaces does not match the next hop router
- MTU Problems – Jumbo frames not enabled on all interfaces on the Leaf switch
- Identifying duplicate IPs
- IP Pool Exhausted (capacity example)
- VXLAN segment range has been exhausted (another good capacity example)
To manage this environment you need the Management Pack for Network Devices which extends the operational management capabilities of the vRealize Operations core product to the areas of physical data center networking. These extensions include data center switch discovery, data center switch health and performance monitoring, and vSphere object to object topology troubleshooting across the physical switch infrastructure and NSX logical connectivity.
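Several of the network problems listed above come down to simple consistency checks once the data has been collected in one place. The following sketch illustrates three of them (MTU mismatch, duplicate IPs and IP pool exhaustion) using only the Python standard library. The interface names, host names and addresses are invented, and this is not a description of how the management pack performs its checks.

```python
import ipaddress
from collections import defaultdict

def mtu_mismatches(links):
    """links: (iface_a, mtu_a, iface_b, mtu_b) for directly connected interfaces."""
    return [(a, b) for a, mtu_a, b, mtu_b in links if mtu_a != mtu_b]

def duplicate_ips(assignments):
    """assignments: (host, ip_string). Returns {ip: hosts} for IPs seen on 2+ hosts."""
    owners = defaultdict(set)
    for host, ip in assignments:
        owners[ipaddress.ip_address(ip)].add(host)
    return {str(ip): sorted(hosts) for ip, hosts in owners.items() if len(hosts) > 1}

def pool_utilization(first_ip, last_ip, allocated_count):
    """Fraction of an inclusive IP range that has already been handed out."""
    first = int(ipaddress.ip_address(first_ip))
    last = int(ipaddress.ip_address(last_ip))
    return allocated_count / (last - first + 1)

# Hypothetical inventory data.
links = [
    ("leaf1:eth1", 9000, "spine1:eth4", 1500),  # jumbo frames on one side only
    ("leaf2:eth1", 9000, "spine1:eth5", 9000),
]
assignments = [("vm-a", "10.0.0.5"), ("vm-b", "10.0.0.5"), ("vm-c", "10.0.0.6")]

print(mtu_mismatches(links))
print(duplicate_ips(assignments))
print(pool_utilization("10.0.0.10", "10.0.0.19", allocated_count=9))
```

None of these conditions shows up as a CPU, memory or disk shortage, which is why a tool that only rebalances VMs would never surface them.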
- Compute Issues – Compute is a little easier to get your head around compared to storage and network, but that doesn’t necessarily make it easier to manage. There are numerous ways your servers can fail you and cause you problems, including:
- Advanced Programmable Interrupt Controller (APIC) errors
- Machine Check Exceptions (MCE) showing bad DIMMs
- Non-Maskable Interrupts (NMI)
- Power supply errors (e.g. Cisco UCS)
- Chassis temperature (e.g. HP Server)
- ESXi RAM disk / inode table is full
- Host unable to let go of file lock on virtual machine
- No syslog or incorrect syslog server configured for an ESXi host
Compute visibility and management can easily be added with a management pack solution from a partner like Blue Medora. These management packs are built to manage hardware from specific vendors like Cisco, Dell, F5, etc.
All of these issues and problems above can cause outages of various degrees and none of them will be solved by “adding more resources” or by “moving VMs”. So don’t foolishly ignore the true management needs of your SDDC by depending on an oversimplified niche management tool.
The vRealize Suite provides deep root cause analysis, troubleshooting tools and automated remediation across all parts of the SDDC including compute, storage and applications across both structured and unstructured data on both Private and Public cloud environments.
For more information on intelligent operations management for the SDDC check out our video pages here: VMware Feature Walk-Throughs
Or visit the product pages for VMware’s vRealize Suite.