
Network Troubleshooting in a 3-2-1 Hardware Stack

By: Chuck Petrie, Blue Medora

Even as information technology trends toward simpler architecture, complexity within the hardware stack remains. IT teams are often faced with the daunting task of determining where the root cause of an issue resides and, more specifically, what is causing it. Today we will use vRealize Operations as a data aggregator to troubleshoot the network layer of a 3-2-1 hardware stack. For reference, the term “3-2-1” refers to a redundant architecture of 3 servers, 2 switches, and 1 storage array, or some variation on it (2-2-1, etc.). To begin troubleshooting, we will map the relationships of the 3-2-1 setup to determine which switches and ports are associated with the stack. Once the relationships are mapped, we will investigate performance issues associated with packet drops, network congestion, and over-utilization at the network layer.


Figure 1 – Custom 3-1-1 Dashboard

Let’s start by understanding the 3-2-1 architecture in its simplest form. The example shown in Figure 1 uses 3 Dell servers, 1 Nexus switch, and 1 NetApp storage array (a 3-1-1 variation). The idea of 3-2-1 is simple, but when trying to determine the root cause of performance issues we find that this type of infrastructure includes additional components beyond those listed above.


Figure 2 – Custom 3-1-1 Dashboard (complex)

To show how quickly a 3-2-1 architecture can become complex, we’ve built out a custom dashboard, shown in Figure 2. Inside the stack, PSUs, fans, ports, disks, and other components can each be the underlying cause of an issue. In our example today, we’ve used vRealize Operations to map the relationships through the infrastructure, showing which servers, switches, storage arrays, and components are related and how each impacts the others.
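To make the idea of relationship mapping concrete, here is a minimal Python sketch. It is not how vRealize Operations stores or traverses relationships; the inventory, the object names, and the helper functions are all hypothetical and exist only to show how a parent/child map can be walked to find everything related to a given server.

```python
from collections import deque

# Hypothetical inventory: parent object -> child objects.
# Every name here is illustrative and not taken from a real environment.
RELATIONSHIPS = {
    "dell-server-1": ["nexus-switch-1:eth1/1", "dell-server-1:psu-1"],
    "dell-server-2": ["nexus-switch-1:eth1/2"],
    "dell-server-3": ["nexus-switch-1:eth1/3"],
    "nexus-switch-1": [
        "nexus-switch-1:eth1/1",
        "nexus-switch-1:eth1/2",
        "nexus-switch-1:eth1/3",
        "nexus-switch-1:eth1/4",
        "nexus-switch-1:fan-1",
    ],
    "netapp-array-1": ["nexus-switch-1:eth1/4", "netapp-array-1:disk-1"],
}


def build_adjacency(relationships):
    """Turn the parent -> children map into an undirected adjacency map."""
    adjacency = {}
    for parent, children in relationships.items():
        for child in children:
            adjacency.setdefault(parent, set()).add(child)
            adjacency.setdefault(child, set()).add(parent)
    return adjacency


def related_objects(start, relationships, max_hops):
    """Breadth-first walk; returns {object: hops from start} within max_hops."""
    adjacency = build_adjacency(relationships)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the requested distance
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]  # the starting object is not "related" to itself
    return hops


if __name__ == "__main__":
    for name, distance in related_objects("dell-server-1", RELATIONSHIPS, 2).items():
        print(distance, name)
```

With two hops, the walk from a server reaches its switch port and then the switch that owns the port, which is the question the dashboard answers visually.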


Figure 3 – Cisco Nexus Switch Overview Dashboard (top portion)

Once we’ve identified all of the components in the stack, we begin high-level troubleshooting at the dashboard level. Using the Cisco Nexus Switch Overview Dashboard, we are able to map the Nexus switch to its ports. In the right column of Figure 3, we see all of the alerts associated with the Nexus switch and its ports, so we can quickly determine whether any issues need attention.


Figure 4 – Cisco Nexus Switch Overview Dashboard (bottom portion)

On the bottom portion of the Cisco Nexus Switch Overview Dashboard, shown in Figure 4, we are able to dive into specific metrics to determine the performance status of each port. By selecting a specific port on the switch, we see received and transmitted statistics such as traffic and packet discards.
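As a rough illustration of what a per-port discard figure means, the sketch below derives received and transmitted discard ratios from two counter samples. The `PortSample` fields and the `discard_ratios` helper are hypothetical names, not metrics exposed by the management pack. The sketch also assumes that discards are counted separately from delivered packets and that the counters did not wrap between samples.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    """Cumulative counters read from one port at one point in time (hypothetical)."""
    rx_packets: int
    rx_discards: int
    tx_packets: int
    tx_discards: int


def _ratio(packets, discards):
    # Assumption: discards are not included in the delivered packet count.
    total = packets + discards
    return discards / total if total > 0 else 0.0


def discard_ratios(previous, current):
    """Fraction of packets discarded in each direction between two samples."""
    return {
        "rx": _ratio(
            current.rx_packets - previous.rx_packets,
            current.rx_discards - previous.rx_discards,
        ),
        "tx": _ratio(
            current.tx_packets - previous.tx_packets,
            current.tx_discards - previous.tx_discards,
        ),
    }


if __name__ == "__main__":
    earlier = PortSample(rx_packets=1000, rx_discards=0, tx_packets=2000, tx_discards=10)
    later = PortSample(rx_packets=1990, rx_discards=10, tx_packets=3000, tx_discards=10)
    print(discard_ratios(earlier, later))
```

Working from deltas rather than raw counters matters because the counters are cumulative: a port that dropped packets last week but is healthy now would otherwise still look unhealthy.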


Figure 5 – All Metrics List of Nexus Switch

To drill down further into what is causing performance issues at the network layer, we look at the All Metrics tab. To investigate switch congestion, we can pull up packet errors and correlate them with aggregated port traffic. Figure 5 shows how packet drop errors relate to traffic throughput at the aggregated port level.
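The visual comparison of two metric charts can also be expressed as a number. The sketch below computes a Pearson correlation coefficient between two equally spaced series. The sample values are invented for illustration, and this is only one possible way to quantify the relationship, not something the product is claimed to calculate.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat series carries no correlation information
    return covariance / math.sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Invented samples: aggregated port throughput (Mbps) and packet drops per interval.
    throughput_mbps = [120, 340, 560, 810, 930, 400]
    packet_drops = [0, 2, 9, 31, 44, 3]
    print(round(pearson(throughput_mbps, packet_drops), 3))
```

A coefficient near 1 suggests that drops rise with throughput, which points toward congestion. A coefficient near 0 suggests the drops have another cause, such as a faulty port or cable.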


Figure 6 – Nexus Capacity Remaining

Taking the troubleshooting a step further, we can pull up the capacity remaining for this switch to ensure that memory and CPU usage have not exceeded the available resources. By expanding the Capacity Remaining tab, we are able to see the trend of each resource. At a glance we can see where we were, where we are, and where we are projected to be. Using the capacity badges, we can easily determine whether overconsumption is the cause of network latency. Furthermore, we can see when we are projected to overrun the memory and CPU of the switch based on trends.
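To illustrate what a trend-based projection involves, the sketch below fits a straight line to recent usage samples and estimates how many sampling intervals remain before a limit is crossed. This is a deliberately simple least-squares extrapolation with hypothetical function names. It is not the capacity model vRealize Operations uses, which this article does not describe.

```python
def fit_line(values):
    """Least-squares slope and intercept for samples taken at x = 0, 1, 2, ..."""
    count = len(values)
    if count < 2:
        raise ValueError("need at least 2 samples to fit a trend")
    mean_x = (count - 1) / 2
    mean_y = sum(values) / count
    spread_x = sum((x - mean_x) ** 2 for x in range(count))
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values)) / spread_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def intervals_until(values, limit):
    """Intervals after the last sample until the trend line reaches limit.

    Returns None when usage is flat or falling, and 0.0 when the trend line
    has already reached the limit.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(crossing - (len(values) - 1), 0.0)


if __name__ == "__main__":
    # Invented samples: switch memory usage in percent, one sample per day.
    memory_percent = [50, 55, 60, 65, 70]
    print(intervals_until(memory_percent, limit=100))
```

A straight line is the simplest possible model and ignores seasonality and bursts, so the result is best read as a rough early warning rather than a forecast.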

Troubleshooting a 3-2-1 stack can be simplified by determining the relationships through the stack and by identifying specific thresholds for key performance indicators (KPIs). In our exercise today, we showed that using vRealize Operations as a management and analytics engine, in conjunction with third-party management packs, streamlines troubleshooting within a 3-2-1 infrastructure stack. For more information or a free trial of the Management Pack for Cisco Nexus, visit the Product Page on Blue Medora’s website.