Troubleshooting Slow Virtual Machines and QoS Policies on NetApp Storage using vRealize Operations

By Bekah Suttner, Blue Medora

Virtualization administrators (vAdmins) whose environments run on NetApp storage often receive complaints from end users about virtual machines that slow down when writing large amounts of data. This slowdown is often the result of a QoS (Quality of Service) policy within the NetApp storage environment. In this blog post, we will walk through how to troubleshoot slow virtual machines on NetApp storage. To do so, we will use VMware vRealize Operations (vROps) and the vRealize Operations Management Pack for NetApp Storage, which brings our NetApp storage environment into vROps. While QoS policies can be applied to a number of NetApp objects, the following use case will focus on a NetApp volume.

Figure 1 – A view of the write latency of a virtual machine in vROps

As the vAdmin, our first step after receiving a complaint about a slow VM is to view the write latency for the VM in question, which shows us just how slowly the virtual machine is running. The view in Figure 1 above also lets us locate related objects, which will be important in the next step.

Once we have an idea of how slowly our virtual machine is running, we can begin tracking down the issue. To do this, we view the datastore related to our virtual machine and check its write rate. Using the metrics brought into vROps, we notice an apparent cap on the write rate, which has leveled out at about 1,010 KB (roughly 1 MB) per second in this case, as seen below in Figure 2. Seeing a cap this consistent is unusual, especially at such a low write rate. It indicates that something may be limiting the write rate of the datastore. To investigate further, we will need to navigate down to the storage layer.

Figure 2 – A view of the write rate of the related datastore in KB per second
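The flat ceiling in Figure 2 can also be checked numerically once the write-rate samples are exported from vROps. The following is a minimal sketch, not part of the management pack: the function name `find_write_cap`, the tolerance values, and the sample data are all hypothetical, and it assumes the samples are already available as a plain list of KB-per-second values.

```python
from statistics import mean


def find_write_cap(samples_kbps, tolerance=0.02, min_fraction=0.8):
    """Return the suspected cap in KB/s, or None if no plateau is found.

    A cap is suspected when at least `min_fraction` of the samples sit
    within `tolerance` (relative) of the highest observed write rate.
    Both thresholds are illustrative assumptions, not vROps defaults.
    """
    if not samples_kbps:
        return None
    peak = max(samples_kbps)
    if peak <= 0:
        return None
    # Samples that fall within the tolerance band just below the peak.
    near_peak = [s for s in samples_kbps if s >= peak * (1 - tolerance)]
    if len(near_peak) / len(samples_kbps) >= min_fraction:
        return mean(near_peak)
    return None


# Hypothetical write-rate samples in KB/s.
capped = [1008, 1010, 1011, 1009, 1010, 1012, 1010, 640, 1011, 1010]
bursty = [200, 1500, 90, 3200, 400, 50, 2100, 700, 120, 950]

print(find_write_cap(capped))  # a value close to 1,010 KB/s
print(find_write_cap(bursty))  # None: no consistent ceiling
```

A workload that is merely busy tends to look like the `bursty` series, while a throttled one hugs a single value, which is the pattern that points toward a limit at the storage layer.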

In Figure 2, we see a volume associated with our datastore. We navigate to view the metrics for that volume and see that there is a QoS metric group in our environment. If there is no QoS policy in place, this metric group will not appear. We expand the metric group and pull up the metrics for the maximum throughput allowed on the volume by the QoS policy and for the percent of allowed throughput used by the volume.

Figure 3 – Metrics for maximum throughput and percent of maximum throughput used

We see in Figure 3 that our maximum throughput is 1 MB per second (the same amount we saw in Figure 2) and that our volume is using 100% of its maximum allowed throughput. Since the virtual machines are writing to this volume, we know that our slow virtual machine write performance stems from a QoS policy that is capping our throughput.
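The relationship between the two metrics in Figure 3 is simple arithmetic, sketched below. The function names and the 95% threshold are hypothetical, and the example assumes 1 MB equals 1,024 KB; the management pack may round or convert units differently.

```python
def percent_of_limit_used(observed_kbps, limit_kbps):
    """Percent of the QoS maximum throughput that is currently in use."""
    if limit_kbps <= 0:
        raise ValueError("limit_kbps must be positive")
    return 100.0 * observed_kbps / limit_kbps


def is_throttled(observed_kbps, limit_kbps, threshold_pct=95.0):
    """Treat the volume as throttled when usage is at or near the limit.

    The 95% threshold is an illustrative assumption.
    """
    return percent_of_limit_used(observed_kbps, limit_kbps) >= threshold_pct


# Values from this walkthrough: about 1,010 KB/s against a 1 MB/s limit.
observed = 1010
limit = 1 * 1024  # assumes 1 MB = 1,024 KB

print(round(percent_of_limit_used(observed, limit), 1))
print(is_throttled(observed, limit))
```

When the observed write rate of the datastore matches the limit on the backing volume this closely, the QoS policy is the most likely explanation for the plateau.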

With this information, we can now improve the performance of the slow VM using one of two solutions. We can either contact our NetApp storage administrator and ask for the QoS policy to be adjusted to allow a higher write rate, or relocate the noisy virtual machines to other datastores on volumes that allow a higher write rate. This allows NetApp administrators to maximize their storage use while allowing vAdmins to ensure optimal performance for the end user.

For more information on virtualization and cloud infrastructure, visit the VMware Solution Exchange and check out the vRealize Operations Management platform. For more information on the management pack used in this solution, visit the Blue Medora vRealize Operations Management Pack for NetApp Storage product page.