Better Visibility with New vSAN Metrics in vR Ops 7.0

Infrastructure analytics helps provide meaning and insight in the effort to make smarter decisions in the data center. Yet it can only do so if the right data is measured, from the right location, and in the right way. Infrastructure analytics has historically been difficult in part because these three data collection variables have been inconsistent. This is one of the many reasons why vRealize Operations (vR Ops) excels at data center visibility and infrastructure analytics. vR Ops provides a common monitoring and management substrate that fetches the right data from a common data collector: the hypervisor.

vRealize Operations and vSAN continue in the march forward to provide all new levels of intelligence. The introduction of vR Ops 7.0 has given data center administrators running vSAN full awareness of vSAN stretched clusters. The improved visibility doesn’t stop there. Let’s look at a few more improvements introduced into vR Ops 7.0 that will be of interest to administrators running environments powered by vSAN.

Resynchronization Status

The latest edition of vRealize Operations improves visibility into resynchronization activities. One way of doing this is to provide a simple status indicator that allows for at-a-glance checks to see if resynchronizations are currently running on a cluster. Resynchronizations are a per-cluster activity, and as Figure 1 shows, an administrator can quickly see the status on the respective clusters.

Figure 1. Resynchronization status of multiple vSAN clusters

Resynchronization Burn Down Rates

While resynchronization status indicators are nice, they are limited to the current condition of the cluster. vR Ops 7.0 now provides burn down rates for resynchronization activity over the course of time. Measuring a burn down rate provides context that can be difficult to glean from the simple resynchronization throughput statistics found in the vSAN performance service in vCenter. A burn down graph for resynchronization activity provides an understanding of the extent of data queued for resynchronization, how far along the process is, and a trajectory toward completion. Most importantly, it measures this at the cluster level, eliminating the need to gather this data on a per-host basis to determine the activity across the entire cluster.

vR Ops renders resynchronization activity in one of two ways: (1) total objects left to resync, and (2) total bytes left to resync. The simple dashboard shown in Figure 2 illustrates both, captured after several VMs had their storage policy changed from RAID-1 mirroring to RAID-5 erasure coding.

Figure 2. Resynchronization burn down rates for objects, and bytes remaining

When paired together, the “objects remaining” and “bytes left” graphs can help us understand the correlation between the number of objects to be resynchronized and the rate at which the data is being synchronized. For example, in Figure 2, we can see that vSAN resynchronizes multiple objects concurrently and reduces the amount of data to be resynchronized at a near-linear rate. As it completes the resynchronizations, and there are fewer objects to be resynchronized in parallel, the rate of reduction (in “bytes left”) slows slightly, which is expected behavior. This slight reduction in throughput would be difficult to explain if one were to use only the resynchronization throughput statistics found in vCenter.
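The relationship between “bytes left” and a trajectory toward completion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ResyncSample` type, the function names, and the sample values are hypothetical, and the code does not call any vR Ops or vSAN API. It assumes you already have periodic samples of the two burn down metrics, however you obtained them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResyncSample:
    """One hypothetical cluster-level sample of the two burn down metrics."""
    timestamp_s: int    # seconds since the start of observation
    bytes_left: int     # total bytes left to resync
    objects_left: int   # total objects left to resync


def burn_down_rates(samples: List[ResyncSample]) -> List[float]:
    """Bytes per second removed from the queue between consecutive samples."""
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        elapsed = cur.timestamp_s - prev.timestamp_s
        if elapsed <= 0:
            raise ValueError("samples must be in strictly increasing time order")
        rates.append((prev.bytes_left - cur.bytes_left) / elapsed)
    return rates


def estimate_seconds_to_completion(samples: List[ResyncSample]) -> Optional[float]:
    """Naive projection: remaining bytes divided by the most recent rate.

    Returns None when there is too little data or the queue is not shrinking.
    """
    if len(samples) < 2:
        return None
    latest_rate = burn_down_rates(samples)[-1]
    if latest_rate <= 0:
        return None
    return samples[-1].bytes_left / latest_rate


# Made-up samples taken five minutes apart (illustrative values only).
samples = [
    ResyncSample(0, 900_000_000_000, 40),
    ResyncSample(300, 600_000_000_000, 28),
    ResyncSample(600, 300_000_000_000, 14),
    ResyncSample(900, 150_000_000_000, 4),
]

rates = burn_down_rates(samples)
eta = estimate_seconds_to_completion(samples)
print(rates)
print(eta)
```

In this made-up data the rate holds steady while many objects are still in flight and drops in the last interval as the object count falls, which mirrors the slight slowdown described above. Projecting from only the most recent rate is a deliberate simplification; a smoothed or averaged rate would be less sensitive to a single noisy sample.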

Observing rates of completion with these burn down graphs can also help you better understand how Adaptive Resync in vSAN dynamically manages resynchronization rates during periods of contention with VM traffic. These charts can easily be combined with VM latency graphs to see how vSAN prioritizes different types of traffic during these periods of contention.

Burn down graphs can provide insight when comparing resynchronization activities at other times, or in other clusters. For example, Figure 3 shows resynchronization burn down activity occurring over a larger time window. Here we can see that the amount of resynchronization activity was very different during the periods that resynchronizations occurred.

Figure 3. Comparing Resync activity – Viewing burn down rates across a larger time window

The two resynchronization events highlighted in Figure 3 involved different numbers of VMs that had their policies changed, which is the reason for the overall difference in the amount of data resynchronized.

VMkernel Statistics

Ensuring proper visibility into network connectivity is important, especially in hyperconverged environments. vR Ops 7.0 introduces visibility into the same VMkernel statistics that can be found in the vSAN performance service. As shown in Figure 4, one can create dashboards that track packet loss rates, packets per second, or traffic throughput at the VMkernel level. To understand why packet loss rates are so critical to your environment, see “Reliable Network Connectivity in Hyperconverged Environments” on StorageHub.

Figure 4. New VMkernel statistics available in vR Ops 7.0
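To make the idea of a packet loss rate concrete, here is a minimal Python sketch of how such a rate could be derived from two readings of cumulative counters. It is an assumption for illustration, not a description of how vSAN or vR Ops computes the metric: the `VmkCounters` type and its fields are hypothetical, and `packets` is assumed to count every packet handled, including those later dropped.

```python
from typing import NamedTuple


class VmkCounters(NamedTuple):
    """Hypothetical cumulative counters read from a VMkernel interface."""
    packets: int   # all packets handled so far, including dropped ones (assumed)
    dropped: int   # packets dropped so far


def loss_rate_percent(prev: VmkCounters, cur: VmkCounters) -> float:
    """Percentage of packets dropped between two counter readings."""
    packets = cur.packets - prev.packets
    dropped = cur.dropped - prev.dropped
    if packets < 0 or dropped < 0:
        raise ValueError("counters went backwards; they may have been reset")
    if packets == 0:
        return 0.0  # no traffic in the interval, so nothing was lost
    return 100.0 * dropped / packets


# Made-up readings taken at the start and end of one interval.
earlier = VmkCounters(packets=1_000_000, dropped=10)
later = VmkCounters(packets=1_200_000, dropped=210)
print(loss_rate_percent(earlier, later))
```

Working from the difference between two readings, rather than from the raw cumulative totals, keeps an old burst of drops from skewing the rate reported for the current interval.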

While the new metrics introduced in vR Ops 7.0 are not yet incorporated into the four built-in dashboards included in vR Ops Advanced and above, users can easily create custom dashboards that incorporate these new levels of visibility. As a reminder, custom dashboards that you or your team create can be shared by uploading them to the VMware Sample Exchange. If you have built a great dashboard, the Sample Exchange is the place to make it available to others.

Conclusion

VMware continues to make great progress on improving the levels of integration between vSAN and vR Ops. Yet the benefit of these improvements goes far beyond any discrete metric introduced. The power comes from combining the right infrastructure data in a way that offers all new levels of insight to help you see what you might otherwise be missing. As your environment and topology evolve to incorporate the edge, the core, and the cloud, having the right foundation in place for optimal visibility and infrastructure analytics is the key to agile, smart decision making in the data center.

@vmpete
