The introduction of the vSAN performance service gave administrators an easy way to view key performance metrics related to vSAN. First introduced in vSAN 6.2, and built right into the vSphere web client, the performance service provides a simple, integrated way to look at current and historical metrics of all vSAN-related activities. This continuous collection of metrics is important, as it provides the proper historical context to identify steady state behavior and performance abnormalities in an environment. Run-time tools like vSAN Observer were not well suited for a continuous gathering of data.
As the sophistication and feature set of vSAN grows, so do the capabilities of the performance service. These ongoing improvements will help the performance service become the primary source for gathering vSAN performance metrics, for viewing in vCenter, or in other applications that use APIs for integration, such as vRealize Operations. This post looks at the additional set of metrics introduced in vSAN 6.6 specific to resynchronization activity, and why it matters. We'll also show how integration with vRealize Log Insight can be used to your advantage to review resync activity across a broader time period.
The performance service is unique in that it presents metrics for backend storage system related activity as well as metrics related directly to I/Os for a VM. An example of backend activity would include the resynchronization of data that comes as a result of balancing the storage system, or perhaps the rebuilding of objects in order to meet policy compliance. See "Intelligent Rebuilds in vSAN 6.6" on StorageHub for more information on the types of backend activity that might occur in a vSAN environment.
New levels of granularity
Performance information on resynchronization activity existed in past editions of vSAN, but only at the host level, under "vSAN - Backend." While this was useful, the performance service was unable to render resync activity at the disk group level. Monitoring resync activity at the granularity of disk groups is important for several reasons.
- Accurately identifying the source and target of resync activity. With granularity at the disk group level, you can distinguish where reads are coming from and where writes are going. Previously, with hosts using multiple disk groups, viewing resync activity at the host level could show reads and writes, but did not distinguish which disk group was demanding that I/O.
- Distinguishing activity when adding multiple disk groups. It is quite common for additional disk groups to be added to scale up capacity or performance inside of a host, but measuring at the host level would always generate aggregate statistics for the host, and obscure helpful information about the resync activity of a specific disk group.
- Visibility of resync activity among asymmetrical disk groups. Easily identify whether the performance of resync activity differs on a disk group using fewer drives, drives with less capacity, or perhaps drives with lower performance specifications than other disk groups.
With vSAN 6.6, resync IOPS, resync throughput, and resync latency can all be tracked at the disk group level, as shown in Figure 1.
Figure 1. Resync IOPS and throughput at the disk group level.
Note how vSAN is able to distinguish between the different types of resynchronization traffic, and present the performance statistics accordingly. This resync data is broken into the following categories.
- Policy Change Read
- Evacuation Read
- Rebalance Read
- Repair Read
- Policy Change Write
- Decommission Write
- Rebalance Write
- Repair Write
Not only does the performance service present IOPS, throughput, and latency for resync data, it breaks them down by the type of resync activity that is occurring on the disk group, as well as by whether the I/Os are reads or writes. Reads and writes can impact storage differently, and this distinction is especially important now that resync data can be viewed at the disk group level.
The performance service in vSAN will allow a user to view performance data for the previous hour, up to a maximum of 24 hours. What is not entirely obvious is that in most cases, the performance service will retain this data for up to 90 days - visible by using a custom defined time range, but no greater than a 24-hour time window. Limiting the viewable window to 24 hours minimizes resource usage, while the 90-day retention period provides flexibility in looking at past performance behaviors.
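To review a period longer than 24 hours, the range has to be walked one window at a time. The following is a minimal sketch of that arithmetic only, assuming the 24-hour window and 90-day retention described above. The function name `split_into_windows` and the example dates are hypothetical, and the code does not call any vSAN or vCenter API.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(hours=24)  # largest viewable window, per the text above
RETENTION = timedelta(days=90)    # typical retention period, per the text above


def split_into_windows(start, end, now,
                       max_window=MAX_WINDOW, retention=RETENTION):
    """Split [start, end) into consecutive windows no longer than max_window.

    The requested range is first clipped to [now - retention, now], since
    data outside the retention period is assumed to be unavailable.
    """
    start = max(start, now - retention)
    end = min(end, now)
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + max_window, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


# Arbitrary example dates (the year is an assumption for illustration).
now = datetime(2017, 6, 9)
for window_start, window_end in split_into_windows(
        datetime(2017, 6, 2), datetime(2017, 6, 4, 12), now):
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each resulting pair could then be entered as a custom time range in the performance view, one window per pass.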
Using vRealize Log Insight and the vSAN performance service to look at resync activity
When looking at the performance of resync activity over a period of time, how can a time window of interest beyond the previous 24 hours be easily identified? This is where Log Insight comes into play. Not only does Log Insight show a visual historical record of resync activity, but it can also provide context as to why that resync activity might have happened. Log Insight is limited to reporting events as a result of log entries, and does not have an understanding of actual resource usage. But for this purpose, it is a great tool to help determine time periods showing activity of interest.
As shown in Figure 2, Log Insight paired with the content pack for vSAN will present an "Object component state - resyncing" widget found in the Object Events dashboard. Figure 2 shows resync activity over the period of one week, with resync activity occurring on host ESX04 (in yellow) on June 2nd.
Figure 2. Resync activity reported in Log Insight dashboard - 1 week view.
The "Interactive Analytics" view in Log Insight can be used for viewing more detail. As shown in Figure 3, the time window was reduced to a 12-hour period on June 2nd. This clearly shows resync traffic between the hours of 8:00am and 10:00am. Note that the Y-axis on this chart quantifies how many components are resyncing. It does not indicate the size of the component, and thus, the effort that is needed to resync it. The "effort" will show up in the form of "Resync IOPS" and "Resync Throughput" when we look at this resync data in the vSAN performance service.
Figure 3. Log Insight Interactive Analytics - 12-hour window.
The next step is to correlate the resync activity shown above with the statistics in the performance service in vCenter. In this case, they can be found in vCenter by selecting host ESX04 and clicking on Monitor > Performance > vSAN - Disk Group. The time window is changed to reflect the same 12-hour time window defined in Log Insight. This view presents various resync metrics, as shown in Figure 4. These metrics help us to understand:
- The type of resync activity that was generating load.
- The amount of resync activity (in the form of IOPS, and throughput).
- The I/O type (reads or writes).
- The latency of those resync I/Os.
- The overall duration of the resync activity.
As shown in Figure 4, the resync activity identified in Log Insight stems from "Rebalance Reads" on host ESX04 for a period of about 90 minutes, with peak IOPS around 410, and peak throughput around 25 MBps.
Figure 4. Statistics of resync activity after identifying activity in Log Insight.
Sometimes the resync activity shown in Log Insight can look very different from the resync activity shown in the performance graphs. This is because they are measuring very different things: logged events versus the movement of payload. In practice, the resyncing of large components will show up much more prominently in the performance graphs than the resyncing of small components. Using the tools together can help you understand this activity more clearly.
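The difference between counting events and measuring payload can be illustrated with a small sketch. The data below is invented purely for illustration (it is not taken from the figures), and `summarize` is a hypothetical helper: it tallies, per hour, how many components resynced and how much data they represent.

```python
# Hypothetical resync events as (hour of day, component size in MB).
# Invented values for illustration only, not measured data.
events = [(8, 512), (8, 256), (8, 128), (9, 204800)]


def summarize(events):
    """Return (component count per hour, payload in MB per hour)."""
    count = {}
    payload_mb = {}
    for hour, size_mb in events:
        count[hour] = count.get(hour, 0) + 1
        payload_mb[hour] = payload_mb.get(hour, 0) + size_mb
    return count, payload_mb


count, payload_mb = summarize(events)
print("components per hour:", count)
print("payload (MB) per hour:", payload_mb)
```

In this made-up case, a count-based chart would peak in the 8:00 hour with three components, while a throughput-based chart would be dominated by the single large component in the 9:00 hour.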
Adding resync activity at the disk group level improves the usefulness of the performance service by more accurately identifying the type of resynchronization traffic that is occurring, and what it might be impacting. These metrics can be easily correlated with congestion metrics, or latency metrics on VMs, to see if resync activity is having any real impact on the running workloads. In future posts, we will look at the other new metrics included in the performance service in vSAN 6.6.