
Why data granularity matters in monitoring

There is a wide variety of solutions on the market, all claiming various levels of machine learning, adaptive baselines, market-driven analytics, learning algorithms, and so on. By and large they all collect data the same way: they poll vCenter for performance data via its real-time APIs and, at the same time or less frequently, poll the other aspects of vSphere (inventory, hierarchy, tasks & events, etc.), then store that data at varying levels of granularity.

With every solution collecting data from the same place, the comparison should come down to one algorithm vs. another, but that turns out not to be the case. Two factors greatly influence the analytics: frequency of polling and data retention.

Frequency of polling is pretty straightforward. Polling faster yields more data points and a better chance of catching peaks, valleys and general usage. However, faster polling comes at a cost: the performance hit of polling every X minutes or seconds (on vCenter, the data collector and the solution's database) and a huge longer-term impact on storing that data. Ideally, there should be some middle ground on collecting the data.

Most solutions poll every 15 minutes. Some can be configured to poll more often (good); unfortunately, many cannot (not so good). Those that can go lower generally stop at polling every 5 minutes for vSphere. 5 minutes seems like an eternity to anyone focused on performance monitoring and analytics. Fortunately, the vCenter API offers the ability to pull 20-second data for the last 5 minutes, which gets around most complaints. Pulling 5 minutes (300 seconds) of history at 20-second intervals yields 300 / 20 = 15 data points per metric.

One would think it is not hard to pull 15 data points per metric every 5 minutes, but every object can have dozens of metrics and properties to collect. In a smaller environment this might be doable, but at large scale it means enormous data sets to poll, forward and store. That volume can expose weaknesses in a solution's core platform, and so 15-minute or even less frequent polling is enforced.
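
To put rough numbers on that, here is a minimal back-of-the-envelope sketch. The environment size (10,000 objects with 50 metrics each) is an assumption I picked for illustration, not a measurement from any real deployment, and the function name is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(objects: int, metrics_per_object: int, interval_seconds: int) -> int:
    """Data points stored per day when every metric is sampled at a fixed interval."""
    return objects * metrics_per_object * (SECONDS_PER_DAY // interval_seconds)


# Hypothetical environment: 10,000 objects, 50 metrics per object.
OBJECTS = 10_000
METRICS_PER_OBJECT = 50

for label, interval in [("20 seconds", 20), ("5 minutes", 300), ("15 minutes", 900)]:
    total = samples_per_day(OBJECTS, METRICS_PER_OBJECT, interval)
    print(f"{label:>10}: {total:,} data points per day")
```

Under these assumptions, keeping 20-second data means storing 45 times as many data points as 15-minute polling, which is why the storage and collection platform matters as much as the algorithm.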

Even so, all of that data has to sit somewhere and be analyzed. The algorithms need historical context in order to consider ‘what might happen’ in 10 minutes, tomorrow or next week. What are the business cycles of a given application? How granularly the data is stored over time, and thus how much is available for those super smart algorithms to analyze, is key to answering those sorts of questions.

For a few, raw data is kept forever or for a configurable amount of time, ideally long enough to analyze full business cycles. Unfortunately, in most cases the data is not stored very granularly. Current data is fed in and stored at high granularity for a few days, then rolled up over time into hourly or daily chunks. In practice this works well for analyzing short-term spikes, but analyzing longer-term trends comes with a harsh penalty.

Let’s take a look at this in graphical form. I’m going to use vRealize Operations to visualize the data in the charts below. In Figure 1 we see the raw data coming in over a period of a few days at 5-minute granularity. We can see the peak value of 106% memory demand.

Figure 1.

Data Granularity set at 5 minutes


Next, if we take that same data at 1-hour intervals, Figure 2 shows a memory demand of only 57%. Some of the granularity and importance is still there, but if we wanted to base decisions on peak usage we would be doomed already!


Figure 2.

Data Granularity set at 1 hour


Lastly, Figure 3 shows the same data rolled up to daily intervals. We have completely lost the peak; in fact, the day now looks like part of a trough, with only 32% memory demand.


Figure 3.

Data Granularity set at 1 day
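
The same effect is easy to reproduce with made-up numbers. The sketch below is not how vRealize Operations or any other product rolls up data; it simply averages one synthetic day of 5-minute samples (a flat 30% baseline with a single 106% spike, values I chose for illustration) into hourly and daily buckets.

```python
from statistics import mean


def roll_up(samples, bucket_size):
    """Average consecutive, non-overlapping groups of bucket_size samples."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Synthetic day: 288 five-minute samples at 30% with one 106% spike.
five_minute = [30.0] * 288
five_minute[100] = 106.0

hourly = roll_up(five_minute, 12)   # 12 five-minute samples per hour
daily = roll_up(five_minute, 288)   # 288 five-minute samples per day

print(f"peak at 5 minutes: {max(five_minute):.1f}%")
print(f"peak at 1 hour:    {max(hourly):.1f}%")
print(f"peak at 1 day:     {max(daily):.1f}%")
```

The spike shrinks at each rollup level, and at daily granularity it is nearly indistinguishable from the baseline.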

The reality is that any time you take 180 samples per metric or property (60 minutes x 3 samples per minute = 180 samples) and roll them up into one, you are going to lose important detail. It doesn’t matter whether the rollup stores the minimum, the maximum, the average, etc. Whichever you choose, you can never go back and reanalyze the original data, so you will miss a peak, a valley, something important. From then on the data will always be some sort of average of an average, and that makes all of the data correlation, capacity planning, forecasting and reporting suspect.
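
Here is a small sketch of why a rollup cannot be undone, even one that keeps minimum, maximum and average together. The two hours of data are invented for illustration, and I use twelve 5-minute samples per hour rather than 180 to keep the example short.

```python
from statistics import mean


def summarize(samples):
    """Roll a bucket of samples up into (minimum, maximum, average)."""
    return (min(samples), max(samples), mean(samples))


def count_above(samples, threshold):
    """Number of samples strictly above the threshold."""
    return sum(1 for s in samples if s > threshold)


# Two hypothetical hours, twelve samples each (percent memory demand).
one_brief_spike = [100.0] + [54.0] * 10 + [20.0]
sustained_load = [100.0] * 5 + [40.0] + [20.0] * 6

print(summarize(one_brief_spike))
print(summarize(sustained_load))
print(count_above(one_brief_spike, 90.0), count_above(sustained_load, 90.0))
```

Both hours roll up to the same minimum, maximum and average, yet one spent a single sample above 90% and the other spent five. Once only the rollup is stored, there is no way to tell them apart.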

The last thing you want to base your next server purchase on is data averaged over long periods of time. You may end up underbuying hardware, and when those peaks come you will not be ready for them. The same goes for workload balancing: if your historical data is averaged, you may end up moving VMs multiple times instead of moving them to the right place the first time.

Make sure whatever performance monitoring, capacity management and alerting solution you use has the ability to keep that granular data for full business cycles! These are just some of the reasons I continue to encourage customers to look at vRealize Operations. It provides all of the granularity and accuracy you need to perform your job and make the right decisions the first time.


2 comments have been added so far

  1. As a vendor of performance monitoring products, I have to agree with nearly everything you write. Track granular data, and keep it around for at LEAST a business cycle. And for revenue producing or customer facing applications, you might want to consider even more granular data. Our customers regularly monitor and keep 1 second data. A lot can happen in a minute. If even a few transactions per minute exceed your latency SLAs, your users will notice but you’ll be blind. It’s really useful to go back a few days or even weeks with granular data, for effective troubleshooting. Problems are not usually reproducible on command, but often are intermittent.

    And as you imply, it isn’t just granularity of the data. It is very important that you keep the fidelity of the information as data is rolled up. Keeping minimums and maximums, instead of just the averages, makes a huge difference. And when I say “maximum”, it’s not the maximum of an average but a true maximum of every command. If you can look at data for the last business cycle, and see that the maximum response time for every command is within your SLA, you KNOW the system is healthy. Being able to show not only that the longest command was less than X, but also that 95% of all commands were less than Y, is analysis that is key to understanding how things are really performing.
