VMware Aria Operations for Applications

High Cardinality queries matching millions of series with Aria Operations for Applications

As software teams modernize, moving toward microservices or changing the deployment model from VMs to containers is inevitable for all the advantages it brings. But with this change, observable metric data has grown exponentially for customers. As an example, an SRE who used to monitor 10 VM-level CPU usage metrics now has to monitor 100 (10 * 10) containers when each application is deployed as 10 containers, which makes it 100 CPU usage metrics. With the ephemeral nature of container deployments, the number of series matching a single time series query can run into the millions, based on what we have seen with our customers.

In this article, we will help you make sense of cardinality – when it is important and when it is not. We will also explore how high cardinality is an aspect of observability to be managed and optimized based on the use case. In cases where it is needed, Aria Operations for Applications (formerly Tanzu Observability) shines with its patented and proprietary query planning and indexing solutions. It can answer queries matching millions of series with latencies of seconds, and we will showcase that with data.

What is cardinality of metrics?

Definition

Cardinality is the size of the set of unique combinations (the product) of metric names, tag names, and tag values observed in a period. Put another way, it is the number of all possible combinations of unique tag/label values. In Aria Operations for Applications, this also includes the host / source name. As an example of a chart illustrating this idea, the chart below is monitored by our platform's on-call for the reported ingestion points per second emitted by our ingester services.
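To make the definition concrete, here is a minimal sketch (with hypothetical sample points, not actual ingested data) of counting cardinality as the number of unique combinations of metric name, source, and tag key/value pairs:

```python
# Minimal sketch: cardinality is the number of unique (metric, source, tag set)
# combinations observed over a period. Sample data below is hypothetical.
from typing import FrozenSet, Tuple

points = [
    {"metric": "reported.points", "source": "ec2-host-1", "tags": {"cluster": "customer1", "service": "dataingester"}},
    {"metric": "reported.points", "source": "ec2-host-2", "tags": {"cluster": "customer1", "service": "dataingester"}},
    {"metric": "reported.points", "source": "ec2-host-1", "tags": {"cluster": "customer2", "service": "dataingester"}},
]

def series_key(point: dict) -> Tuple[str, str, FrozenSet[Tuple[str, str]]]:
    # A unique series is the combination of metric name, source (host), and tag key/value pairs.
    return (point["metric"], point["source"], frozenset(point["tags"].items()))

unique_series = {series_key(p) for p in points}
print(f"cardinality = {len(unique_series)}")  # 3 unique series
```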

The above screenshot shows the plot of a single metric with the individual time series that the query matches. Each line is a unique time series emitted by a single replica of our dataingester service. The Source column represents the host (AWS EC2) names that this service is horizontally scaled across. To be clear, one time series query resolves to this many series (tag value) and host matches for the data stored in our telemetry database. The above query's response resolved to these cluster (our internal deployment model) tags and hosts (individual EC2 instances) which reported this particular metric:

Hopefully, this gives an idea of the unique series reported for this query for the metric reported.points. Evidently, each series is a combination of metric name, host (EC2 instance ID), and tag key/value pairs (cluster=customer1, service=dataingester).

When is cardinality useful and when not?

Let us take an example use case. An SRE consciously wants to monitor system-level metrics like cpu.usage, and also an HTTP success count at the service level. For the system-level metric, maintaining tags like service=checkout or pod=<podID> makes sense. But using a pod-level tag for http.200.count does not make sense: SLIs like request success counts are more important to monitor at the service level, not at a per-pod or per-container level. So cpu.usage tagged with pod=<podname> and http.status.count tagged with service=checkout is the optimal tagging for these two use cases.
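As a sketch (the emit helper below is hypothetical, not an actual Aria Operations for Applications SDK call), the tagging choices for these two use cases might look like this:

```python
# Hypothetical metrics emitter; illustrates tag granularity choices, not a real SDK.
def emit(metric: str, value: float, tags: dict) -> None:
    print(f"{metric} {value} {tags}")

# System-level metric: pod-level tags make sense here.
emit("cpu.usage", 0.72, {"service": "checkout", "pod": "checkout-7d9f4-abc12"})

# Service-level SLI: tag at the service level only; a pod tag here would
# multiply the series count without improving the signal.
emit("http.status.count", 1.0, {"service": "checkout", "status": "200"})
```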

For the specific metric illustrated above, reported.points, tagging at the service level allows us to add a query filter for that specific service at query time. So we can see the overall throughput in points per second (pps) across the horizontally scaled-out dataingester nodes using the service tag filter. Additionally, we can also break down the throughput at each node (or host) level, which lets us monitor whether the load is evenly handled by all the scaled-out nodes.
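A small sketch of that query-time pattern, with hypothetical per-host sample rates, shows both the overall and the per-host view from the same service tag filter:

```python
# Sketch: aggregate reported.points per second across hosts for one service,
# and also break it down per host. Sample numbers are hypothetical.
from collections import defaultdict

samples = [
    {"source": "ec2-host-1", "tags": {"service": "dataingester"}, "pps": 120_000},
    {"source": "ec2-host-2", "tags": {"service": "dataingester"}, "pps": 118_500},
    {"source": "ec2-host-3", "tags": {"service": "dataingester"}, "pps": 121_300},
]

per_host = defaultdict(float)
for s in samples:
    if s["tags"].get("service") == "dataingester":  # query-time tag filter
        per_host[s["source"]] += s["pps"]

print("overall pps:", sum(per_host.values()))
print("per-host pps:", dict(per_host))  # is load evenly spread across nodes?
```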

To revisit the first hypothetical example, monitoring cpu.usage at the pod level is a case where high-resolution tagging makes sense. The anti-pattern would be maintaining the HTTP success count at the pod level, which adds unnecessary time series when querying by that service name.

High-level guideline: when doing needle-in-a-haystack troubleshooting or proactive monitoring, users should think hard about when they would need a series at the service level, the deployment level (pod, host, etc.), or some other dimension, and about the cardinality implications at query time based on their query use cases.

User-level tags (e.g., userID=xyz) are typically not encouraged for the metrics pillar. They lead to a cardinality explosion with thousands or millions of tag value combinations, especially when combined with all the other tags coming in alongside userID.

We encourage customers to audit their tagging patterns periodically to clean up any suboptimal tags. We have noticed with our customers that tag values creep and explode over time as more apps and infrastructure get deployed with a potentially high cardinality tag that was not as bad when it was first introduced.
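One way such an audit could be sketched (the inventory source and threshold here are assumptions, not a built-in feature) is to count distinct values per tag key and flag the ones that are creeping toward an explosion:

```python
# Sketch of a periodic tag audit: count distinct values per tag key to spot
# tags whose value sets are growing toward a cardinality explosion.
from collections import defaultdict

def audit_tags(series_list, threshold=1000):
    values_per_key = defaultdict(set)
    for series in series_list:
        for key, value in series["tags"].items():
            values_per_key[key].add(value)
    # Any tag key exceeding the threshold is a candidate for cleanup.
    return {k: len(v) for k, v in values_per_key.items() if len(v) > threshold}

# Hypothetical inventory where a userID tag has crept in:
suspects = audit_tags(
    [{"tags": {"service": "checkout", "userID": f"user-{i}"}} for i in range(5000)],
    threshold=1000,
)
print(suspects)  # {'userID': 5000} -> flag for cleanup
```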

Cardinality’s cost

Storage

Each host and each tag key/value emitted by a source application occupies storage space with its own data point stream. More specifically, in the above illustration we added the tags 'cluster' and 'service'. Each combination of cluster and service values emitted for the metric reported.points occupies storage in the TSDB (time series database). The data points are compressed using contemporary delta-of-delta encoding, but the overarching point remains: more unique combinations mean more stored streams.
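For intuition, here is a minimal sketch of delta-of-delta encoding applied to timestamps; real TSDB implementations bit-pack the results, but the idea is the same: regularly reported series reduce to long runs of small numbers that compress very well.

```python
# Minimal sketch of delta-of-delta encoding for timestamps: store the first
# timestamp, the first delta, then only the change in delta (usually zero
# for a fixed reporting interval).
def delta_of_delta_encode(timestamps):
    if len(timestamps) < 2:
        return list(timestamps)
    first_delta = timestamps[1] - timestamps[0]
    encoded = [timestamps[0], first_delta]
    prev_delta = first_delta
    for prev, curr in zip(timestamps[1:], timestamps[2:]):
        delta = curr - prev
        encoded.append(delta - prev_delta)  # mostly zeros for steady intervals
        prev_delta = delta
    return encoded

print(delta_of_delta_encode([1000, 1060, 1120, 1180, 1241]))
# [1000, 60, 0, 0, 1] -> small numbers that compress very well
```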

Query performance

The metrics system maintains different families of indices: metric-to-tags, source, and reverse indices are examples. For example, we can look up all metrics reported for a tag value (e.g., in the above illustration, service=dataingester); this is done using reverse indices. Or we can look up the hosts reporting a metric. These indices are cached heavily in main memory for fast query planning. That said, if a query matches millions of series, there is a higher chance of a cache miss that forces a database load of the indices, which can affect query latency.
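As a rough sketch of the reverse index idea (an illustrative in-memory map, not the actual index implementation):

```python
# Sketch of a reverse index: for each tag key/value, keep the set of metrics
# that report it, so a filter like service=dataingester resolves to candidate
# metrics without scanning everything. Data here is hypothetical.
from collections import defaultdict

reverse_index = defaultdict(set)

def index_series(metric: str, tags: dict) -> None:
    for key, value in tags.items():
        reverse_index[(key, value)].add(metric)

index_series("reported.points", {"cluster": "customer1", "service": "dataingester"})
index_series("cpu.usage", {"cluster": "customer1", "service": "dataingester"})

# Lookup: all metrics reported with service=dataingester
print(reverse_index[("service", "dataingester")])  # {'reported.points', 'cpu.usage'}
```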

Query Cardinality Limits

Aria Operations for Applications’s hard limit on this index size per query is currently about 10 million query keys at query time. Our experience shows that hitting that much cardinality usually means the customer has some bad tags (or labels) in the application being monitored (even in container environments), or that the query is not providing enough tag filters to narrow the time series search, in which case the user has to add more filters.
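Conceptually, such a limit acts as a fail-fast guard during query planning; the sketch below is an assumption about how one might express it, not the product's actual enforcement code:

```python
# Sketch of a per-query guard against runaway cardinality: if planning resolves
# more query keys than the limit, fail fast and ask for more filters.
# The 10 million figure mirrors the limit described above; the API is hypothetical.
QUERY_KEY_LIMIT = 10_000_000

def check_query_cardinality(matched_query_keys: int) -> None:
    if matched_query_keys > QUERY_KEY_LIMIT:
        raise ValueError(
            f"query matches {matched_query_keys} series (> {QUERY_KEY_LIMIT}); "
            "add tag filters (e.g. service=..., cluster=...) to narrow the search"
        )

check_query_cardinality(4_000_000)     # fine
# check_query_cardinality(12_000_000)  # would raise
```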

How much cardinality does Aria Operations for Applications support?

What are query keys?

Query keys are the closest primitive to a unique time series in the metrics storage layer’s data model. At the end of the query planning phase for time series (ts in WQL) queries, the output is a set of query keys. These query keys are the candidates to be scanned from our telemetry storage before aggregation functions, such as percentile, are applied. A query key is an abstraction in our code base which defines a unique metric, host, and set of unique tag key/value pairs. The total number of unique query keys is the total cardinality for the given query.
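A rough sketch of the query-key idea (the dataclass below is illustrative, not our actual internal class):

```python
# Sketch of a query key: a unique metric, host, and set of tag key/value pairs.
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class QueryKey:
    metric: str
    host: str
    tags: FrozenSet[Tuple[str, str]]

keys = {
    QueryKey("reported.points", "ec2-host-1",
             frozenset({("service", "dataingester"), ("cluster", "customer1")})),
    QueryKey("reported.points", "ec2-host-2",
             frozenset({("service", "dataingester"), ("cluster", "customer1")})),
}
# The number of unique query keys after planning is the query's total cardinality.
print(f"query cardinality = {len(keys)}")  # 2
```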

Cardinality spread of production queries

The empirical data we have from our production queries is the best way to understand how much cardinality Aria Operations for Applications supports for metric queries. Most user queries match on the order of thousands of series.

Here is a screenshot from one of our largest and most demanding customers with high cardinality queries. This histogram shows their queries’ cardinality distribution in production. At their peak, they hit about 4 million query keys.

In our internal load tests, we have plotted how large a single index can grow (for example, a tag index mapping a tag value to all the metrics matching it is one index in our system). Here is a point plot of that spread:

This shows that, in the worst case, index lookups in our load tests average in the millions, with the highest levels going just above 5 million. These queries still pass our regression tests within the worst-case 3-minute query timeout. Additionally, keep in mind that this plot shows the cardinality of just one index; queries can scan more than one index from the index caches, and most ts queries load thousands of indices to be satisfied.

Proprietary and patented areas of the query engine supporting high cardinality queries

  1. Dynamic query planning. Dynamic query planning allows for large optimizations by dynamically routing index lookups to the optimal indices. This helps narrow the time series that match the query very quickly. Specifically, for different query patterns, the planner may consult a different sequence of index caches to reduce the time taken to narrow down the query keys it finally arrives at (see the sketch after this list).
  2. Streaming architecture. The query engine is built to stream data all the way from telemetry storage to the front-end app using the Server-Sent Events (SSE) protocol. This gives a close-to-real-time experience with blazing fast, responsive queries even for large cardinality queries.
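As a rough intuition for the first point (not the actual planner, whose heuristics are proprietary), dynamic routing can be thought of as consulting the most selective index first so the candidate set of query keys shrinks as early as possible:

```python
# Sketch of dynamic index routing: consult the most selective index first.
# The indices and their estimated sizes below are hypothetical.
def plan_index_order(filters: dict, index_sizes: dict) -> list:
    # filters: tag key -> tag value from the query; index_sizes: estimated
    # number of series behind each index. Smallest (most selective) goes first.
    return sorted(filters, key=lambda key: index_sizes.get((key, filters[key]), float("inf")))

filters = {"service": "dataingester", "cluster": "customer1"}
index_sizes = {("service", "dataingester"): 5_000, ("cluster", "customer1"): 2_000_000}
print(plan_index_order(filters, index_sizes))  # ['service', 'cluster']
```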

For more on this topic, check out this video.

What is coming up with the metrics pillar?

As we have experienced our customers’ pain points, we see a need to help customers understand their tagging patterns and cardinality shapes. This will help them troubleshoot slow queries as well as keep their tagging patterns efficient by avoiding the bloated-tag problem. We are envisioning and building this cardinality management tool next. We will follow up in the next blog in this series as we bring this to our customers.

Until then, check out Aria Operations for Applications’s metrics / tracing / logs offering here.