
Debug p95 latency with Distributed Tracing

Aria Operations for Applications (AOA, formerly Wavefront) is a modern cloud-native observability SaaS platform that offers all three pillars of observability: logs, metrics, and traces. In this technical blog post, we look at distributed tracing, the pillar best suited to observing and analyzing events or transactions that span services or infrastructure. We will explain how we dog-food our tracing offering for AOA's own engineering and development efforts. Specifically, we'll look at how we use traces to debug p95 latency queries (the slowest requests that users experience) and to investigate performance bottlenecks so we can optimize specific code paths and improve latency. This work was done during an architecture change to how we store and query time series indices in the platform.

Prerequisites

What are traces?

A trace captures a single unit of work or transaction in a service. The developer defines this unit of work based on the observability requirements of the service. The transaction may span many microservices, call a third-party API on another platform, or stay entirely within a single service that has computationally heavy internal methods or database calls. Most of the messaging and use cases around distributed tracing focus on troubleshooting inter-service calls. However, it is just as useful for tracking latency within a service, in terms of time spent in individual methods or errors encountered along the way. For inter-service calls, distributed tracing uses context propagation to tie together all the spans (a span is a single operation within a trace) that belong to one trace. Aria Apps Distributed Tracing supports OpenTelemetry-based context propagation.
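As a concrete illustration, here is a minimal sketch of creating a span with the OpenTelemetry Java API, the kind of instrumentation AOA can ingest. The class, method, scope, and attribute names are ours, chosen for illustration only:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderHandler {
    // The instrumentation scope name is an arbitrary, illustrative choice.
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("shopping-portal");

    void placeOrder(String customerId) {
        // Each unit of work becomes a span; OpenTelemetry propagates the trace
        // context so downstream calls made inside this scope join the same trace.
        Span span = tracer.spanBuilder("OrderHandler.placeOrder").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("customer.id", customerId); // custom tag on the span
            // ... business logic, database calls, downstream service calls ...
        } finally {
            span.end();
        }
    }
}
```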

In this article, we will explore the utility of tracing within a single service, specifically for analyzing and optimizing the p95 tail of latency.

How are traces different from metrics or logs?

Unstructured logs are the oldest pillar used for monitoring computer programs. Structured logs are a drastic improvement, since they can be ingested and indexed on arbitrary key-value pairs (tags added to log lines). They help answer questions about errors or about specific log lines written by developers. Logs, however, typically do not help with performance problems; for example, they cannot tell you why one transaction took longer than others. When analyzing latencies, developers ask questions like: why was the p95 of a specific large customer higher last week than the week before? For questions like this, logs are limiting. This sort of analysis requires transaction-level visibility that can be broken down into specific operations (we call these spans in tracing) and show how each contributes to the overall latency.

Metrics, on the other hand, are really helpful for statistical measurement and monitoring of different aspects of a service. For example, counters (e.g., a count of HTTP 200 responses) maintain running totals, and histograms help in understanding latencies (e.g., the p50/p95 latency of a buyBook HTTP GET method). Metrics are great for dashboarding and monitoring a service.

Traces play an important role when analyzing a sample transaction (e.g., an HTTP request). They provide visibility both at a high level and into the specific operations that dominate the request's latency. The tracing UI shows a high-level flow diagram that visually represents the different operations making up the transaction, giving a quick breakdown of the request flow through a service. One can also see the time consumed by, and any custom tags added to, each operation (span) within the trace.

Here is a sample of that, for a sample shopping portal app:

In the above screenshot, the sort order is set to “Longest First”, which shows the slowest requests at the top. As an example, ShoppingWebResource.orderShirts, a long-running trace candidate that takes about 17.1 s, appears at the top.

Traces View on Aria Apps

In this section, we give some architectural context for the Aria Operations for Applications platform, focusing on the parts that enable its metric-related workflows and use cases for querying time series metrics.

Moving the index cache to an external distributed cache

AOA app indexing

The most important metadata created during ingestion of metrics are the indices that enable time series queries by metric name, tags/labels, or hosts/sources. For example, these indices answer questions like:

Which tags were reported for different metrics?

Which metrics match certain tags?

These indices enable fast queries in observability workflows and are critical for troubleshooting and for analyzing data with all the special statistical functions that AOA supports on time series data.
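To make the role of these indices concrete, here is a deliberately simplified sketch of the two lookup directions described above. The real index layer in AOA is far more sophisticated; every name here is illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Deliberately simplified, illustrative view of the two index directions
// described above. These maps only convey the idea, not the real storage.
public class TimeSeriesIndexSketch {
    // metric name -> tag key=value pairs reported for that metric
    private final Map<String, Set<String>> tagsByMetric = new HashMap<>();
    // tag key=value pair -> metric names that reported it
    private final Map<String, Set<String>> metricsByTag = new HashMap<>();

    public Set<String> tagsReportedFor(String metric) {
        return tagsByMetric.getOrDefault(metric, Set.of());
    }

    public Set<String> metricsMatching(String tag) {
        return metricsByTag.getOrDefault(tag, Set.of());
    }
}
```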

Index Caching for Time Series Queries

The old implementation of the index cache used in-memory caching: every ingestion pipeline service and query service instance cached all the indices locally for quick access at query time.

Implementation changes

We decided to move the in-memory caching to an external distributed cache, Redis, initially to reduce the cost of horizontally scaling the query services. On a cache miss, the query service now uses a read-through mechanism, fetching the most recent index from our index storage layer and populating the cache.
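A minimal sketch of what such a read-through lookup can look like, here using the Jedis client. The class, host, and the IndexStore interface are assumptions for illustration, not AOA's actual code:

```java
import redis.clients.jedis.JedisPooled;

// Illustrative read-through cache: on a Redis miss, fall back to the index
// storage layer and populate the cache. Names here are hypothetical.
public class IndexReadThroughCache {
    private final JedisPooled redis = new JedisPooled("redis-host", 6379);
    private final IndexStore indexStore; // hypothetical backing index storage layer

    public IndexReadThroughCache(IndexStore indexStore) {
        this.indexStore = indexStore;
    }

    public String getIndexEntry(String key) {
        String cached = redis.get(key);
        if (cached != null) {
            return cached;                    // cache hit: served from Redis
        }
        String fresh = indexStore.load(key);  // cache miss: read through to storage
        redis.set(key, fresh);                // populate Redis for subsequent queries
        return fresh;
    }

    public interface IndexStore {
        String load(String key);
    }
}
```

The read-through pattern keeps the cache population logic in one place: callers only ever ask the cache, and the cache decides when to go back to the index storage layer.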

Using Distributed Tracing in performance troubleshooting

During development and performance regression testing, we run production-like customer queries to understand changes in performance characteristics. This helps us understand how the architecture change affects the latency and correctness of query responses for time series queries on metrics.

Analyzing traces using the trace views in Aria Apps is an important part of understanding regression results.

Tracing instrumentation for Query planning code paths

Code paths in the query planning area are the most interesting for query latencies. As mentioned before, query planning looks up these indices (now in Redis) to decide which time series data match a given metric query. We turned on tracing instrumentation to export traces from this area for analysis after performance regression runs.

Intelligent sampling

AOA's tracing storage uses a tail-based sampling strategy called intelligent sampling. The idea is to retain a representative subset of trace samples around anomalous latencies (p50, p95, etc.) for each request endpoint. This avoids storing and searching redundant traces during analysis, which also reduces storage cost. For more on this, please look here.
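Conceptually, a tail-based sampling decision can be sketched like this. This is only an illustration of the general idea, not AOA's intelligent sampling algorithm:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch of a tail-based sampling decision: keep a completed trace
// only if it looks anomalous for its endpoint. NOT AOA's implementation.
public class TailSamplingSketch {
    // rough running latency threshold per endpoint (hypothetical bookkeeping)
    private final Map<String, Long> p95MillisByEndpoint = new ConcurrentHashMap<>();

    public boolean keepTrace(String endpoint, long durationMillis, boolean hasError) {
        long p95 = p95MillisByEndpoint.getOrDefault(endpoint, Long.MAX_VALUE);
        // retain errored traces and anomalously slow traces; drop redundant ones
        return hasError || durationMillis >= p95;
    }
}
```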

Problematic traces

Once we ran the regression tests, we were able to look at the most time-consuming traces within the query planning phase in AOA's trace view.

While reviewing traces from a query regression run in a performance environment, we zeroed in on a span in a problematic trace falling in the p95 latency range, which automatically shows up in the trace view. As we expanded that specific trace, we could see the different query planning stages and where the time consumed exceeded a second.

P95 for worst case queries

In this screenshot we can see a method called generateMetricHostPairs.solve() consuming almost a minute (59 s) after moving the indices to Redis. As we investigated this legacy code area, we saw that it expands potentially large loops over metric-host pairs.

To do that, the code loads the hosts and tags from the indices matching each metric in order to finalize the time series for which we would fetch data points. This means that if 1,000 metrics match, we make 1,000 lookup calls for the corresponding index entries. In the older architecture, each of these was an in-memory cache lookup returning in sub-millisecond time. In the newer architecture with Redis, each call is a round trip to the Redis infrastructure, which takes milliseconds (sometimes hundreds of milliseconds) rather than sub-milliseconds. Once we understood this problematic area, it was clear what to do to optimize this high-throughput external network call.
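In rough code terms, the problematic path looked something like the loop below (reusing the hypothetical read-through cache from the earlier sketch). This is an illustration, not the actual query planning code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative version of the problematic path: one Redis round trip per
// matched metric. With ~1,000 metrics and milliseconds per round trip, this
// loop alone can dominate the p95 latency. All names are hypothetical.
class PerMetricIndexLookup {
    List<String> loadIndexEntries(IndexReadThroughCache cache, List<String> matchedMetrics) {
        List<String> entries = new ArrayList<>();
        for (String metric : matchedMetrics) {
            entries.add(cache.getIndexEntry(metric)); // one network call per metric
        }
        return entries;
    }
}
```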

Fixing the high-throughput calls consuming network time

After analyzing the traces in the trace views, we decided to optimize this code path by collecting all the index keys we need up front and loading them with a single batch call. This sends one network call to fetch all the indices (e.g., 1,000 of them) instead of making 1,000 network calls to fetch them one at a time.
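A sketch of the batched variant, here using a single Redis MGET via Jedis. The names are hypothetical, and the real fix may batch differently (for example, with a pipeline):

```java
import java.util.List;
import redis.clients.jedis.JedisPooled;

// Illustrative batched lookup: collect all index keys up front and fetch them
// in a single MGET round trip instead of one GET per metric.
class BatchedIndexLookup {
    private final JedisPooled redis = new JedisPooled("redis-host", 6379);

    List<String> loadIndexEntries(List<String> matchedMetrics) {
        String[] keys = matchedMetrics.toArray(new String[0]);
        return redis.mget(keys); // one network round trip for all ~1,000 entries
    }
}
```

The trade-off is one larger payload per call instead of many small ones, which suits an index lookup where all the keys are known before the fetch.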

Here is how the same query planning trace looked after the optimized code path was deployed and tested with regression:

As we can see, large index lookups went from close to a minute to less than 3 seconds once we deployed the fix with batched Redis calls.

Adding more custom instrumentation

With code areas in constant flux, it is important to add custom spans (operations). This makes requests going through these paths more observable when latency problems occur. Here is how we changed the “generateMetricHostPairs” span to add a child span where we aggregate the index fetches into a batch fetch call to Redis.
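A rough sketch of what such a child span looks like with the OpenTelemetry Java API. Apart from the asyncGetBatch and generateMetricHostPairs names taken from the trace, everything here is illustrative:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import java.util.List;

// Wrapping the batched index fetch in its own child span so that it shows up
// under generateMetricHostPairs in the trace view. Illustrative code only.
class BatchFetchInstrumentation {
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("query-planner"); // scope name is an assumption

    List<String> asyncGetBatch(BatchedIndexLookup lookup, List<String> metrics) {
        // Started inside the current (generateMetricHostPairs) scope, so this
        // span becomes its child in the trace.
        Span span = tracer.spanBuilder("asyncGetBatch").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("index.batch.size", metrics.size()); // custom tag
            return lookup.loadIndexEntries(metrics);
        } finally {
            span.end(); // records the time spent in the batched Redis call
        }
    }
}
```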

The trace shows the newly added method asyncGetBatch as its own span, with the time taken.

Conclusion

Distributed tracing adds deep, granular observability both within a single service and across microservice-oriented distributed architectures. This pillar of observability is the most important one for performance troubleshooting use cases, and it should be used in both production and pre-production stages of development. As the developer community understands and embraces this next pillar of observability, which helps with debugging performance bottlenecks (and errors), distributed tracing will become the most used observability pillar. If you have limited visibility into latency issues, or are moving towards a more distributed architecture that could be made observable with distributed tracing, try a free trial here. You can easily get started by sending spans using the OTel auto-instrumentation agents.