This post was co-written by Gareth Clay, Senior Software Engineer at Pivotal
The open-source Spring Cloud Netflix library provides Spring applications with a robust implementation of the Circuit Breaker pattern through Hystrix, Netflix’s latency and fault-tolerance library. In previous versions of Spring Cloud Services, our team offered support for visualizing Hystrix circuit breaker metrics with a fully Pivotal Platform-integrated Circuit Breaker Dashboard. Since its first release, Circuit Breaker Dashboard has itself been based on Netflix’s Hystrix Dashboard.
Recently, Netflix went “all-in” on Spring Cloud and put some of their projects, including Hystrix, into maintenance mode. This means that no new features will be added and fixes will only be made for blocker bugs and security issues. Also, the Hystrix Dashboard has some known security issues and was moved to the Netflix skunkworks GitHub organization to emphasize that it is no longer being actively developed.
Building a Replacement Circuit Breaker Dashboard
With the Hystrix Dashboard being retired by Netflix, and with the incubating Spring Cloud Circuit Breaker project opening up new circuit breaker implementations beyond Hystrix for Spring application developers in the future, we have decided not to provide the existing Hystrix-based Circuit Breaker Dashboard in Spring Cloud Services v3. We’re very mindful that this leaves our users without the out-of-the-box visualization for Hystrix circuit breakers that was available in previous versions, so in this blog we’ll explore what’s needed to create a replacement for Circuit Breaker Dashboard for your Hystrix applications running on PCF.
We’ll need a new visualization tool to replace the dashboard, and to populate it, we’ll need to publish our metrics in a format it understands. We’ll use Spring Boot’s Micrometer integration to publish our metrics, along with a couple of different methods of collection and visualization. Micrometer provides a simple metrics collection facade for the most popular monitoring systems, allowing you to instrument your Spring application code without vendor lock-in. Think SLF4J, but for metrics.
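To illustrate the facade idea, here's a small, hypothetical service that times a piece of work through Micrometer's MeterRegistry. The class and metric names are illustrative; the point is that the instrumentation code is identical whichever registry backend is on the classpath.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

// Hypothetical service, for illustration only: the Micrometer API is
// backend-agnostic, so this code is unchanged whether the auto-configured
// registry ships metrics to Datadog, Prometheus, or anything else.
@Service
public class CheckoutService {

    private final Timer checkoutTimer;

    public CheckoutService(MeterRegistry registry) {
        // Registers (or looks up) a timer named "checkout.latency" in whichever
        // registry Spring Boot has auto-configured.
        this.checkoutTimer = registry.timer("checkout.latency");
    }

    public String checkout(String basketId) {
        // Records how long the supplied work takes and returns its result.
        return checkoutTimer.record(() -> processBasket(basketId));
    }

    private String processBasket(String basketId) {
        return "Processed basket " + basketId;
    }
}
```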
Publishing Metrics Directly
In the simplest configuration, which we’ll explore here, we’ll configure Micrometer to ship metrics directly to a metrics registry. In this example, we’ll use Datadog, a popular SaaS offering, as our registry and visualization tool. Micrometer will ship metrics directly to Datadog via its secure API.
By adding just a couple of dependencies, the Micrometer Datadog registry and Spring Boot Actuator, our application will be auto-configured to send Hystrix metrics to Datadog.
build.gradle

```groovy
dependencies {
    compile 'org.springframework.cloud:spring-cloud-starter-netflix-hystrix'
    compile 'org.springframework.boot:spring-boot-starter-actuator'
    compile 'io.micrometer:micrometer-registry-datadog'
    ...
}
```

pom.xml

```xml
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-datadog</artifactId>
</dependency>
```

application.yml

```yaml
management:
  metrics:
    export:
      datadog:
        enabled: true
        apiKey: {your-datadog-api-key}
```
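The dependencies and configuration above assume the application already has Hystrix support switched on. If it doesn't, a minimal sketch of the main class might look like the following (the class name is illustrative; `@EnableCircuitBreaker` works equally well):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.hystrix.EnableHystrix;

// Illustrative main class: @EnableHystrix turns on the Hystrix command aspect,
// so methods annotated with @HystrixCommand are wrapped in commands and start
// emitting the metrics that Micrometer ships to Datadog.
@SpringBootApplication
@EnableHystrix
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}
```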
Metrics are rate-aggregated and pushed to Datadog at a configurable interval. Looking at the Datadog metrics explorer, we can see the Hystrix metrics that have been pushed.
Replicating the Circuit Breaker Dashboard
Now that we have our metrics being published to Datadog, let’s look at how to build a replacement circuit breaker dashboard. To do this in Datadog, we can simply build a custom dashboard of charts for our Hystrix metrics.
`hystrix.execution` has the metrics for all the Command Execution event types.
Each method annotated with `@HystrixCommand` will have its own key, making it very easy to plot, count and alert from.
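For example, a hypothetical command such as the following (with group and key names chosen to match the metric labels shown later in this post) would surface under its own command key in Datadog:

```java
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.springframework.stereotype.Service;

// Hypothetical command: the command key ("greetings") and group key
// ("DoSomething") appear as tags on the hystrix.execution metric.
@Service
public class DoSomethingService {

    @HystrixCommand(groupKey = "DoSomething", commandKey = "greetings",
                    fallbackMethod = "greetingFallback")
    public String greeting(String name) {
        // Placeholder for a call to a downstream service that may fail or time out.
        return callRemoteGreetingService(name);
    }

    // Invoked by Hystrix when the command fails, times out, or the circuit is open.
    public String greetingFallback(String name) {
        return "Hello, " + name + " (fallback)";
    }

    private String callRemoteGreetingService(String name) {
        return "Hello, " + name + "!";
    }
}
```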
By default, the Circuit Breaker Dashboard displayed a histogram of the 90th, 99th, and 99.5th percentiles. To enable those metrics to be sent through Micrometer, the following properties have to be added to the client application:
application.yml

```yaml
management:
  metrics:
    distribution:
      percentiles:
        hystrix: 0.90, 0.99, 0.995
      percentiles-histogram:
        hystrix: true
```
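If you prefer Java configuration over properties, the same effect can be achieved with a Micrometer MeterFilter. A sketch, with illustrative class and bean names, applying the configuration to every meter whose name starts with `hystrix`:

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Equivalent to the percentile properties above, expressed as a MeterFilter.
@Configuration
public class HystrixMetricsConfiguration {

    @Bean
    public MeterFilter hystrixPercentilesFilter() {
        return new MeterFilter() {
            @Override
            public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
                if (id.getName().startsWith("hystrix")) {
                    // Enable percentiles and the percentile histogram for Hystrix meters only.
                    return DistributionStatisticConfig.builder()
                            .percentiles(0.90, 0.99, 0.995)
                            .percentilesHistogram(true)
                            .build()
                            .merge(config);
                }
                return config;
            }
        };
    }
}
```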
With the percentiles histogram enabled, we have access to each percentile (phi) we set in the properties:
Publishing Metrics via the Platform
So far we’ve been publishing Micrometer metrics from our example application straight to a registry, in our case Datadog, via the registry’s API. While this is simple, it’s hard to scale. Every application must be configured with the registry API credentials, and should you want to switch to another registry in the future, every application will need to be rebuilt with different dependencies and new configuration.
An alternative approach is to publish metrics to the Loggregator system. Loggregator collects all the application logs and metrics it receives from across the platform and makes them available via the Firehose. ‘Nozzles’ can be attached to the Firehose to extract and publish subsets of the Firehose data stream to various downstream systems. In this example, we’ll show how to publish our application metrics to Datadog again, only this time via the Firehose and Datadog nozzle.
To get started, we’ll need a way to publish custom metrics to Loggregator. While application logs and system health metrics are forwarded to Loggregator by default, the same is not true for custom application metrics, such as those emitted by Hystrix. Fortunately, this is easy to achieve, particularly for Spring Boot applications.
The platform component we need to make use of here is Metric Registrar. Metric Registrar is enabled by default in the Pivotal Platform and manages the publication of custom application metrics to Loggregator on a per-application basis. In order for it to work, the app must emit metrics in a format that Metric Registrar understands, and the app must also be registered so that Metric Registrar knows where it should be collecting metrics from.
Metric Registrar understands two methods of application metrics publication. It can either poll a REST endpoint which exposes metrics in Prometheus format, or it can consume structured log entries from your application logs. If the application publishing metrics is a Spring Boot app, then no code changes are required to expose a Prometheus endpoint, so let’s explore this approach. It’s simply a case of swapping out our Datadog registry Micrometer dependency from the previous example for a Prometheus one:
build.gradle:

```groovy
dependencies {
    // replace the Datadog registry...
    // compile 'io.micrometer:micrometer-registry-datadog'
    // ...with the Prometheus one
    compile 'io.micrometer:micrometer-registry-prometheus'
    ...
}
```

pom.xml:

```xml
<dependencies>
    ...
    <!-- replace the Datadog registry...
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-datadog</artifactId>
        <version>${micrometer.version}</version>
    </dependency>
    -->
    <!-- ...with the Prometheus one -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
        <version>${micrometer.version}</version>
    </dependency>
    ...
</dependencies>
```
As is usual for Spring Boot projects, this dependency is version managed by the Spring Boot dependencies BOM (Maven) or Spring Boot Gradle Plugin. You’ll find more information about these in the Spring Boot documentation.
Spring Boot Actuator will now detect the Micrometer Prometheus registry dependency and automatically configure an /actuator/prometheus endpoint in our application. The next step is to register this endpoint with Metric Registrar so that it includes the endpoint in the list it polls for metrics to publish to the Firehose. For this, we need the Metric Registrar Cloud Foundry CLI plugin:
```
cf install-plugin -r CF-Community "metric-registrar"
cf register-metrics-endpoint <your app name> /actuator/prometheus
```
That’s everything we need in place for our Hystrix metrics to start appearing on the Firehose. You will notice in your application logs that Metric Registrar periodically polls the endpoint, by default every 30 seconds.
Visualization with PCF Metrics
The Pivotal Platform offers a metrics visualization system that integrates right into the platform, in the form of the PCF Metrics tile. This can be downloaded from Pivotal Network and installed via Ops Manager. Once installed, application metrics and logs from the Firehose are immediately visible at https://metrics.<your PCF system domain>:
By default, the PCF Metrics dashboards display key performance indicator application metrics such as request latency, CPU and memory utilization, but as you can see in this example, it’s easy to add charts for custom metrics, such as those from Hystrix, to the dashboards too.
Publishing from the Firehose to Datadog
A great advantage of publishing metrics to the Firehose is that we can forward these anywhere we like, and to multiple locations simultaneously should we so choose. In this example, we’ll attach a Firehose nozzle to forward our metrics to our aggregator of choice.
As before, we’ll use Datadog as an example, so we’ll need a Datadog nozzle implementation. The Datadog Firehose nozzle is part of the Datadog Cluster Monitoring for PCF product, which can be downloaded and installed from Pivotal Network. The documentation describes how to install the product via Ops Manager; in particular, you’ll need to enter your Datadog API key and create a UAA client to allow the nozzle to authenticate with the platform. Once this is done, you should see your application metrics appearing in Datadog.
Limitations With Pull-Based Metrics Publication
The Circuit Breaker Dashboard in previous versions of Spring Cloud Services, and the ‘direct publication’ Datadog example in this article, both use a push-based model of metrics collection. In this approach, the instrumented application is responsible for the metrics calculations and must push all of those metrics to the receiver. The current trend in the industry, however, is moving toward a pull-based model, as implemented by Metric Registrar. This relieves the client of a significant workload: instrumenting a service is cheaper on the client side when the server does the heavy lifting for the more complex calculations.
This changes how the metrics are calculated, however, since it requires calculation support on the server side. Concretely, in terms of Hystrix metrics, the only data from the original Circuit Breaker Dashboard that we can’t yet reproduce is the 90th, 99th, and 99.5th percentile information. Neither PCF Metrics nor Datadog has built-in support for these calculations yet, but they will be implemented in the future.
Quantiles are expensive to calculate accurately because they require the full set of samples. Histograms approximate this by counting observations into buckets, and a quantile, the value below which a given proportion of the observations falls, can then be estimated from those bucket counts. The instrumented application exposes both its locally calculated quantiles and the buckets from which quantiles can be derived:
```
# TYPE hystrix_latency_total_seconds histogram
hystrix_latency_total_seconds{group="DoSomething",key="greetings",quantile="0.9",} 9.8304E-4
hystrix_latency_total_seconds{group="DoSomething",key="greetings",quantile="0.99",} 0.335511552
hystrix_latency_total_seconds{group="DoSomething",key="greetings",quantile="0.995",} 0.335511552
hystrix_latency_total_seconds_bucket{group="DoSomething",key="greetings",le="0.001",} 727.0
hystrix_latency_total_seconds_bucket{group="DoSomething",key="greetings",le="0.001048576",} 727.0
hystrix_latency_total_seconds_bucket{group="DoSomething",key="greetings",le="0.001398101",} 727.0
hystrix_latency_total_seconds_bucket{group="DoSomething",key="greetings",le="0.001747626",} 727.0
hystrix_latency_total_seconds_bucket{group="DoSomething",key="greetings",le="0.002097151",} 930.0
```

We can use these buckets to infer the quantiles by running a query in the Prometheus query language:

```
histogram_quantile(0.99, sum(rate(hystrix_latency_total_seconds_bucket[5m])) by (le))
```
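To make the mechanics concrete, here is a minimal, self-contained sketch (Java 16+ record syntax, with invented bucket boundaries and counts) of the kind of linear interpolation that histogram_quantile() performs over cumulative bucket counts:

```java
import java.util.List;

// Illustrative only: estimates a quantile from cumulative histogram buckets
// by interpolating linearly within the bucket containing the target rank.
public class HistogramQuantileSketch {

    record Bucket(double upperBound, double cumulativeCount) {}

    static double quantile(double q, List<Bucket> buckets) {
        double total = buckets.get(buckets.size() - 1).cumulativeCount();
        double targetRank = q * total;
        double prevBound = 0.0;
        double prevCount = 0.0;
        for (Bucket bucket : buckets) {
            if (bucket.cumulativeCount() >= targetRank) {
                double inBucket = bucket.cumulativeCount() - prevCount;
                double fraction = inBucket == 0 ? 0 : (targetRank - prevCount) / inBucket;
                // Interpolate between the bucket's lower and upper bounds.
                return prevBound + fraction * (bucket.upperBound() - prevBound);
            }
            prevBound = bucket.upperBound();
            prevCount = bucket.cumulativeCount();
        }
        return prevBound;
    }

    public static void main(String[] args) {
        // Invented cumulative counts, loosely shaped like the output above.
        List<Bucket> buckets = List.of(
                new Bucket(0.001, 727),
                new Bucket(0.002097151, 930),
                new Bucket(0.335544320, 1000));
        // Prints an estimated 99th percentile latency in seconds.
        System.out.println(quantile(0.99, buckets));
    }
}
```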
Metrics Publication Roundup
We’ve now explored two approaches to publishing custom metrics, such as those from Hystrix, from applications running on PCF. Broadly speaking, the two alternatives are ‘push’ based, where all metrics are calculated in the application and shipped directly to a target, and ‘pull’ based, where metrics are made available by the application and periodically collected by an external publisher.
You might be wondering how to choose between these approaches. Both have advantages and disadvantages.
Push-publishing directly to a metrics registry
- Advantages: Simple to set up; no platform configuration required
- Disadvantages: Registry connection and authentication must be configured per publishing application, metrics can only be received by the configured registry.
Pull-collection by Metric Registrar for publication to the Firehose
- Advantages: Can be implemented in Spring Boot apps through a simple dependency update, metrics are registry agnostic, immediately visible in PCF Metrics within the platform, metrics can be published to multiple registries through multiple nozzles, the Firehose becomes a single source for all platform and application metrics, instrumented applications do not require metrics registry credentials
- Disadvantages: Requires a one-time registration of each app with Metric Registrar, increased IaaS resource utilization to run PCF Metrics and/or nozzles, visualization of pull-based metrics that require more complex calculations may not yet be supported by the metrics registry you choose to use.
In summary, the simplicity of the direct publication approach makes it an excellent choice for getting started and experimentation. You can quickly test new registries purely through application configuration, without any need to install or manage platform components. As your approach solidifies and you move to scale, migration to the Firehose publication route provides a simple way to unify your metrics strategy across the platform, with preconfigured publication just a single command away for app developers.