Earlier this week, we announced the general availability of a major upgrade to vFabric Application Performance Manager (APM). Work on this release started a year ago, shortly after we brought the first version of the product to market. From the outset, we knew we would need to invest heavily in scalability. APM is designed to simplify monitoring and management for highly dynamic, large web applications living in the cloud, and to succeed we needed to make sure our product could scale gracefully with our customers. So, we set ourselves a challenging goal: increase the capacity of APM by a factor of 5.
Transforming a complex product such as APM into a more scalable architecture is no easy task, let alone in a single release. For this reason, we are modifying the architecture in steps, starting with local improvements inside our virtual appliance (available in the APM 5.0 release) and moving toward a horizontally scalable solution in future releases.
In APM 5.0, we significantly redesigned the backend and completely rewrote the metrics collection and storage module, which had been inhibiting APM's ability to scale. We turned to the market-leading products we have in-house in the vFabric product line, writing the new module from scratch using various scalable design patterns together with vFabric GemFire for in-memory storage, vFabric RabbitMQ for messaging, vFabric tc Server as the Java runtime container, the vFabric Postgres (vPostgres) database, and Spring Integration to connect the components together.
As a result, the backend redesign drastically improved the monitoring capacity of APM 5.0, enabling it to concurrently monitor over 2,000 components of various distributed user applications and reach a whopping throughput of up to 3M metrics per minute, all on a single VM.
APM UI response time was also improved dramatically, providing a fast and responsive user experience even for large monitored environments. Several other performance-related glitches were eliminated as well.
Specific Scalability Challenges in the Previous APM Version
When we started our redesign, we faced several specific issues that crippled APM's performance and ability to scale. These included:
- Slow and inefficient insertion and retrieval of metric data
- Excessive use of the database and high disk I/O
- Tightly coupled components within the server and single-threaded operations
- Synchronous communication between the agents and the server processes
Meeting the Challenges in APM 5.0
In the design and development of APM 5.0 we aimed to correct these scalability and performance issues by making the following major changes to the product:
- Replace the metric processing and storage logic
- Apply several scalability design patterns
- Implement various performance optimizations
- Utilize off-the-shelf technologies
- Keep the road open for horizontal-scalability later on
The simplified diagram below conveys the APM 5.0 architecture:
The New Metric Store
One of the major limitations of previous APM versions was the inefficient storage and retrieval of the monitored metrics. For this reason, we completely rewrote the metric insertion and fetching flows.
The metric insertion flow utilizes the classic Worker Pattern, where several workers can grab messages from the incoming metrics queue (i.e. RabbitMQ) and process them inside the Metrics Store in parallel.
The metric insertion operation remains fast and stateless, utilizing GemFire in-memory cache as the first line of storage for the incoming metrics. Initial rollups of the metric data are also performed in-memory at the time of insertion. We’ve found that this initial in-memory processing of the data dramatically reduces the database load and the disk I/O utilization of the VM.
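The combination of these two ideas can be sketched in plain Java. In this illustrative stand-in (not APM's actual code), a `BlockingQueue` plays the role of the RabbitMQ metrics queue, a `ConcurrentHashMap` plays the role of the GemFire cache, and several workers drain the queue in parallel, rolling samples up per minute at insertion time:

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.concurrent.atomic.DoubleAdder;

// Worker Pattern sketch: several workers drain an incoming metric queue
// (standing in for RabbitMQ) and roll samples up per minute in memory
// (standing in for the GemFire cache). Names are illustrative.
public class MetricIngest {
    static final class Sample {
        final String metric; final long epochMillis; final double value;
        Sample(String metric, long epochMillis, double value) {
            this.metric = metric; this.epochMillis = epochMillis; this.value = value;
        }
    }

    final BlockingQueue<Sample> incoming = new LinkedBlockingQueue<>();
    // key = metric name + minute bucket; value = running sum for the rollup
    final Map<String, DoubleAdder> minuteRollups = new ConcurrentHashMap<>();

    // Stateless per-message processing: aggregate into the minute bucket.
    void ingest(Sample s) {
        String bucket = s.metric + "@" + (s.epochMillis / 60_000);
        minuteRollups.computeIfAbsent(bucket, k -> new DoubleAdder()).add(s.value);
    }

    // Run a pool of workers until totalSamples messages have been processed.
    void runWorkers(int workers, int totalSamples) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch done = new CountDownLatch(totalSamples);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                while (done.getCount() > 0) {
                    Sample s = incoming.poll(100, TimeUnit.MILLISECONDS);
                    if (s != null) { ingest(s); done.countDown(); }
                }
                return null;
            });
        }
        done.await();
        pool.shutdownNow();
    }
}
```

Because each worker is stateless and the rollup map handles contention internally, adding workers raises throughput without any coordination between them.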
We use vPostgres as a backend database for long term metric data. We designed its schema specifically to handle the massive flow of metrics efficiently, using partitioned sub-tables for fast insertion and fast deletion of the data when it is no longer needed. The representation of the metric data is designed to reduce disk I/O and seek-time by grouping metrics of specific monitored objects together in a single row.
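The partitioning idea can be illustrated with a small routing helper. The table names and the per-day granularity here are assumptions for the sake of the example, not APM's actual schema; the point is that once each time window lives in its own sub-table, expiring old data becomes a cheap `DROP TABLE` instead of a slow row-by-row `DELETE`:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch of time-based partition routing (illustrative names, not APM's
// actual schema): each day of metrics lands in its own sub-table.
public class MetricPartitioner {
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // e.g. 2012-06-15T10:00Z -> "metrics_20120615"
    static String partitionFor(Instant ts) {
        return "metrics_" + DAY.format(ts);
    }

    // DDL a retention job might issue once a day falls out of the window
    static String dropStatement(Instant ts) {
        return "DROP TABLE IF EXISTS " + partitionFor(ts) + ";";
    }
}
```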
Metrics queries are performed both on GemFire when short term data is needed and on vPostgres when longer term data is required. The two query mechanisms complement each other inside the Metric Store but only a single unifying query interface is exposed externally, making the consumers of the metric data unaware of the internal representation choice.
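A minimal sketch of such a unifying facade, with hypothetical names and a hypothetical age cutoff (APM's actual query API is not shown here): callers see one `read` method, while the facade routes to the in-memory store for recent data and the database for history, merging the two when a range spans both:

```java
import java.util.ArrayList;
import java.util.List;

// Single query interface over two stores: recent data from the in-memory
// store (GemFire's role), older data from the long-term store (vPostgres's
// role). Names and the cutoff are illustrative.
interface MetricReader {
    List<Double> read(String metric, long fromMillis, long toMillis);
}

class MetricQueryFacade implements MetricReader {
    private final MetricReader shortTerm;  // in-memory, recent window
    private final MetricReader longTerm;   // database, history
    private final long cutoffMillis;       // age at which data is "long term"

    MetricQueryFacade(MetricReader shortTerm, MetricReader longTerm, long cutoffMillis) {
        this.shortTerm = shortTerm; this.longTerm = longTerm; this.cutoffMillis = cutoffMillis;
    }

    @Override
    public List<Double> read(String metric, long fromMillis, long toMillis) {
        long boundary = System.currentTimeMillis() - cutoffMillis;
        // Route to whichever store covers the range; callers never see
        // which backend answered.
        if (fromMillis >= boundary) return shortTerm.read(metric, fromMillis, toMillis);
        if (toMillis < boundary)    return longTerm.read(metric, fromMillis, toMillis);
        List<Double> merged = new ArrayList<>(longTerm.read(metric, fromMillis, boundary));
        merged.addAll(shortTerm.read(metric, boundary, toMillis));
        return merged;
    }
}
```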
The fast in-memory metric storage and the parallel execution of the Worker Pattern enables the Metric Store to process over 3M metrics per minute, matching our performance goals for this release.
Applying Scalability Design Patterns
To achieve APM’s current performance at scale we redesigned the APM Server and Metric Store with several commonly used scalability design patterns. Guided by these patterns, we’ve laid out an infrastructure that performs well under the current capacity requirements, but also clears a path towards a future horizontal scale architecture.
Here are a few of the major patterns we’ve introduced in this release:
- Loosely Coupled Components – The components communicate through the integration layer (Spring Integration in our case) and not by accessing dependent components directly. This allows component dependencies to be reorganized as needed, especially for introducing a clustered service setup later on.
- Parallel Execution – We’ve partitioned the functionality of some operations such that separate parts of the operation can be done in parallel, increasing the overall throughput.
- Stateless Services – Services keep no conversational state, allowing concurrent calls without the danger of state corruption.
- Asynchronous & Batch Operations – Backend support operations, such as analytic calculations and temporal rollup of metric data, are done asynchronously in batch for higher efficiency.
- Messaging – We’ve employed asynchronous messaging for communication between the various components. This supports the loose coupling pattern, but also enables asynchronous and parallel executions and allows more efficient remote communication between the components.
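The asynchronous-batch pattern in particular can be sketched in a few lines (an illustrative stand-in, not APM's actual code): recording a raw point is a cheap, non-blocking buffer append, and a background task periodically rolls a whole batch up in one pass instead of touching storage once per point:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Asynchronous batch rollup sketch: raw points accumulate in a buffer and
// are rolled up in bulk by a background pass. Names are illustrative.
public class BatchRollup {
    private final Queue<Double> buffer = new ConcurrentLinkedQueue<>();
    private final List<double[]> rollups = new CopyOnWriteArrayList<>(); // {count, sum}

    void record(double value) { buffer.add(value); }  // cheap, non-blocking

    // One batch pass: drain everything buffered so far into one rollup row.
    void rollupOnce() {
        double count = 0, sum = 0;
        for (Double v; (v = buffer.poll()) != null; ) { count++; sum += v; }
        if (count > 0) rollups.add(new double[] { count, sum });
    }

    List<double[]> rollups() { return rollups; }

    // In a real server this pass would run on a scheduler, e.g.:
    // Executors.newSingleThreadScheduledExecutor()
    //          .scheduleAtFixedRate(this::rollupOnce, 1, 1, TimeUnit.MINUTES);
}
```

The callers of `record` never wait on storage, which is what lets the hot insertion path stay fast while the expensive aggregation happens off to the side.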
We’ve decoupled some of our major Spring components by using the Spring Integration abstraction layer, as mentioned above. This change allowed us to easily introduce parallel execution or remote asynchronous communication through a simple configuration change to the Spring Integration, leaving the components agnostic of such a change.
We’ve modified all agents and probes to communicate with the Server and the Metric Store using asynchronous messages. This is a major scalability improvement: the combined throughput of many agents communicating concurrently can be substantial, and handling it synchronously can strain the server’s resources considerably. Moreover, most of the AMQP messages are passed “unacknowledged” (i.e. using noAck mode), especially on flows that require high throughput. These specific flows were designed to tolerate occasional message loss without compromising functional integrity.
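The trade-off behind the unacknowledged flows can be shown with a tiny fire-and-forget channel (a plain-Java stand-in for AMQP noAck publishing; the class and capacity are illustrative): the sender returns immediately whether or not the message was accepted, and under overload a message is simply dropped rather than backing up the producer:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Fire-and-forget sketch of an "unacknowledged" flow: no ack, no blocking,
// occasional loss accepted in exchange for sustained throughput.
public class UnackedChannel {
    private final BlockingQueue<String> channel;
    private long dropped;

    UnackedChannel(int capacity) { channel = new ArrayBlockingQueue<>(capacity); }

    // Returns immediately whether or not the message was accepted;
    // a full channel means the message is silently dropped.
    void publish(String message) {
        if (!channel.offer(message)) dropped++;
    }

    String poll() { return channel.poll(); }
    long dropped() { return dropped; }
}
```

A flow built this way must be designed, as in APM 5.0, so that losing an individual message only degrades a data point rather than breaking correctness.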
Several other performance-related improvements were added in this release as well: more efficient KPI and baseline analytic calculations, faster and more robust access to external resources, and more optimized retrieval of the presentation data, to name a few.
An Architecture for the Future
The APM 5.0 backend changes are a major step towards a highly scalable architecture.
The overall redesign also laid the groundwork for future horizontal-scalability work: the design patterns and technologies introduced in APM 5.0 are crucial for partitioning data and operations across several distributed processing nodes, which is the cornerstone of horizontal scalability.
If you would like to learn more about Application Performance Manager, you may find the following of interest:
- Making Sense of Spaghetti Transactions
- Spring Insight “split-agent” Architecture
- Understanding the Difference between Spring Insight Developer & Spring Insight Operations
- Benefits, Features, Trial Downloads, and Resources for Application Performance Manager
About the Author: Ram Janovski is a Staff Engineer at VMware, working on vFabric Application Performance Manager. Ram has over 10 years of experience in software engineering, specializing in software design, scalable architectures, and application performance management. Ram holds an M.Sc. in Computer Science from Ben Gurion University, and previously worked on VMware vCenter Infrastructure Navigator (VIN) and on VMware Application Dependency Manager (ADM) at EMC.