Introduction
In this article, we will follow the fictitious First Bank of Ebberød through their somewhat rocky entry into the world of high-frequency algorithmic trading and how Wavefront can help them.
The bank has suffered from poor profitability and a lack of innovation for a long time. To address both of these issues, the bank’s leadership decided to launch an algorithmic trading division. Unfortunately, their budget was limited, so they had to go with the cheapest bid for a consultant to help them and settled for Johnny’s Tires and Quants. After investing more time and money than they had anticipated into the project, it was time for the big day of live trading.
And it was… a spectacular failure.
Performance was horrible and the bank was losing money on their algorithmic trading instead of making a profit.
Luckily, their virtualization admin had heard about a new tool from VMware called Wavefront and recommended they try it!
The road back to profitability
A team is assembled to find the root cause of the performance problem. Somewhat to their surprise, the math provided by Johnny’s Tires and Quants turns out to be sound, but the implementation suffers from serious performance issues. Rewriting the application from scratch is ruled out due to cost, so the team moves on to fixing the performance problems in the existing application.
Unfortunately, traditional profiling tools don’t find the problem. First of all, the problem only seems to manifest itself in the production environment. Second, the behavior seems to vary over a long period of time. Finally, it’s suspected that it will be very hard to find the problem without being able to correlate application performance with the performance of the platform, as well as with business metrics such as transaction volumes and revenue numbers.
Instrumenting the Application
Structure of the Application
The trading application consists of two main classes: The Trader contains all the logic for making trades, such as selecting the stock, the amount and whether to buy or sell. The MarketLink is responsible for sending orders to the markets. Since Ebberød wants it to make money all the time, it’s driven by an infinite loop that just keeps trading.
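To make that concrete, here is a minimal sketch of what the main loop might look like. The class name and method signatures are assumptions made for illustration; the real code lives in the demo repository linked at the end of this article.

// Hypothetical sketch of the main trading loop; see the demo repository for the real thing.
public class TradingApp {
    public static void main(String[] args) throws InterruptedException {
        MarketLink market = new MarketLink();
        Trader trader = new Trader(market);
        // Ebberød wants to make money around the clock, so we never stop trading.
        while (true) {
            trader.makeTrade();
        }
    }
}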
If you want to follow along in the code as we instrument the application, you can find it here.
Instrumenting with Dropwizard Metrics
The performance team at Ebberød decides to use Dropwizard Metrics to instrument the application. This is an open source instrumentation framework that can output statistics to a wide range of analysis tools, including Wavefront. You can find Dropwizard Metrics here. Since the bank wanted to use Wavefront to analyze the data, they also downloaded the Wavefront Dropwizard plugin from here.
Let’s get straight to it! The most basic type of metric is a Meter. This is simply an object that registers every time an event occurs. It’s typically used for checking the rate at which a method is called. Metrics are managed by a metric registry, which is essentially a cache of all the metric objects you create. Let’s have a look at the MarketLink class. Here we decided to use a Timer instead of a Meter. Timers produce a wide range of statistics on timing and throughput, such as rates, mean response times, and percentiles.
public class MarketLink {
    private Random r = new Random();

    public enum TXType { BUY, SELL };

    private final Timer t_trades = MetricHelper.getRegistry().timer("MarketLink.placeOrder.time");
    private final Timer t_sendRequest = MetricHelper.getRegistry().timer("MarketLink.placeOrder.sendRequest");
    private final Timer t_waitReply = MetricHelper.getRegistry().timer("MarketLink.placeOrder.waitReply");

    /**
     * Places an order and waits for a response.
     */
    public String placeOrder(String symbol, TXType txType, int amount) throws InterruptedException {
        Timer.Context ctx = t_trades.time();
        try {
            String token = sendRequest(symbol, txType, amount);
            return waitReply(token);
        } finally {
            ctx.stop();
        }
    }
What’s going on here? First we need to obtain the timers from the registry. In this case we just declare them as fields of the class. Now we can use a timer in our methods to track how long they take to execute. To do that, we first obtain a context; think of it as one specific running instance of the timer. Obtaining the context also starts the timer. Then we do our processing and finally (no pun intended) we stop it in a “finally” clause. The reason for the “finally” is to make sure we stop the timer even if an exception is triggered by our code.
Before we can do any metric reporting, we need to associate a Reporter with our registry. In our case, this is done in the class MetricHelper and we’re hooking up a WavefrontReporter to send data to Wavefront through a proxy.
public class MetricHelper {
    private static final MetricRegistry registry;

    static {
        registry = new MetricRegistry();
        WavefrontReporter reporter = WavefrontReporter.forRegistry(registry)
            .withSource("ebberod bank")
            .withPointTag("dc", "ebberod")
            .withPointTag("service", "trading")
            .withJvmMetrics()
            .build("localhost", 2878);
        reporter.start(20, TimeUnit.SECONDS);
    }

    public static final MetricRegistry getRegistry() {
        return registry;
    }
}
Along with the address and port of the Wavefront Proxy, we can also specify a source name and point tags to further identify the data points.
Let’s see what happens when we run our code with instrumentation turned on! Here’s what the graph looks like in Wavefront! This graph shows the delay imposed by the placeOrder function. As you can see, we’re experiencing a >50ms delay that keeps growing. That’s certainly not good enough for low-latency trading!
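The query driving that chart is simply the mean of the placeOrder timer. Based on the timer name registered in the listing above, it would be something like the following; the exact metric name depends on how you register the timer:

ts("MarketLink.placeOrder.time.mean")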
Let’s see how many trades per second we can execute. When you’re declaring a timer, it gives us a lot of metrics other than the timing itself. For example, we get all the percentiles for the execution time, as well as the 1, 5 and 15 minute throughput averages. Let’s have a look at the 5 minute throughput averages.
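Since makeTrade is the main entry point for the trading algorithm, a query along these lines should do the trick. The m5_rate suffix is the five-minute moving rate that Dropwizard publishes for every timer, and the metric name follows the convention used later in this article:

ts("com.ebberod.trader.Trader.makeTrade.m5_rate")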
This is already starting to look interesting! Something somewhere in the pipeline is making our throughput drop significantly once every 15 minutes. But let’s not get ahead of ourselves! We’re going to do some real troubleshooting in a little while.
Gauges: Publishing your Custom Metrics
Now that we have all the method timing, we may want to add some business metrics. One thing that comes to mind is that it would be great if we could somehow keep track of the amount of money flowing through the system. We can do that using a gauge, which is just a way of telling Dropwizard Metrics how to obtain a certain metric and associating it with a metric name. In our case, we’re summing up the total revenue flowing through the system. Let’s have a look at how the team at Ebberød implemented it:
public class Trader {
    private long totalRevenue;

    public Trader(MarketLink market) {
        this.market = market;
        MetricHelper.getRegistry().register("trader.totalRevenue", new Gauge<Long>() {
            public Long getValue() {
                return totalRevenue;
            }
        });
    }
Let’s examine this! The field totalRevenue is updated by the application whenever a new trade comes through. We’re creating and registering a Gauge in the constructor. A gauge is just a way of telling Dropwizard Metrics how to access the value. The getValue method is the key here: it just reads the current value of totalRevenue. This way, Dropwizard can get a spot value whenever it wants without us having to worry about it. So what will it look like in Wavefront? Something like this:
ts("trader.totalRevenue")
Not surprisingly, it’s a steadily increasing graph. How can we make it a bit more interesting? Here’s the section you’ve been waiting for!
Troubleshooting the Trading Application
Now that we’ve got everything set up, it’s time to troubleshoot the application. To begin with, let’s see if we can get some information out of that revenue graph. At first glance, it looks like it’s just growing linearly, but if we look very closely, there seem to be some bumps. Let’s see if we can find a pattern in the rate of change. We can use the “rate” function for that. Here’s the query:
rate(ts("trader.totalRevenue"))
We’re starting to see two interesting patterns already: The revenue seems to be dropping off every 15 minutes. Also, there seems to be a longer term trend. Let’s see if we can correlate it with something we instrumented from our application. Let’s start with the obvious one: Trading throughput. If they are correlated, it means that the drop in revenue is because of a problem with application performance.
Let’s see if they’re correlated by using the moving correlation function. To smooth out any noise, we’re correlating over a fairly long time range (2 hours). A result of 1 represents perfect correlation and 0 means no correlation at all. Anything above 0.7 usually means that there’s a clear correlation. Let’s have a look! Notice how we refer to a Java method in the query. This happens to be the main entry point for the trading algorithm. The “m1_rate” qualifier means that we want the rate of executions per minute.
mcorr(2h, rate(ts("trader.totalRevenue")), ts("com.ebberod.trader.Trader.makeTrade.m1_rate"))
It looks like revenue and throughput are pretty well correlated, so it’s definitely an application performance issue. What can we learn from looking at the timing of the “makeTrade” method?
ts("com.ebberod.trader.Trader.makeTrade.mean")
It looks like we have two issues here. First of all, we seem to have a spike in the time it takes to run the algorithm about once every 15 minutes. There also seems to be a clear longer term trend. Let’s address the 15 minute spikes first. We could go ahead and run a correlation across all method data and pinpoint it that way, but in this case, the number of methods is probably limited enough to do a visual assessment. Since Wavefront supports any combination of wildcards, we can get the mean execution time for all methods on all classes in the com.ebberod.trader package.
ts("com.ebberod.trader.*.mean")
This looks interesting! You’re actually looking at a live chart from Wavefront, so feel free to hover over the graphs to see which methods contribute to the delay.
If you hover over the bottom spiky graph, you’ll see that it’s the “pickStock” method. This is very likely the one contributing to the spikes in overall delay. Also, at the very top, you have a method called “recalculateDecontribulators” that takes 500ms every time it’s executed. Could these be related?
Time to look at the code! This snippet has the instrumentation code removed for clarity.
private String pickStock() throws InterruptedException {
    // All great thinkers need to take their time, so we wait a bit before we make a decision
    //
    Thread.sleep((long) Math.abs(r.nextGaussian()) * 10);

    // Recalculate decontribulators if needed
    //
    if ((System.currentTimeMillis() / 60000) % 15 == 0) {
        recalculateDecontribulators();
    }

    // Now it's time to make our important decision using the finest math!
    //
    return stockSymbols[r.nextInt(stockSymbols.length)];
}

private void recalculateDecontribulators() throws InterruptedException {
    // Very complicated calculation going on here!!
    //
    long end = System.currentTimeMillis() + 500 + (long) Math.abs(r.nextGaussian()) * 10;
    while (System.currentTimeMillis() < end) {
        long x = 0;
        for (int i = 0; i < 10000000; ++i) {
            x += x % 8748574;
        }
    }
}
It looks like Johnny’s Tires and Quants made a pretty big boo-boo here! Every 15 minutes the decontribulators (whatever that is) are recalculated by the same thread that handles the trading. To make matters worse, it doesn’t even check if it’s already done it, so EVERY trade within that minute will get caught on recalculateDecontribulators. The folks at Ebberød decide to move the recalculation of the decontribulators to a separate thread instead.
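One way to do that, sketched below using java.util.concurrent, is to run the recalculation on its own scheduled thread so the trading path never pays for it. This is just an illustration, not the actual fix from the demo repository.

// Hypothetical sketch: recalculate the decontribulators on a background thread every
// 15 minutes so that pickStock() never blocks on it.
private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

public void startBackgroundRecalculation() {
    scheduler.scheduleAtFixedRate(() -> {
        try {
            recalculateDecontribulators();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }, 0, 15, TimeUnit.MINUTES);
}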
But before they do that, they wanted to make sure that would really solve the problem. In other words, if they got rid of that periodic recalculation, would the spikes disappear? There’s a very easy way we can simulate that! Just subtract the “pickStock” delays from the delays of the top-level “makeTrade” function!
ts("com.ebberod.trader.Trader.makeTrade.mean") - ts("com.ebberod.trader.Trader.pickStock.mean")
Looks like that did the trick! What’s left is just that long term trend and some fairly insignificant noise.
Finding the Reason for the Long Term Trend
Let’s try to find the reason the chart above seems to vary periodically. If you recall, we’re looking at total delay minus the delay caused by those 15 minute spikes. Let’s take that graph and correlate it with the delays of all methods and show only those with a good correlation.
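One way of expressing that is a query roughly like the one below: we take the same subtraction as before and feed it into mcorr over a two-hour window, this time against a wildcard covering every instrumented method. This is just a sketch; in practice you would also filter out anything below your correlation threshold.

mcorr(2h, ts("com.ebberod.trader.Trader.makeTrade.mean") - ts("com.ebberod.trader.Trader.pickStock.mean"), ts("com.ebberod.trader.*.mean"))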
We get two results: MarketLink.placeOrder and MarketLink.waitReply. So it appears the problem happens in the code that communicates with the market and specifically in waitReply, which is awaiting the confirmation of an order.
It turns out that the bank, in their desperate attempt to save money, went for the Super Saver MarketLink Plan, which suffers from serious delays, especially during the times of day with heavy traffic. It appears that Ebberød will have to bite the bullet and upgrade to a better MarketLink plan. But first, let’s simulate what it would look like. We’re subtracting the periodic delay imposed by waitReply and adding a 5ms delay instead, which happens to be the SLA for the more expensive plan.
ts("com.ebberod.trader.Trader.makeTrade.mean") - ts("com.ebberod.trader.Trader.pickStock.mean")- ts("com.ebberod.trader.MarketLink.waitReply.mean") + 5
Looks much better! We went from a worst case of 200ms to a worst case of less than 50ms, which is within the limits of what the bank needs. Armed with this information, the project staff can start working on a get-well plan to bring performance back within reasonable limits.
Tying it Back to Operations
Now that the devops team has gotten some breathing room, they have time to look at some other, less catastrophic issues. One of them is some strange CPU spikes they’ve seen since they deployed the algorithmic trading project. Although the team already had their suspicions, they decided to use the correlation functionality of Wavefront to try to find the root cause. The hypothesis they want to test is that there’s some method that’s called more frequently when the CPU spikes occur. So they correlate the CPU load against the call rate of all methods in the system. It turns out all the CPU spikes can be attributed to, you guessed it, recalculateDecontribulators that we found to be a performance hog earlier in this exercise.
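Expressed as a query, the idea looks roughly like the sketch below. The CPU metric name is a placeholder; substitute whatever metric your collection agent (Telegraf, for example) reports for CPU load.

mcorr(2h, ts("<your.cpu.metric>"), ts("com.ebberod.trader.*.m1_rate"))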
First Bank of Ebberød – A New Rising Star in Banking!
Armed with the insight provided by Wavefront, the First Bank of Ebberød was able to fix its algorithmic trading application at a lower cost than expected and quickly turned an embarrassing failure into a profit machine. Their CIO is now on the front page of every industry publication as the poster child of a miraculous turnaround. Johnny’s Tires and Quants has been propelled from obscurity to a prominent position in the algorithmic trading world. They’re even thinking of scaling back the tire business to focus on the trading side of the house.
In order to ensure a continued smooth ride, the bank implemented a series of dashboards and alerts to help them act proactively. When dropping a new release, for example, they will get instant alerts if transaction volumes fall below the normal range or if delays start creeping up. The CIO even ordered the team to put a monitor in his office showing a high-level dashboard with some key metrics.
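As a hypothetical example, the alert condition for the transaction-volume case could be as simple as the query below; the threshold of 100 trades per minute is made up for illustration.

ts("com.ebberod.trader.Trader.makeTrade.m1_rate") < 100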
Instrument More – Write Less
One of the major lessons learned from the algorithmic trading project is the importance of measuring everything! This is why the VP of R&D mandated that all code written by the bank has to be instrumented at a very granular level. At first the developers were a bit grumpy about this, but they quickly realized that there were tools out there to help them.
One such tool is the annotation feature of Dropwizard Metrics. If a programmer puts the @Timed annotation on a method, it automatically becomes instrumented. It looks like this:
@Timed(absolute = true)
private MarketLink.TXType pickTxType() {
    // Buy or sell? Only the random number generator knows!
    //
    return r.nextInt(10) > 5 ? MarketLink.TXType.BUY : MarketLink.TXType.SELL;
}
This requires you to run your code in an application server that supports AOP (Aspect Oriented Programming). Unfortunately, the algorithmic trading platform is built from POJOs (Plain Old Java Objects), so they needed something that could automatically instrument code that’s not running in an application server. They found MetJo, which allows automatic instrumentation of POJOs without having to change a line of code (or even have access to the source code, for that matter). For more information, please go here.
Conclusion
The fictitious example of the First Bank of Ebberød turns out to be pretty representative of what performance troubleshooting looks like in the real world. The key point to remember is that code instrumentation generates massive amounts of data that need to be analyzed in an instant. This is exactly where Wavefront excels, along with its rich integrations with data collection frameworks such as Dropwizard Metrics and Telegraf.
If you found this article interesting, why not try Wavefront for yourself? You can sign up for a free trial account here!
Useful links:
Wavefront website – https://www.wavefront.com/
Dropwizard Metrics – http://metrics.dropwizard.io/3.2.3/
Dropwizard plugin for Wavefront – https://github.com/wavefrontHQ/dropwizard-metrics
Demo application – https://github.com/prydin/wavefront-ebberod
MetJo instrumentation engine – https://github.com/prydin/metjo