Detecting Service Issues from Twitter with Wavefront

This blog post was written by Evan Pease on behalf of Wavefront.

It was the day before Christmas Eve, 2014. Many gamers around the world were unwrapping or downloading Christmas presents in the form of new Playstation games. For online services like Playstation Network and Xbox Live, the days surrounding Christmas are without a doubt the busiest and most important of the year.

Unfortunately for Playstation users that week, satisfaction of their collective online gaming itch would be postponed. Playstation Network was the victim of a crippling, several-day-long DDoS attack from a hacker group calling themselves the Lizard Squad. The outage rendered many popular games completely unplayable.

Like many people, I took to Twitter when I heard about it. Twitter can be useful in this regard because it provides real-time anecdotes from users about their experiences with a particular service, in some cases long before news feeds or even server status pages do. Many companies now even handle customer support directly through Twitter. This makes sense, because when users of a service have a problem, they often take to social media to complain about it first.

Why a Twitter Example?
Twitter can be noisy and difficult to understand sometimes. But Twitter’s APIs provide a means to look for signals in the noise. Many companies invest heavily in analyzing social media, and often view insights collected from social sites as leading indicators of potential service issues, the performance of marketing campaigns, overall sentiment toward their brands, trading signals, and more.

Typically, Twitter demos are reserved for text analytics platforms like ElasticSearch, Solr or Splunk. However, there are many useful metrics that can be collected from a Twitter stream that don’t require the storing of text. Moreover, there’s tremendous value in having real-time business metrics together with application and machine metrics within a single pane of glass. More on this later.

Another benefit for many of our customers is that they don’t need to store logs (or tweets, in this case) for as long as they would have in the past. This greatly reduces the cost of tools such as Splunk or ElasticSearch, which are heavy on storage and memory compared to Wavefront.

Instrumenting the App
We created a simple Twitter streaming application that captures three metrics in real time for a given search term (in this case “@AskPlaystation”, Playstation Network’s support handle):

– Count of tweets with positive sentiment.
– Count of tweets with negative sentiment.
– Total count of tweets (including neutral sentiment).

These metrics are sent as incremental gauges to a StatsD server connected to Wavefront.
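
As a rough illustration (not the code behind this post), here is one way such an app could report a classified tweet to StatsD over UDP. It assumes that “incremental gauges” means StatsD gauge deltas (a signed value such as +1 with the g type). The metric names, the server address, and the record_tweet helper are hypothetical, and how each tweet gets its sentiment label is left out.

```python
import socket

# Hypothetical settings: 8125 is the conventional StatsD UDP port,
# and the metric prefix is made up for this sketch.
STATSD_ADDR = ("127.0.0.1", 8125)
PREFIX = "twitter.askplaystation"


def statsd_gauge_delta(metric: str, delta: int) -> bytes:
    """Format a StatsD gauge update that adjusts the gauge by a signed delta."""
    return f"{metric}:{delta:+d}|g".encode("ascii")


def record_tweet(label: str, sock, addr=STATSD_ADDR) -> list:
    """Send one packet per metric touched by a tweet and return the payloads.

    `label` is "positive", "negative", or anything else (treated as neutral).
    Every tweet bumps the total; only positive/negative tweets bump a second gauge.
    """
    names = ["total"]
    if label in ("positive", "negative"):
        names.append(label)
    payloads = [statsd_gauge_delta(f"{PREFIX}.{name}", 1) for name in names]
    for payload in payloads:
        sock.sendto(payload, addr)
    return payloads


# Usage (requires a StatsD server listening on STATSD_ADDR):
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   record_tweet("negative", sock)
```

UDP is fire-and-forget, so a sketch like this never blocks the tweet stream if the StatsD server is slow or down.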

In Wavefront, we subtract the rate of negative tweets from the rate of positive tweets to create a “sentiment” metric. Using this methodology, 0 would be neutral. Anything above 0 is positive, below is negative.
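
Outside of Wavefront, the same arithmetic can be sketched in a few lines of Python. The function names and the 60-second sampling interval are assumptions made for illustration; in the post itself the calculation is done with a Wavefront query.

```python
def per_second_rate(count_start: float, count_end: float, seconds: float) -> float:
    """Average per-second rate of change of a running count over an interval."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds


def sentiment(positive_rate: float, negative_rate: float) -> float:
    """Positive rate minus negative rate: 0 is neutral, above 0 positive, below 0 negative."""
    return positive_rate - negative_rate


# Hypothetical sample: over 60 seconds the positive count went from 0 to 6
# and the negative count went from 0 to 3.
pos = per_second_rate(0, 6, 60)
neg = per_second_rate(0, 3, 60)
score = sentiment(pos, neg)  # positive, since positive tweets arrived faster
```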

The first chart below is the sentiment metric for @AskPlaystation:

– The blue line is the Neutral line (0).
– The orange line is the daily moving average of our sentiment metric (slow).
– The grey line is the hourly moving average of our sentiment metric (fast).
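
For readers unfamiliar with the fast/slow pairing, a trailing moving average can be sketched as follows. This is a standalone illustration, not how Wavefront computes it; the function name and the (timestamp, value) input shape are assumptions.

```python
from collections import deque

HOUR = 3600   # window for the "fast" line, in seconds
DAY = 86400   # window for the "slow" line, in seconds


def moving_average(points, window_seconds):
    """Trailing moving average over a time window.

    `points` is an iterable of (timestamp_seconds, value) pairs sorted by time.
    Each output value averages the points whose timestamps fall in
    (timestamp - window_seconds, timestamp].
    """
    window = deque()
    total = 0.0
    out = []
    for ts, value in points:
        window.append((ts, value))
        total += value
        # Drop points that have aged out of the window.
        while window[0][0] <= ts - window_seconds:
            _, old = window.popleft()
            total -= old
        out.append((ts, total / len(window)))
    return out


# Usage, given `sentiment_points` as a list of (timestamp, sentiment) pairs:
#   fast = moving_average(sentiment_points, HOUR)
#   slow = moving_average(sentiment_points, DAY)
```

The shorter window reacts quickly to a burst of complaints, while the longer one gives a steadier reference to compare it against.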

The second chart in the examples below is the mention rate for @AskPlaystation. This simply shows us the rate at which users are tweeting at @AskPlaystation.

Observing Normal
The first thing we want to do is observe our charts during the “normal” state, that is, when Playstation Network’s services seem to be performing normally. The time period shown below covers roughly 5 days. In addition to the charts below, I loosely (and very unscientifically) observed @AskPlaystation’s Twitter activity during some of the spikes/dips to look for service issues.

@AskPlaystation Sentiment during normal state.

@AskPlaystation Mention Rate during normal state.

We can see some patterns during the 5 days of “normal”. Sentiment stays in a tight range between -0.3 and 0.55. The mention rate generally is in a range between 0.01 and 0.05. You can clearly see that the mention rate trends up each day, until around 12 PM PT, then spikes again in the evening (after work hours), before descending back down.

Observing a Minor Service Issue
Now that we have a very basic idea of what normal looks like, we can start to look for anomalies and see if they correlate with activity on Twitter. Around 4:30 PM PT on January 26th, sentiment hit its lowest point in the data we’d collected so far.

The 1-hour moving average dips as low as -0.385. This is 28% lower than the low of -0.3 observed during “normal”. Sentiment spends most of the next three and a half hours in negative territory.
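
The 28% figure comes from comparing the new low against the lower bound of the normal range (-0.3). A small sketch of that arithmetic, with a helper name of our own choosing:

```python
def percent_beyond_low(baseline_low: float, observed: float) -> float:
    """How far `observed` falls below `baseline_low`, as a percentage of the baseline low's magnitude."""
    return (baseline_low - observed) / abs(baseline_low) * 100


drop = percent_beyond_low(-0.3, -0.385)  # roughly 28.3
```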
When this new low occurred, I watched tweets flowing into @AskPlaystation on Twitter. It appeared as though the Playstation Store (one of Playstation Network’s services) was having a problem processing transactions for some users. This is a very unscientific conclusion, but whatever the problem was, Sony seemed to resolve it quickly. Also, the Playstation Store being down would only affect a small number of users relative to the total number of users playing online games.

Even for a minor service issue that was resolved relatively quickly, we were able to observe a correlation between the dip in sentiment and users complaining about the Playstation Store not working.

Observing Major Service Issues
On February 1st, Playstation Network experienced a major, world-wide outage.

This was by far the largest service issue we’d seen since we began collecting data, and it is clearly reflected in the charts.

@AskPlaystation mention rate during a major outage.

The mention rate reached as high as 1.472, which is almost 19 standard deviations from the norm. Impressive!
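
A “standard deviations from the norm” measure like this is commonly computed as a z-score. The sketch below shows the general idea. The baseline samples behind the figure above are not listed in this post, so none are reproduced here, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def z_score(baseline, observed):
    """Number of standard deviations `observed` lies from the mean of `baseline`."""
    mu = mean(baseline)
    sigma = pstdev(baseline)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("baseline has no variation")
    return (observed - mu) / sigma


# Usage, given `normal_rates` as a list of mention-rate samples from a quiet period:
#   z = z_score(normal_rates, 1.472)
```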
On February 13th, the day I originally posted this blog, Playstation Network had another outage, with users reporting a “Playstation Network is down for maintenance” error when trying to log in.

Sentiment reached as low as -2. That is nowhere near as large a drop as during the world-wide outage on February 1st, but still several times larger than the minor service issues we’ve been able to observe. The charts from the February 13th outage looked very similar to those from another outage we saw on February 3rd around 12:45 PM PT, with users reporting a similar issue on Twitter.

Comparing Outages
Suppose we wanted to compare the impact of these two outages, February 3rd and February 13th, on sentiment and mention rates. Wavefront gives us the tools to do this easily.

Comparing the impact on sentiment during the outages on 2/3 and 2/13.

Comparing the impact on mention rate during the outages on 2/3 and 2/13.

This shows us very clearly that the February 13th outage had a larger impact on both sentiment and mention rate than the February 3rd outage.
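
The idea behind such a comparison is to line the two incidents up on a common time axis. Outside of Wavefront, that could be sketched as below. This is an illustration only, not the query used for the charts above; the function name, the input shape, and the choice of onset times are assumptions.

```python
def align_to_onset(points, onset_ts, duration_seconds):
    """Re-index a series to seconds elapsed since an incident began.

    `points` is an iterable of (timestamp_seconds, value) pairs. Only points in
    [onset_ts, onset_ts + duration_seconds) are kept, so two incidents that
    happened on different days can be overlaid on the same axis.
    """
    return [
        (ts - onset_ts, value)
        for ts, value in points
        if 0 <= ts - onset_ts < duration_seconds
    ]


# Usage, given `mention_rate` as (timestamp, value) pairs and two onset timestamps:
#   first = align_to_onset(mention_rate, onset_feb_3, 6 * 3600)
#   second = align_to_onset(mention_rate, onset_feb_13, 6 * 3600)
#   peak_first = max(value for _, value in first)
#   peak_second = max(value for _, value in second)
```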

Over the course of this experiment, we were able to observe numerous anomalies in sentiment and mention rates that correlated directly with service issues on Playstation Network. We do not have access to Sony’s machine, application, or business metrics. If we did, there would be all kinds of interesting and valuable ways we could use the analytical power of Wavefront to address service problems. During a DDoS attack, such as the one that occurred over Christmas 2014, we’d be able to clearly identify the affected machines and apps in real time.

Be sure to check out our demos on Alerting and Anomaly Detection to see how!