Alerting is a necessity in monitoring if you want to keep production SaaS operations running at high quality. With customer experience the #1 priority in an “always on” world, DevOps teams need to be able to respond to problems in their environment as quickly as possible. Alerts automatically let you know when something starts to go wonky, or more importantly, when customers are being critically impacted and immediate action is required.
But operational alerts present the classic Goldilocks dilemma: set too loosely, and you’ll be flooded with false positives; set too strictly, and you’ll be subject to (invisible) false negatives, leading to customers detecting problems before you do. Because the ramifications of the latter are so painful, most DevOps teams err toward the former, and the result is often alert fatigue taking its toll on on-call and associated operations teams.
Alert fatigue is well worth the investment to solve, and addressing it can produce significant cost savings. In a talk at Monitorama PDX 2016 about reducing alert fatigue, Caitie McCaffrey (Observability Team Tech Lead at Twitter) highlights this well. Ignored alerts lead to unreliable systems, and unreliable systems lead to unhappy customers. At the same time, constant interruption by non-critical alerts means unplanned work, the inability to complete planned work, and less time for engineers to focus on the core business. As for the psychological impact on on-call and ops personnel, it leads to mental fatigue, fire-fighting, and burnout.
How SaaS Leaders Effectively Tackle Alert Fatigue
Wavefront’s founders were previously at Twitter and Google. As developers, they used the custom, in-house monitoring platforms that helped tackle alert fatigue at web scale. In addition, many of Wavefront’s customers are SaaS leaders – Workday, Box, Okta, Lyft, Intuit, Yammer, Groupon, and Medallia, to name a few. In our ongoing partnerships and discussions with their engineering teams focused on improving production operations, we see common best practices for how to address the problem.
Many monitoring vendors try to get in front of the alert fatigue ‘parade’ and argue that it’s primarily a technology problem. But like most DevOps challenges, the best approaches require a combined rethinking of people, processes, and technology. Here are the top 5 things we see SaaS leaders do to reduce the scourge of alert fatigue:
1. Audit alerts and remediation procedures (runbooks).
Reviewing your runbook and alert documentation is usually a great place to start. Runbooks are often filled with outdated assumptions, and they likely include alerts that are no longer relevant to how your environment has evolved.
Start by asking: how complete is your alert documentation? Is there an active process to continuously keep it up to date? Take the time to thoroughly document each alert and make the documentation easily accessible (i.e. so someone on-call can quickly find it). There is valuable context – tribal knowledge – to capture from developers onto paper. Each alert description should cover: why the alert exists, what it means, what causes it, how it impacts customers, and what the remediation steps are. Such documentation enables on-call personnel to actually do something about an alert that has woken them up. It also helps when onboarding new staff.
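One way to keep that documentation consistent (and auditable) is to capture it as structured data alongside the alert definition. The sketch below is a minimal illustration only: the field names mirror the list above, and the AlertDoc class, the example alert, and its runbook URL are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AlertDoc:
    """Structured documentation for a single alert (hypothetical schema)."""
    name: str                       # alert identifier as it appears in your monitoring tool
    why_it_exists: str              # the failure mode this alert is meant to catch
    what_it_means: str              # how to interpret the alert when it fires
    likely_causes: List[str]        # common root causes, in rough order of likelihood
    customer_impact: str            # what the customer experiences when this fires
    remediation_steps: List[str]    # ordered steps for the on-call engineer
    runbook_url: str = ""           # link to the fuller runbook entry, if one exists

# Example entry an on-call engineer could read at 3 a.m. (all values illustrative).
checkout_latency = AlertDoc(
    name="checkout-p99-latency-high",
    why_it_exists="Catches checkout slowdowns before they become failed orders.",
    what_it_means="p99 checkout latency has exceeded 2s for 10 minutes.",
    likely_causes=["payment provider degradation", "database connection pool exhaustion"],
    customer_impact="Customers see slow or timed-out checkout pages.",
    remediation_steps=[
        "Check payment provider status page",
        "Inspect DB connection pool metrics",
        "Fail over if the provider is down",
    ],
    runbook_url="https://runbooks.example.com/checkout-latency",  # hypothetical URL
)
```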
2. Centralize alert creation and management.
It’s very hard to manage and maintain alerting when alerts are created and maintained across a sprawl of disparate monitoring tools. SaaS leaders recognize this and are moving to centralized monitoring platforms that provide shared visibility and customized alert creation at scale (with centralized oversight). Platforms like Wavefront, offered as a hosted service, take this model even further: observability teams can spend more time facilitating alerting for developers and operations, and less time maintaining the alert monitoring system itself. “A single source for all metrics” – combined with an advanced set of functionality to create, test, and maintain practically all of your anomaly alerts – keeps alerts relevant over time.
Referring back to Caitie McCaffrey’s Monitorama PDX 2016 talk on Twitter’s alert fatigue remediation, an under-emphasized enabler is Twitter’s sophisticated, centralized, “single source for all metrics” monitoring and alerting system, which underpins many of the processes and techniques Twitter used to optimize its alerting. Not every SaaS company has the luxury of time, expertise, and money that Twitter had to build such a monitoring and alerting system themselves – which is a big reason why SaaS leaders have turned to Wavefront.
3. Invest more time upfront on alert creation and tuning.
The old adage, “good data in, good data out,” applies here. The more thought you put into alert quality at creation, the more you will reduce alert fatigue at the other end. Start with a simple but effective categorization for alerts – ‘critical’, ‘warning’, and ‘informational’ – then define how each category is handled. Critical alerts, i.e. customer-impacting alerts, also need to be actionable. Critical alerts based solely on machine-specific metrics can be a key source of alert fatigue – make them critical only if the customer is being impacted. Do “warning” or “informational” alerts really need to be fired to your on-call staff via PagerDuty after business hours? Such alerts can still be quite valuable in helping you become more proactive – you should alert on leading indicators, not just problems – but they should be handled differently, as the sketch below illustrates.
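As a concrete illustration of that categorization, here is a hedged sketch of severity-based routing: only customer-impacting criticals page the on-call engineer, while warnings and informational alerts land in channels reviewed during business hours. The severity names come from the text above; the page_oncall and post_to_channel functions are hypothetical stand-ins for your paging and chat integrations.

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"            # customer-impacting and actionable: page immediately
    WARNING = "warning"              # leading indicator: review during business hours
    INFORMATIONAL = "informational"  # context only: keep for trend analysis

def page_oncall(alert_name: str) -> None:
    """Hypothetical stand-in for a PagerDuty (or similar) integration."""
    print(f"PAGE: {alert_name}")

def post_to_channel(alert_name: str, channel: str) -> None:
    """Hypothetical stand-in for a chat/ticket integration."""
    print(f"[{channel}] {alert_name}")

def route_alert(alert_name: str, severity: Severity, customer_impacting: bool) -> None:
    """Route an alert based on severity and customer impact.

    Only critical, customer-impacting alerts wake someone up; everything else
    is queued so it can still drive proactive work without after-hours pages.
    """
    if severity is Severity.CRITICAL and customer_impacting:
        page_oncall(alert_name)
    elif severity is Severity.WARNING:
        post_to_channel(alert_name, channel="ops-warnings")
    else:
        post_to_channel(alert_name, channel="ops-info")

# Example: a host-level disk warning stays out of the pager; a customer-facing
# error-rate alert pages immediately.
route_alert("host-disk-80pct", Severity.WARNING, customer_impacting=False)
route_alert("checkout-error-rate-high", Severity.CRITICAL, customer_impacting=True)
```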
Keeping alerts relevant to your evolving environment is an ongoing commitment, but it is time well invested. Tune alert thresholds. Disable or delete non-actionable alerts. Make sure you are empowering those on-call to do something about the alerts that reach them.
Not to be overlooked is a rigorous way to test new alerts before they reach production, to improve alert quality. Wavefront’s alert creation capabilities provide a unique and popular innovation in this regard: when creating a new alert, users can back-test it against existing data before saving, to see when it would have fired in the past.
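The sketch below illustrates the general idea of back-testing – not Wavefront’s actual implementation – by replaying a proposed alert condition over historical time-series data and reporting when it would have fired, so you can judge noisiness before the alert ever pages anyone. The sample data format and the two-consecutive-points trigger rule are assumptions made for the example.

```python
from typing import List, Tuple

# Historical samples as (timestamp, value) pairs; in practice these would
# come from your metrics store.
history: List[Tuple[int, float]] = [
    (1, 120.0), (2, 450.0), (3, 480.0), (4, 130.0), (5, 510.0), (6, 495.0), (7, 90.0),
]

def backtest_threshold(samples: List[Tuple[int, float]],
                       threshold: float,
                       min_consecutive: int = 2) -> List[int]:
    """Return the timestamps at which the alert would have fired.

    The alert 'fires' when the value exceeds `threshold` for at least
    `min_consecutive` consecutive samples -- a simple stand-in for a
    real alert condition.
    """
    fired = []
    streak = 0
    for ts, value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak == min_consecutive:
            fired.append(ts)
    return fired

# Would a 400ms latency threshold have been too noisy last week?
print(backtest_threshold(history, threshold=400.0))  # -> [3, 6]
```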
4. Apply analytics to define more intelligent alerts.
Analytics-driven, intelligent alerting – the application of data science techniques to improve alert sophistication – is revolutionizing proactive anomaly detection in complex SaaS production environments. An analytics query language lets an engineer formulate queries for complex transformations of collected data (e.g. combining multiple data streams) that can be used for graphing and then immediately converted into alerts. Previously hidden parameters become visible, allowing more fine-tuning and adaptability in alert design than ever before.
Wavefront’s analytics-driven alerting system is built on the industry’s most advanced query engine. As a result, engineers can craft dynamic alert conditions, which are far more powerful than plain old threshold-based alerts. Ultimately, Wavefront users see fewer false alarms and can better predict how their applications and environments will behave going forward.
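To make the contrast with static thresholds concrete, here is a hedged sketch of a dynamic alert condition: instead of comparing a single metric to a fixed number, it combines two data streams (error and request counts) into an error ratio and compares the latest value against a moving baseline of recent history. This is a generic illustration in Python, not Wavefront query syntax, and the window size and 3x multiplier are illustrative assumptions, not recommendations.

```python
from statistics import mean
from typing import Sequence

def dynamic_error_alert(errors: Sequence[float],
                        requests: Sequence[float],
                        baseline_window: int = 30,
                        multiplier: float = 3.0) -> bool:
    """Fire when the current error ratio is well above its recent baseline.

    `errors` and `requests` are aligned per-minute counts for the same
    service. Rather than a fixed threshold, the trigger adapts to each
    service's normal behavior.
    """
    # Combine two raw data streams into one derived series: error ratio per minute.
    ratios = [e / r if r else 0.0 for e, r in zip(errors, requests)]
    if len(ratios) <= baseline_window:
        return False  # not enough history to form a baseline yet
    baseline = mean(ratios[-baseline_window - 1:-1])  # recent history, excluding the latest point
    current = ratios[-1]
    return current > multiplier * baseline and current > 0.01  # guard against tiny baselines

# Example: a sudden jump in error ratio relative to the last 30 minutes triggers the alert.
errs = [2.0] * 31 + [40.0]
reqs = [1000.0] * 32
print(dynamic_error_alert(errs, reqs))  # -> True
```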
5. Follow through on applying the outcomes of incident post-mortems to alert tuning.
This one sounds like an obvious recommendation, but you’d be surprised how many DevOps teams don’t close the loop on their incident post-mortems and make sure recommended fixes are completed. In a talk at SREcon 2015 about outage trend analysis, Sue Lueder (SRE Program Manager at Google) highlighted not only the importance of digging deep in post-mortem reviews to understand the root cause of an incident (or critical alert), but also of making sure the identified fixes actually get completed. Her key recommendation: “intentionally design, execute, and measure fix initiatives.” This adherence to follow-through led to a significant reduction in outage incidents at Google.
Get into the cadence of weekly on-call retrospectives. At these meetings, ensure there is a clear handoff of ongoing issues. Review the alerts that fired over the previous week and ask: Were they actionable? Does the alert’s trigger logic need to change? Should the system be reworked so the alert never needs to fire? Then prioritize the fix, and make sure the work to improve system reliability and on-call support actually gets scheduled. In the end, it’s about nurturing a culture where a critical alert should never fire twice for the same cause – no one should ever get woken up for the same thing twice.
The Results
Implement much of this and the results can be impressive, quickly. Again at Monitorama PDX 2016, Caitie McCaffrey reported that by tackling alert fatigue in a similar way, Twitter achieved a 50% reduction in alerts in one quarter. The additional positive impact: on-call personnel slept through the night, there was more time to do scheduled work while on-call, and new teammates could be ramped up faster.
Don’t let alert fatigue sap your organization’s energy to deliver great customer experience. To get your mojo back with more intelligent alerting, get started with Wavefront now.