I recently chatted with Scott Bonebrake, Principal Software Engineer on the Data Engineering and Analytics (DEA) team at Microsoft Yammer. Yammer is a secure enterprise social network, internal to an organization, that enables people to connect and engage across their company. We discussed why Yammer chose Wavefront by VMware and how they are using Wavefront to increase their ROI. In this blog, I recap key highlights of Scott’s VMworld 2019 presentation, including advanced but useful topics such as Dashboard/Alert Templating and Dashboard Variables.

Before Wavefront – Capacity, Reliability and Support Issues

Yammer’s engineering organization has always been metrics heavy. Engineers used Ganglia but quickly outgrew its ability to support the volume of metrics and encountered reliability issues. They started to hit capacity problems when their cluster had around 10 VMs, and supporting Ganglia cost them nearly a full-time engineer. Moreover, they couldn’t adequately monitor their system because their metrics tools failed so frequently.

Wavefront as a Solution

Yammer engineers had to find a reliable solution that scales. And this is where Wavefront was the perfect fit. Here are some of the many advantages of Wavefront:

  • Retention. Yammer’s observability data is retained long-term with high resolution without summarization or aggregation of metrics.
  • Fine-grained visibility. Developers can usually pinpoint the specific date, deployment, and commit, which is great for troubleshooting. Other solutions they tried did not provide this type of visibility.
  • Trends analysis. Yammer’s engineers can track long-term trends and answer questions such as: When did this endpoint start to slow down?

Full-Stack Enterprise Observability with Wavefront

Most of the Yammer service metrics come from their Dropwizard metrics library. Dropwizard was created at Yammer as an open-source framework for REST API development. About 90 of their 100 microservices use Dropwizard. The last 10 are Rails services, including services associated with the original Rails monolith. Those services use StatsD to send metrics to the Wavefront proxy. The DEA team also uses Python scripts to extract metrics from status endpoints for HAProxy, Airflow, cAdvisor, and Fluentd.
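To make the status-endpoint extraction more concrete, here is a minimal Python sketch of turning a status payload into metric lines in the Wavefront data format (`<metricName> <metricValue> [<timestamp>] source=<source> [pointTags]`). It is not Yammer’s actual script: the payload, the `haproxy.frontend` prefix, the `lb-01` source, and the `cell` tag are all assumptions made for illustration.

```python
import time


def format_point(name, value, source, tags=None, timestamp=None):
    """Render one metric as a line in the Wavefront data format."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    parts = [name, str(value), str(ts), f"source={source}"]
    # Point tags are appended as key="value" pairs, sorted for stable output.
    for key, val in sorted((tags or {}).items()):
        parts.append(f'{key}="{val}"')
    return " ".join(parts)


def extract_points(status, prefix, source, tags=None, timestamp=None):
    """Turn the numeric fields of a flat status dict into metric lines."""
    return [
        format_point(f"{prefix}.{field}", value, source, tags, timestamp)
        for field, value in sorted(status.items())
        # Skip non-numeric fields such as a textual health state.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    ]


# Hypothetical payload, as a script might parse it from a status endpoint.
status = {"sessions": 42, "queue_depth": 3, "state": "UP"}
lines = extract_points(
    status, "haproxy.frontend", "lb-01", {"cell": "cell-a"}, timestamp=1567000000
)
for line in lines:
    print(line)
```

In a real deployment, a script like this would write each line to the port on which the Wavefront proxy is configured to listen, rather than printing it.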

Yammer also utilizes the new Wavefront Azure integration to monitor alerts from Azure resources. They have a single dashboard to monitor both their Azure service and any Azure resources, like storage accounts, queues, Event Hubs, or databases, and no longer have to switch between different observability solutions. The Yammer DEA team also has a custom metrics pipeline built using Azure Event Hubs and Storm for preprocessing. This pipeline allows them to convert some legacy metrics formats and to filter out some unimportant metrics.

 

Yammer Engineers Love Wavefront

Reduce Tool Sprawl with Wavefront

Yammer engineers had a wide variety of tools to monitor services, and each team was free to pick which tools they wanted. Their Ruby team had New Relic, their Java team had in-house tools, while the infrastructure teams used tools like Ganglia and Check MK. The result was low visibility of service and infrastructure health between teams. Each team was familiar only with their own tools.

Supporting so many solutions became a problem. Yammer’s solution was to start using Wavefront for all of those tasks. All of their services, VMs, and databases are required to send metrics to Wavefront. Each service must have associated Wavefront alerts that notify its on-call team when failures occur, and no known issues should be outstanding before anything rolls to production. Each service must also have at least one dashboard that displays the service health in all regions. Anytime a site reliability issue goes undetected, they make sure to instrument a new alert so that they can detect it faster in the future. At Yammer, they consider it a failure if their customers have to inform them about a problem. So, they need to detect any error early, and Wavefront alerts help them do it.

Exponential Growth with Wavefront

Over four years, the Microsoft Yammer cloud service grew rapidly. When the DEA team moved their cloud services out of the data center to Azure, their metric numbers rapidly increased. That’s because, in parallel, DEA containerized their microservices, which resulted in thousands of Docker containers as new metrics sources. And as Yammer’s services grew in complexity and metrics volume, so did their alerts and dashboards. Yammer currently serves millions of customers worldwide across hundreds of thousands of networks in all geographies. They have over 3,000 alerts and 500 dashboards.

Today, Wavefront is a required part of the Yammer engineering process. Developers are expected to use Wavefront to observe issues in their code. Below are a few use cases that Scott presented in his session at VMworld 2019.

Increase ROI by Automating CI/CD Processes

Initially, the Yammer engineering team’s validation process for code deployments was manual and time-consuming. Their CI/CD process was like this:

  • Open the Wavefront dashboard for the service
  • Kick off the release in Azure DevOps, which is a continuous integration service
  • Watch the Wavefront dashboard for 5 or 10 minutes to verify that everything is deployed correctly to the particular cell (a cell is Yammer’s unit of deployment – a set of containers)
  • Repeat for each cell

With this process, each engineer was personally engaged in deploying their own pull requests. Engineers performed this process 20-50 times per day across many services, and each pull request could easily take an hour of an engineer’s time that they didn’t want to waste. At roughly one hour per deployment, eliminating that manual work saves on the order of 20-50 engineering hours per day, which is a significant achievement from the ROI perspective.

Now Yammer engineers use Wavefront to automatically validate a service rollout. For five to ten minutes after a service deployment, Yammer’s automated scripts check for alerts tagged with the selected cell and the service name.

Automatic Validation of Service Rollout

These scripts automatically monitor the service’s error rates, response codes, and error log rates. This automation lets engineers be completely hands-off. Engineers merely click a button on their pull request, and it automatically rolls out through all of the cells. This process saves considerable time for Yammer’s engineers and avoids frustration.
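The validation logic described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Yammer’s script and not the Wavefront API: the `Alert` type, the tag values, and the idea of receiving the polled alert states as a list of snapshots are all hypothetical, and a real script would fetch alert states from Wavefront during the validation window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    tags: frozenset  # hypothetical: tags carry both the service and the cell
    firing: bool


def blocking_alerts(alerts, service, cell):
    """Names of firing alerts tagged with BOTH the service and the cell."""
    wanted = {service, cell}
    return [a.name for a in alerts if a.firing and wanted <= a.tags]


def validate_rollout(snapshots, service, cell):
    """Check every alert snapshot polled during the validation window.

    Returns (True, []) if no relevant alert fired, otherwise
    (False, names_of_firing_alerts) for the first failing snapshot.
    """
    for alerts in snapshots:
        failing = blocking_alerts(alerts, service, cell)
        if failing:
            return False, failing
    return True, []


# Two hypothetical polls taken after deploying service "feed" to "cell-a".
snapshots = [
    [
        Alert("feed 5xx rate", frozenset({"feed", "cell-a"}), False),
        Alert("search error logs", frozenset({"search", "cell-a"}), True),
    ],
    [Alert("feed 5xx rate", frozenset({"feed", "cell-a"}), True)],
]

result_a = validate_rollout(snapshots, "feed", "cell-a")
result_b = validate_rollout(snapshots, "feed", "cell-b")
print(result_a)
print(result_b)
```

Filtering on both tags keeps an unrelated alert in the same cell (the `search` alert here) from blocking the rollout of a different service.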

Yammer still allows manual deploys. If engineers have a substantial pull request or need to validate metrics that are not covered by the alerting, they can validate the service rollout manually. Most often, though, they prefer automated deployments. Pull requests that are deployed automatically are usually low risk because they are associated with small code changes.

Use Powerful Dashboard Variables to Get Regional Insights

Another use case for Wavefront at Yammer is related to using dashboard variables to customize a large number of dashboards. When a user modifies a dashboard variable, the related change propagates into all of the queries and into all of the charts using that variable. Developers set up dashboard variables so that users can select the values from a dropdown list. Yammer uses dropdown lists, such as ‘cell’ and ‘region’. That allows users to have insights into each cell and region across a large number of charts. For example, a dashboard can show charts for region A or region B by allowing a user to select the appropriate region from the dropdown list.
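As a rough illustration of how one variable change propagates into every chart, the Python sketch below substitutes `${cell}` and `${region}` placeholders into a set of query templates. The metric names, variable values, and chart titles are assumptions for the example; in Wavefront itself, the dashboard performs this substitution when a user picks a value from the dropdown list.

```python
from string import Template

# Hypothetical chart queries that share the same dashboard variables.
chart_queries = {
    "Request rate": 'ts("service.requests.rate", cell="${cell}" and region="${region}")',
    "Error rate": 'ts("service.errors.rate", cell="${cell}" and region="${region}")',
}


def render_dashboard(queries, variables):
    """Substitute the selected variable values into every chart query."""
    return {
        title: Template(query).substitute(variables)
        for title, query in queries.items()
    }


# Selecting a different region re-renders all charts at once.
region_a = render_dashboard(chart_queries, {"cell": "cell-1", "region": "region-a"})
region_b = render_dashboard(chart_queries, {"cell": "cell-1", "region": "region-b"})
for title, query in region_a.items():
    print(f"{title}: {query}")
```

Because every chart reads from the same variable values, a single dropdown selection keeps all charts on the dashboard scoped to the same cell and region.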

Dashboard and Alert Templating for Chart and Alert Generation

Yammer’s engineers developed a Ruby domain-specific language that generates dashboards from function calls. A single line of code can generate several charts based on standard types of Dropwizard metrics, such as HTTP response rates or error logs per second. That way, engineers can ensure that all of Yammer’s Dropwizard-based services expose the same standard set of metrics on their dashboards. It only takes about 10 minutes to spin up a dashboard.
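Yammer’s DSL is written in Ruby and was not shown in detail, so the following is only a Python sketch of the general idea: one function call expands into a standard group of charts. The function names, metric names, and dashboard structure are all invented for illustration.

```python
import json


def chart(name, query):
    """A chart is reduced here to a name plus a single query string."""
    return {"name": name, "query": query}


def dropwizard_http_charts(service):
    """One call expands into a standard set of HTTP charts for a service."""
    prefix = f"{service}.http"  # hypothetical metric naming scheme
    return [
        chart("2xx responses/sec", f'ts("{prefix}.2xx.rate")'),
        chart("5xx responses/sec", f'ts("{prefix}.5xx.rate")'),
        chart("Error logs/sec", f'ts("{service}.logs.error.rate")'),
    ]


def dashboard(name, *chart_groups):
    """Flatten any number of chart groups into one dashboard definition."""
    charts = [c for group in chart_groups for c in group]
    return {"name": name, "charts": charts}


# A single line of "DSL" produces a dashboard with several standard charts.
feed_dashboard = dashboard("Feed service", dropwizard_http_charts("feed"))
print(json.dumps(feed_dashboard, indent=2))
```

Since the definition is ordinary code, it can live in the same repository as the service and go through the same review process as any other change.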

Yammer engineers like templating because it puts dashboard creation and modification under the source control process. Developers can make sure that any changes, or any new dashboards, go through a review process. Thus, Yammer’s engineering can ensure a high quality of all dashboards.

Developers can also create alert templates, and all alerts are peer-reviewed. Each alert must link to a runbook and a PagerDuty service. The advantage of this approach is that one YAML file generates four different alerts, one for each region, which ensures consistent alerting across all of Yammer’s regions. Nothing gets left out, noise stays low, and alerts cannot drift out of sync between regions. As a result, alert templates make it easy to scale observability.
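The per-region expansion can be sketched as below. The template is shown as a Python dict standing in for the parsed YAML file; the region names, field names, metric name, runbook URL, and PagerDuty service name are assumptions, not Yammer’s actual schema.

```python
# Hypothetical region names; the source only says there are four regions.
REGIONS = ["region-1", "region-2", "region-3", "region-4"]

# Stand-in for one parsed YAML alert template (all field names are assumed).
template = {
    "name": "feed: high 5xx rate",
    "condition": 'ts("feed.http.5xx.rate", region="{region}") > 5',
    "runbook": "https://wiki.example.com/runbooks/feed-5xx",
    "pagerduty_service": "feed-oncall",
}


def expand_alert_template(template, regions):
    """Generate one alert per region from a single template."""
    # Enforce the rule that every alert links to a runbook and PagerDuty.
    for required in ("runbook", "pagerduty_service"):
        if not template.get(required):
            raise ValueError(f"alert template is missing '{required}'")
    return [
        {
            **template,
            "name": f'{template["name"]} [{region}]',
            "condition": template["condition"].replace("{region}", region),
            "region": region,
        }
        for region in regions
    ]


alerts = expand_alert_template(template, REGIONS)
for alert in alerts:
    print(alert["name"], "->", alert["condition"])
```

Generating all four alerts from one definition means a threshold change is reviewed once and applied everywhere, which is what keeps the regions consistent.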

Using Wavefront Dashboards for Collaboration

Another interesting Wavefront use case at Yammer is using dashboards for engineering collaboration or as a wiki-style knowledge base. Within a Wavefront dashboard, engineers can add important links to various other places that are used for troubleshooting.

For example, in a microservices world, it is pretty common for one service to call several other services. If you see an increased error rate in one service, it may be helpful to look at downstream services to check whether they’re also experiencing an increased error rate, because problems within downstream services might be causing the incident in the first place. You can use the Wavefront dashboard wiki to learn about those downstream services. Another way to understand whether a downstream microservice is causing the incident is to use Wavefront Distributed Tracing.

Benefits of Using Wavefront

Wavefront benefits the Yammer engineering team in many ways. Yammer can ship code faster with less hands-on involvement from engineers. Also, observability is mandatory at Yammer, and metrics are an essential part of the engineering process.

Benefits of Wavefront for Yammer Engineering

Furthermore, Wavefront is the key to maintaining Yammer’s SLAs because it is a fast and reliable observability platform. And – Scott’s favorite – instead of maintaining the monitoring solution, developers spend their time developing features.

Summary

Yammer engineering chose Wavefront for its reliability and its ability to scale well. Yammer engineers also use innovative concepts to get additional value from Wavefront. Among them, automated dashboard and alert generation is implemented as part of Yammer’s codebase. By using Wavefront as an integral part of their CI/CD pipeline, Yammer engineering delivers better code, faster. If you want to find out how Wavefront can benefit you, try it for free here.

The post Yammer Increases Code Reliability and Saves Developers’ Time by Using Wavefront appeared first on Wavefront by VMware.