
Collectd vs. Telegraf: Comparing Metric Collection Agents

This blog post was written by Evan Pease on behalf of Wavefront.

This article compares the two collector agents we see most often out in the wild world of metrics: Collectd and Telegraf.

This comparison is not meant to be a sweeping deep dive into which one is better or worse. Both are great. While Wavefront will always be a collector agnostic platform that works with both, we will be prioritizing our efforts around integrations on one of them.

The following will explain which one we are choosing to build on and why.

Using Collectd or Telegraf with Wavefront
At a high-level, Collectd and Telegraf aim to do the exact same thing – collect metrics from your systems then output them to some backend storage. In this case the backend is Wavefront.

Both Collectd and Telegraf have built-in OpenTSDB output plugins. Since Wavefront accepts OpenTSDB formatted data, both can quickly be configured to send metrics to Wavefront with minimal effort. We also have a Collectd plugin just for Wavefront and we’ll have one for Telegraf soon as well.
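To make the OpenTSDB point concrete, here is a minimal Python sketch of the OpenTSDB telnet-style "put" line that such output plugins emit. The function name, the host name web01, and the sample value are assumptions for illustration only, and the sketch is not taken from either agent's source code.

```python
import time


def format_opentsdb_put(metric, value, tags, timestamp=None):
    """Build one OpenTSDB telnet-style line: put <metric> <ts> <value> <k=v> ..."""
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the epoch
    if not tags:
        # OpenTSDB requires at least one tag on every data point.
        raise ValueError("at least one tag is required")
    # Sort tags so the output is stable and easy to compare.
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_text}"


# Hypothetical data point: host name and value are made up for the example.
line = format_opentsdb_put(
    "disk_used", 1024, {"host": "web01", "path": "/"}, timestamp=1470000000
)
print(line)  # put disk_used 1470000000 1024 host=web01 path=/
```

Every data point carries its tags inline, which is why a backend that accepts this format can treat them as queryable dimensions.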

Both Collectd and Telegraf are open source projects. Both have a very small server footprint. Both emit very similar metrics out of the box (albeit with some key differences noted later).

Both have an extensive list of plugins. While there are many other similarities I could list, the rest of the article will focus on three specific areas that we felt were important: Community Support, Tagging, and Plugin Architecture.

Community Support

Collectd

The fact that Collectd has been around since 2005 is a testament to the community and the original design by Florian Forster. It is a lightweight daemon written in C that’s small enough to run on embedded devices. Its modular design has made it possible for developers to create over 100 plugins for Collectd spanning many different applications and use cases.

2005 was a long time ago for an open source project. But browsing the activity on GitHub shows that it is still active. While there’s no official corporate sponsor that I’m aware of, there are many unofficial sponsors, including some monitoring vendors that actively contribute plugins for Collectd.

There’s no doubt that many large enterprises use Collectd to monitor production systems which makes it a safe choice (although you’ll want to vet your plugins; more on this later).

Telegraf

A year ago, it wouldn’t have made much sense to compare Telegraf and Collectd. However, over the last year, Telegraf’s popularity has exploded. The number of plugins has grown from 18 to 80 since last August.

Telegraf is officially supported by InfluxData, but the project also has impressive community support. This is evidenced by the number of plugins that have been contributed by the community. Some of our customers have even written their own plugins for Telegraf.

Telegraf is written in Go, which is a popular choice these days. It also has a small enough footprint to run on embedded devices. In our testing, Telegraf used quite a bit less memory than Collectd with the default configuration.

Several people I’ve talked to about Telegraf assumed it was for InfluxDB only, but that’s far from the case. There are currently 18 different output plugins for various time series databases, message queues, cloud monitoring systems, and more. The Telegraf project started in 2015, which means Collectd had a 10-year head start.

Conclusion

In our opinion, both projects pass the test when it comes to community support. Collectd has a long history and an established install base while Telegraf has impressive momentum that seems to be growing rapidly.

One question we asked ourselves is whether Collectd is starting to show its age. To answer this, we have to take a look at what has changed since 2005 in the metrics and monitoring space.

Has the fundamental data model for metrics changed? Are there options that fit better with how we see the future? The next section will go into more detail on this subject.

Tag Support
Modern time series databases support tags. Tags allow users to add 1 to n additional dimensions to their metrics in the form of key value pairs.

This is extremely useful at query time for filtering and grouping aggregations. Modern microservices and containerized systems have many dimensions beyond the metric name and hostname. Older systems like Graphite force users to fit any additional dimensions they care about into the metric name.

And unless every metric has the same number of dimensions, this can make for a very confusing metric namespace.

Collectd

Adding global tags to Collectd metrics is made easy via the write_tsdb plugin. Global tags are applied to every metric that Collectd gathers as they’re flushed to your storage backend.

You can simply add the “HostTags” option to the write_tsdb section of your Collectd config. The example below would append 2 tags to every metric being emitted from Collectd.

HostTags "status=production deviceclass=www"
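As a rough illustration of what “global” means here, the Python sketch below parses a HostTags-style string and attaches the resulting tags to every metric, whatever the metric is. The helper name and the metric list are assumptions made for the example; this is not how write_tsdb is implemented internally.

```python
def parse_host_tags(host_tags):
    """Split a space-separated 'key=value' string into a dict of tags."""
    return dict(pair.split("=", 1) for pair in host_tags.split())


global_tags = parse_host_tags("status=production deviceclass=www")

# Hypothetical metric names gathered by the agent.
metrics = ["df.root.df_complex.used", "df.opt-dev.df_complex.used"]

# Every metric receives the same tags; nothing here is specific to one metric.
tagged = [(name, dict(global_tags)) for name in metrics]

for name, tags in tagged:
    print(name, tags)
```

Because the same dictionary is copied onto every metric, there is no place to say that one particular metric should carry an extra tag of its own.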

This is great when we want tags applied to every single metric, but can we add tags to individual metrics? The answer is sort of. But not without hacks. I’ll explain why this is a problem.

There are many common use cases where Collectd plugins are being used to poll things like SNMP devices, databases, and Docker containers just to name a few.

Each of these systems can and often does have metadata that would be useful to have as tags. For example, if you’re monitoring a containerized application, it would be useful to know what service each container is running, which image, and which version while looking for trends in your metrics.

Your containers may change frequently throughout the day. This is information that is best gathered at collection time. This is a use case for tags.

While Collectd does make it easy to add global tags to every metric, it does not provide a straightforward way to add tags to individual metrics.

Even if you set out to write your own plugin, the Collectd API does not have a concept of tags as of this writing. It looks like the Collectd community is working towards a solution – but it still looks like a bandaid attempting to work around the fact that Collectd’s core has no concept of tags.

Telegraf

Telegraf was designed with tags in mind from day one. This is obvious as soon as you start using it. Let’s look at a couple of examples using the disk_used metric:

disk_used fstype=aufs path=/
disk_used fstype=vboxsf path=/opt/dev
disk_used fstype=vboxsf path=/opt/go

The disk_used metric has 2 tags out of the box: fstype (filesystem type) and path (volume root path). These tags become queryable dimensions in Wavefront. Here is what they look like rendered as a table in Wavefront for two hosts emitting Telegraf metrics into Wavefront.

This makes it very easy to ask a question like:

What’s the total amount of disk being used by my hosts grouped by volume (the path tag)?

sum(ts("disk_used"), path)


This is a very clean and simple query to write!
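For readers who want to see what grouping by a tag amounts to, here is a small Python sketch of the same aggregation done by hand. The host names and values are made up, and the function is only an illustration of the idea, not of how Wavefront evaluates queries.

```python
from collections import defaultdict

# Hypothetical disk_used points from two hosts: (host, tags, value).
points = [
    ("host-a", {"fstype": "aufs", "path": "/"}, 10),
    ("host-a", {"fstype": "vboxsf", "path": "/opt/dev"}, 4),
    ("host-a", {"fstype": "vboxsf", "path": "/opt/go"}, 2),
    ("host-b", {"fstype": "aufs", "path": "/"}, 20),
    ("host-b", {"fstype": "vboxsf", "path": "/opt/dev"}, 6),
]


def sum_by_tag(points, tag):
    """Sum values across hosts, grouped by the value of one tag."""
    totals = defaultdict(int)
    for _host, tags, value in points:
        totals[tags[tag]] += value
    return dict(totals)


print(sum_by_tag(points, "path"))    # {'/': 30, '/opt/dev': 10, '/opt/go': 2}
print(sum_by_tag(points, "fstype"))  # {'aufs': 30, 'vboxsf': 12}
```

Switching the grouping from path to fstype is a one-word change because both are first-class tags on the same metric.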

On the other hand, let’s look at how Collectd represents the exact same metrics for the same volumes:

df.root.df_complex.used
df.opt-dev.df_complex.used
df.opt-go.df_complex.used

Notice how the volume path is embedded in the metric name and slashes were replaced with dashes (df.<path>.df_complex.used). This is not very intuitive to a user who doesn’t know what to look for. These metrics are also missing any information about the file system type, which Telegraf provides via the fstype tag.
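The naming pattern can be sketched in a few lines of Python. The rule below is inferred only from the three metric names shown above (the root volume becomes “root”, other slashes become dashes); it is an assumption, not a copy of the logic in Collectd’s df plugin.

```python
def collectd_df_metric(path):
    """Build a df metric name from a volume path, following the pattern above."""
    if path == "/":
        name = "root"
    else:
        # Drop the leading slash, then replace the remaining slashes with dashes.
        name = path.strip("/").replace("/", "-")
    return f"df.{name}.df_complex.used"


for path in ["/", "/opt/dev", "/opt/go"]:
    print(path, "->", collectd_df_metric(path))
```

Note that the mapping is lossy: under this rule /opt/dev and a volume literally named /opt-dev would produce the same metric name.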

Now if we wanted to ask the same question as we did with Telegraf’s disk_used metric above, how would we do this with Collectd’s disk metrics?

What’s the total amount of disk being used by my hosts grouped by volume (the path tag)?

sum(ts(df.*.df_complex.used), metrics)

This query is not hard to write either. But again, it is just less intuitive than it is with Telegraf. The “metrics” keyword will group any queries by the metric name when passed to an aggregate function in Wavefront.

We could also use Wavefront’s taggify function to create a “virtual” path tag but that requires a more complex query. In any event, if we wanted to do any sort of query on the file system type, we’d be out of luck with Collectd.
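To show why the extra work is needed, here is a Python sketch that pulls the volume segment back out of the metric name and uses it as a “virtual” tag for grouping. This is a plain-Python illustration of the idea, not Wavefront’s taggify function, and the regular expression and sample values are assumptions.

```python
import re
from collections import defaultdict

# Assumes the volume segment itself contains no dots.
DF_USED = re.compile(r"^df\.(?P<volume>[^.]+)\.df_complex\.used$")

# Hypothetical series: (metric name, value).
series = [
    ("df.root.df_complex.used", 10),
    ("df.opt-dev.df_complex.used", 4),
    ("df.opt-go.df_complex.used", 2),
    ("df.root.df_complex.used", 20),
    ("some.other.metric", 99),  # ignored: does not match the df pattern
]


def sum_by_volume(series):
    """Group by the volume segment extracted from the metric name."""
    totals = defaultdict(int)
    for name, value in series:
        match = DF_USED.match(name)
        if match:
            totals[match.group("volume")] += value
    return dict(totals)


print(sum_by_volume(series))  # {'root': 30, 'opt-dev': 4, 'opt-go': 2}
```

The grouping works, but the filesystem type is simply not present in the name, so no amount of parsing can recover it.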

Conclusion

Disk metrics are only one simple example; there are many more. This may not seem like a big issue, but in practice, for example when you’re monitoring complex microservices with many moving parts, having tags can come in very handy. In the end, tags can help you achieve insights faster and make your metrics more usable.

We do believe this is one area where Collectd is showing its age. But since it is actively supported by its community, there’s always the possibility that Collectd will correct it.

Plugin Architecture

Collectd

Collectd has a modular design that supports plugins in a couple of ways.

Many Collectd plugins are available as packages on Linux repositories. This is a nice convenience feature. When you install one it typically downloads the appropriate C module to the appropriate directory along with any dependency packages.

For Python plugins, you just need to enable the Python plugin in Collectd and then configure it to use the directory where your Python files are located.

This is a very open architecture in that you can add plugins without the need to recompile Collectd itself. There is a tradeoff, however. Because it is so open and there are so many plugins, trying to get all of the plugins and dependencies you need installed and configured properly can be an exercise in herding cats. And not all plugins are created equal. I’ll explain.

First, you need to choose the right plugin for your app. This is not always trivial. For example, recently I counted three different ZooKeeper plugins and the official docs are empty as of this writing.

So you’re on your own to find the best one. Not every plugin is guaranteed to have been vetted by the community especially when it comes to dependencies. One of our customers recently installed a Collectd plugin that he claims also installed Hungarian fonts as a dependency. Needless to say his security team had questions.

Second, even the plugins available as Linux packages don’t always work as expected. For example, the Java plugin (collectd-java) installs without error but won’t work on some Linux distributions without manually creating a symlink or moving files around.

See 1, 2, 3. In fairness, it seems to be a Java issue more than a Collectd one, but it is still an annoyance, especially the first time you run into it. Our customers have reported it multiple times. I don’t mean to cherry-pick this one example. Because the architecture is so open and plugins can come from anywhere, there’s potential for problems depending on the quality and freshness of the plugin you’re trying to use.

Telegraf

Telegraf’s plugin architecture is another difference from Collectd. Telegraf has a modular design for plugins as well. But it’s different in that all of its plugins exist in the same repository and compile into the same binary.

This greatly simplifies the installation process because you only need the Telegraf binary and its config file. You don’t have to worry about installing any additional packages or dependencies when you want to use plugins (exception: Java plugins expect that Jolokia is installed).

This also means that choosing the right plugin is much easier because you’ll only have one choice. The tradeoff of course is that it’s not as open as Collectd in the sense that you cannot just drop in newly developed plugins without recompiling.

Writing plugins for Telegraf is straightforward if you have any familiarity with Go. There are plugin interfaces and plenty of examples in the repository that you can base your own plugin on should you need to write one.

Compiling and packaging Telegraf is also fast and easy thanks to the scripts the community has provided in the GitHub repo.

Conclusion

There are good arguments to be made for both approaches to plugins. Collectd’s openness comes with occasional pain. Telegraf’s single-binary approach offers uniformity and a consistent user experience at the expense of not being able to “drop in” new plugins without recompiling.

Writing plugins for both is very easy in my experience, but I’ve only worked with custom Python plugins in Collectd’s case. I have yet to try writing a C plugin. Which brings me to a final point on this subject.

I can’t help but think that Telegraf’s growth may be somewhat correlated to the growth in popularity of the Go language in general. I do not have empirical evidence to support this statement but it does seem to be much easier to find developers today willing to learn/write Go versus C.

Final Thoughts
As I stated at the beginning, both Collectd and Telegraf are awesome. You really can’t go wrong with either.

When recently discussing our future agent strategy at Wavefront, a colleague asked me a great question that put it into perspective: if you were building a system today, which one would you choose and why? Wavefront will always work with both. But in the end, we will focus more of our efforts on one.

Going forward, you will see us publishing more integrations using Telegraf plugins to our community, as well as related blogs on use cases and implementation tips. In short, we like Telegraf’s built-in tag support and uniform plugin architecture. We look forward to the opportunity to contribute to the project and ensure that our users have the plugins they need.

Summary Comparison

| | Collectd | Telegraf |
| Website | https://collectd.org/ | https://github.com/influxdata/telegraf |
| Language | C | Go |
| Date Created | 2005 | 2015 |
| How it works with Wavefront | write_tsdb plugin or Wavefront plugin | OpenTSDB output plugin or Wavefront output plugin (coming soon) |
| Plugin Architecture | Modular design supporting C or Python plugins | Modular design. Plugins are written in Go and included in the Telegraf binary |
| Tag Support | Partial | Full |
| Number of Plugins | 100+ | 80 and growing fast |