Features

The Challenge of Measuring Progress in Open Source Projects – Part One

By Alexandre Courouble and John Hawley

How do you know if your open source project is succeeding? Isn’t it just a question of tallying contributors, commits or stars on GitHub?

It turns out that this is a surprisingly soft question with no real hard answers to grab onto, at least at present. In part one of a two-part series, we’ll look at the existing challenges facing the creation of solid progress metrics for open source development. Then, in part two, we’ll explore how we might start to make things better.

What numbers should you pick?

The first thing to note is that the open source community doesn’t – and perhaps can’t – agree on what metrics really count.

At the most basic level, we can argue over whether total commits, contributor counts or stars are good ways to gauge impact.

Each of these does tell you something about the scale of, and interest in, a project. But stars and total commits, the easiest numbers to tally, are so crude that they're essentially useless for drawing all but the most basic conclusions. An important new project might not have many stars, contributors or commits. Another might grow in value to the community in proportion to how often it changes or how many people contribute to it. Bottom line: we need more granular data.
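
To get a feel for just how little work these headline numbers take to produce, here's a minimal Python sketch that pulls them from the GitHub REST API. The repository name is only an example, and the fields printed are the same figures the GitHub web page surfaces:

    import requests

    # One unauthenticated API call returns the headline numbers for a repo.
    # That convenience is exactly why stars and forks get over-used.
    repo = requests.get(
        "https://api.github.com/repos/kubernetes/kubernetes", timeout=10
    ).json()

    print("stars:   ", repo["stargazers_count"])
    print("forks:   ", repo["forks_count"])
    print("watchers:", repo["subscribers_count"])

None of those figures tells you who depends on the code or how well it is maintained.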

It’s also worth noting that different groups will clearly prioritize different metrics. It should come as no surprise that VMware managers, for example, might care about the degree to which individual employee contributors are having an impact. The leaders of a project like Kubernetes, meanwhile, might want to track contributor diversity to ensure the project isn’t reliant on too few people from too narrow a set of institutions.

Then there’s the challenge of definitions. Many projects instinctively reach for the commit as a baseline metric, but each uses git workflows in its own way. It’s not unreasonable that different projects have different rules for how they want to write a commit message, and that makes the commit an especially hard metric to nail down.
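
As one concrete illustration, here's a rough sketch (run inside a local clone) that counts commits carrying a kernel-style Signed-off-by trailer. A project that mandates sign-offs and a project that has never used them will produce very different-looking commit streams for the same amount of work:

    import re
    import subprocess

    # Print each full commit message, NUL-separated so bodies containing
    # blank lines aren't confused with commit boundaries.
    bodies = subprocess.run(
        ["git", "log", "--format=%B%x00"],
        capture_output=True, text=True, check=True,
    ).stdout.split("\x00")[:-1]  # drop the trailing empty chunk

    signed = sum(
        1 for body in bodies
        if re.search(r"^Signed-off-by:", body, re.MULTILINE)
    )
    print(f"{signed} of {len(bodies)} commits carry a sign-off trailer")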

Unreliable numbers for open source projects

It turns out there’s a much more fundamental problem with open source metrics, especially for managers looking for a solid understanding of how their open source teams are doing: the hard numbers available to us at present simply don’t stand up to scrutiny. This is an issue with git commits in particular, and that’s a problem because patterns of commits are the clearest window into what’s actually going on in your project’s code.

You might think you can track commits with relative ease and thus see how things are going. After all, we have large repositories for source code data like GitHub and GitLab that we can interrogate. But when you try to query these data sets, you quickly run into trouble.

GitHub, for example, places a ton of limitations on the availability of its data. These are sometimes imposed for good, technical reasons. Sometimes, though, they simply reflect the reality that Git’s design was optimized for Linus Torvalds’ laptop rather than for a large, dispersed server and distribution infrastructure. The upshot is that GitHub throttles the API for querying its data, which means you can’t rely on a GitHub API query to provide a comprehensive record of how any of the code it hosts is developed.
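
You can watch the throttling happen. The REST API reports its own budget through the /rate_limit endpoint, and an unauthenticated client gets on the order of 60 requests per hour (roughly 5,000 when authenticated), nowhere near enough to page through the history of a large project 100 commits at a time:

    import requests

    # Ask GitHub how many API calls this client has left in the current
    # window. Exhaust the budget and further queries fail until it resets.
    limits = requests.get("https://api.github.com/rate_limit", timeout=10).json()
    core = limits["resources"]["core"]
    print(f"limit: {core['limit']}, remaining: {core['remaining']}")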

In response, users have imported data into large data sets like GH Archive or Google’s BigQuery. These are convenient to query (although you have to pay to use BigQuery in any kind of volume), but at the same time, they do things that seriously compromise their value as comprehensive sources for performance metrics.
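
As a sketch of what that querying looks like, here's a BigQuery run over the public githubarchive tables, counting push events per repository for a single month. The table naming convention and schema shown are the ones the dataset uses at the time of writing, and running it assumes a Google Cloud project with billing enabled:

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes GCP credentials are configured

    # Count push events per repository across one month of GH Archive data.
    query = """
        SELECT repo.name AS repo, COUNT(*) AS push_events
        FROM `githubarchive.month.202301`
        WHERE type = 'PushEvent'
        GROUP BY repo
        ORDER BY push_events DESC
        LIMIT 20
    """
    for row in client.query(query).result():
        print(f"{row.repo}: {row.push_events}")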

GH Archive and Google strip identifying information, such as a user’s email address and their GitHub ID, from the commit. A team manager won’t be able to track team commits or code contributions unless they already know exactly which commits each team member has made. They do, generally, record the domain name of the user’s email address (e.g. @vmware.com), but many open source contributors make their contributions under, and are known by, their personal email address. That means counting up all the @vmware.com contributions in a given period would result in a highly inaccurate measure of VMware-related contributions as a whole.
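
To see why, try the domain arithmetic yourself on a repository you have cloned locally. This sketch tallies commit author email domains straight out of git; any contributor committing under a personal address simply vanishes from the corporate column:

    import subprocess
    from collections import Counter

    # %ae prints the author email of each commit, one per line.
    emails = subprocess.run(
        ["git", "log", "--since=2023-01-01", "--format=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    domains = Counter(e.rsplit("@", 1)[-1] for e in emails if "@" in e)
    for domain, count in domains.most_common(10):
        print(f"{domain}: {count}")

A gmail.com line at the top of that list might represent a dozen different employers, or none.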

GH Archive, meanwhile, pulls activity information in from GitHub directly. That’s great, but evidence suggests that GitHub’s backend processes don’t reliably deliver all the data. It’s not just git commits, either. Changes to wiki pages, bug tracking information and other activities just seem to go unrecorded at times. Those of us who have looked for information we know should be there have, on multiple occasions, been unable to find it without going back and poking around in GitHub to bring it into view, which defeats the purpose of querying the archive in the first place.

At this point, we also have no idea how much data is missing, so we can’t even approximate totals from partial results. For what it’s worth, you can track commits in these data sets at the repository level, but even then, both GH Archive and Google’s BigQuery limit what they copy.

Moving toward a solution

For now, anyone looking to measure the success, health or value of a project is in a bind. We have to ask convoluted questions that take time and effort to formulate, then dig deep for the answers, and what comes back only hints at what we really want to know. The first step in any recovery is to acknowledge that you have a problem. We’ve come that far, at least. Next time, we’ll look at how we can start to make things better.

Stay tuned to the Open Source Blog for Part 2 of this series on open source metrics, and be sure to follow us on Twitter (@vmwopensource) as well.