Open source is a growing force in today’s world, and it opens up new possibilities for all types of people to contribute to a project. Barriers to contribution and participation are typically low, meaning anyone from anywhere can join a project and contribute. Contributors hail from different countries, time zones and companies; English might not be their first language, and their educational backgrounds, experience, availability and motivations for participating in a community differ.
Despite this diversity, a healthy, inclusive community strives to achieve a common goal through various communication channels such as documentation, discussion forums, meetups and blogs. Those channels are the most visible evidence of a community’s communication style and tone. But much of the work of an open source project happens in issues, commit messages and code reviews. Those communications, while not as readily apparent, often form the basis for a project’s reputation as an inclusive, diverse and welcoming community. To understand the true nature of a project’s health, it’s important to ask: Are pull request (PR) reviews constructive and inclusive? Do they foster a positive environment?
Communication in Open Source
To set some context, a pull request (also known as a merge request) is a method of submitting code or documentation contributions to an open source project. When a pull request is opened, community members review the changes and eventually approve or reject the request.
During this process, the pull request author and community reviewers communicate through comments or reviews. The pull request author starts the conversation with a pull request description explaining the changes made. Then, reviewers look through the changes and provide feedback. Using this communication method, thousands of open source contributors have collaborated to build and maintain well-known open source projects such as Linux, Kubernetes and TensorFlow, as well as thousands of lesser-known projects. It is in this critical back-and-forth exchange between reviewers and author that communication can take a hostile or an inclusive tone. Feedback may encourage the design and delivery of better solutions, or it may demotivate and alienate a contributor.
In fact, a survey conducted by GitHub and collaborators from academia, industry and the broader open source community identified unresponsiveness, unexplained rejection and unwelcoming language as problems prevalent in open source projects.
To address open source project health, the Community Health Analytics Open Source Software (CHAOSS) Project, sponsored by the Linux Foundation, provides metrics that define community health through Augur. Augur collects communication data from open source projects and analyzes it to produce metrics, such as the sentiment of individual comments and the topics of repositories, that can help in understanding the health of a community.
However, these metrics leave gaps in understanding human interaction. Communication in pull requests is conversational, so understanding that interaction means analyzing the back-and-forth exchanges rather than individual comments in isolation. To address this gap, we set out to define and build a process to evaluate constructiveness and inclusiveness in pull request communications. Inspecting each pull request and its associated commentary by hand is time-consuming, tedious and subject to human bias. To solve this problem, we chose to use machine learning and natural language processing to develop a framework that automatically assesses pull request comments for constructive and inclusive communication. Our goal was to develop a repeatable, non-intrusive process that any open source project could adopt to help increase the project’s inclusivity and, thereby, its health.
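To make the idea of a back-and-forth exchange concrete, here is a minimal Python sketch of one way a pull request thread could be split into reviewer-comment and author-reply pairs before any analysis is applied. It is an illustration only, not the framework described in this series: the `Comment` class, the `exchanges` function and the sample thread are all hypothetical, and the pairing rule (each reviewer comment is matched with the next reply from the pull request author) is an assumption chosen for simplicity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Comment:
    """Hypothetical, simplified pull request comment."""
    author: str  # login of the person who wrote the comment
    body: str    # comment text


def exchanges(pr_author: str, comments: List[Comment]) -> List[Tuple[Comment, Comment]]:
    """Pair each reviewer comment with the next reply from the PR author.

    Assumption: comments are already in chronological order. If several
    reviewer comments arrive before the author replies, only the most
    recent one is paired with that reply.
    """
    pairs = []
    pending = None  # latest reviewer comment still waiting for a reply
    for comment in comments:
        if comment.author != pr_author:
            pending = comment
        elif pending is not None:
            pairs.append((pending, comment))
            pending = None
    return pairs


# Invented sample thread, for illustration only.
thread = [
    Comment("alice", "Adds retry logic to the uploader."),
    Comment("bob", "Could you explain why three retries?"),
    Comment("alice", "It matches the default in the client config."),
    Comment("carol", "Looks good to me."),
]
pairs = exchanges("alice", thread)
for reviewer_comment, author_reply in pairs:
    print(f"{reviewer_comment.author} -> {author_reply.author}")
```

Each pair could then be passed to whatever scoring step a project chooses, so that a reply is read in the context of the comment that prompted it.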
In the next part of this blog series, we’ll take a look at the tooling, the framework and the processes we built to help automate this analysis and provide suggestions for improved constructive communications.