In Part 1 of this series, we discussed characteristics that promote healthy open source communities, namely the role played by constructive and inclusive communications. Some communication problems sit on the surface, easily spotted and swiftly remedied when they are off target. But in the inner workings of a project, embedded in the timeliness of pull request (PR) reviews and the content of those reviews, communication can go awry, be insidious, and go unnoticed. By improving the exchange between reviewer and author, a project can quickly change from hostile to welcoming. And that can make all the difference between a project that's thriving and one that's stalled and shrinking. We set off to build automated tooling that projects could adopt to help them not only assess, but also improve, their communication.
Building a Dataset
To develop a framework to automatically assess pull request comments for constructive and inclusive communication, we first needed a dataset to work with. To build and test our framework, we chose an open source project, Kubeflow KFServing, as the test case. Kubeflow is a project familiar to many VMware engineers, so we felt comfortable using it; in addition, its roughly 900 closed pull requests were a good match for the amount of data we wanted for this experiment.
To train a machine learning model, we first needed a set of labels indicating the constructiveness and inclusivity of the comments found in pull requests. A group of volunteers from VMware annotated 400 samples using Data Annotator for Machine Learning, producing a set of observations that each contain a pull request communication thread and labels of constructive and/or inclusive. Our dataset contains all communication on Kubeflow KFServing pull requests: title, description, comments, and reviews. We annotated only 400 samples because data annotation is a time-consuming process and we wanted to preserve substantial data for unseen predictions to validate results. However, since our observations are strings, and models can only perform operations on numbers, we needed to transform those strings into numbers.
Using word2vec (a frequency-based word encoder) followed by padding and truncation, we obtained a set of 512 integers representing each comment or review. A unique integer key represents each word in the original text. The resulting set of numeric observations was then ready for machine learning tasks. Figure 1 below illustrates the steps in converting a string of prose input to a numeric encoding.
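The encoding step can be sketched in plain Python. This is a minimal illustration of the idea, not the project's actual pipeline: the vocabulary builder, function names, and sample comments below are all hypothetical, and the fixed length of 512 comes from the description above.

```python
# Sketch: map each word to a unique integer key, then pad/truncate
# each encoded comment to a fixed length of 512.

def build_vocab(texts):
    """Assign each distinct word a unique integer key (0 is reserved for padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(text, vocab, length=512):
    """Replace words with their integer keys, then pad or truncate to `length`."""
    ids = [vocab.get(w, 0) for w in text.lower().split()]
    ids = ids[:length]                # truncate overly long comments
    ids += [0] * (length - len(ids))  # pad short comments with zeros
    return ids

# Hypothetical review comments for illustration
comments = [
    "LGTM, just fix the typo in the docstring",
    "Could you add a unit test for the error path?",
]
vocab = build_vocab(comments)
row = encode(comments[0], vocab)
print(len(row))  # 512
```

Every comment, regardless of its original length, ends up as a row of exactly 512 integers, which is what makes the stacking described next possible.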
However, using a single string to represent all comments and reviews on a pull request would render the communication as an "essay" rather than a back-and-forth conversation. Preserving that conversational context is critical to understanding where the problems lie: with the author or with the reviewer(s). Therefore, we represented this conversational information by:
1. Obtaining a row encoding for the pull request description and each comment and review in the pull request.
2. Stacking these rows to form a 2-dimensional matrix.
3. Duplicating the 2-dimensional matrix, with one copy representing communication from the author and the other from reviewers.
4. Replacing rows representing comments and reviews from the author in the reviewer matrix with null rows, and replacing rows representing comments and reviews from the reviewers in the author matrix with null rows.
5. Stacking these two matrices together to form a 3D matrix, which is the final input. The final input shape is the number of comments/reviews x 512 (length of each row) x 2.
Padding and truncating in Step 1 and Step 2 ensure that the final size of the 3D matrix is consistent for all pull requests. (See Figure 2.)
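The steps above can be sketched with NumPy. This is an illustrative reconstruction under stated assumptions, not the project's actual code: the function name, the fixed row count of 8, and the channel ordering (author first, reviewers second) are all hypothetical; the row length of 512 and the final shape come from the description above.

```python
import numpy as np

ROW_LEN = 512   # length of each encoded comment/review (Step 1)
MAX_ROWS = 8    # fixed number of rows after padding/truncation (illustrative)

def build_pr_tensor(encoded_rows, is_author_flags, max_rows=MAX_ROWS):
    """Stack encoded comments/reviews into a (max_rows, 512, 2) tensor.

    Channel 0 holds rows written by the author; channel 1 holds rows
    written by reviewers. For each message, the other channel keeps a
    null (all-zero) row in that position, preserving the turn order.
    """
    tensor = np.zeros((max_rows, ROW_LEN, 2), dtype=np.int64)
    for i, (row, from_author) in enumerate(zip(encoded_rows, is_author_flags)):
        if i >= max_rows:          # truncate overly long conversations
            break
        channel = 0 if from_author else 1
        tensor[i, :, channel] = row
    return tensor

# Hypothetical 3-message thread: description (author), review, author reply
rows = [np.arange(1, ROW_LEN + 1)] * 3
flags = [True, False, True]
t = build_pr_tensor(rows, flags)
print(t.shape)  # (8, 512, 2)
```

Because each message occupies one row in exactly one channel, the model can see both who spoke and in what order, rather than a single undifferentiated block of text.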
By encoding comments and reviews individually and separating them into two distinct layers, this structure preserves each message as its own row, captures the flow of the exchange (the sequence of rows), and records which party each message came from (the third dimension of the matrix).
To summarize, in Part 1 of the series we discussed the importance of identifying constructive and inclusive feedback in open source. In this part, we described our experiment, the tools, and the challenges in building an ML model based on text “conversations.” In Part 3, we will discuss how the data is used to train machine learning models that predict constructive and inclusive pull requests, and how the model identifies positive and negative contributors within each comment and review.