More and more developers are being hired by technology companies to make open source contributions. If you are in that situation, you might think your employer is ultimately responsible for the code and other materials you share. But that’s really not the case. Anything you upload to an open source project is your responsibility and yours alone.
In this two-part series, we’re going to suggest some tips for preventing sensitive information from leaving your internal workspace when you make open source commits. First, we’ll cover proactive measures that you can take. Next time, we’ll look at what you can do once sensitive information has already appeared online.
It’s worth starting out by clearly stating what developers are responsible for when it comes to making contributions. For us, it boils down to six things:
- Acting in compliance with the project’s chosen open source license.
- Sharing high-quality code.
- Ensuring that your code conforms to current security standards.
- Running security checks on your code before pushing it out.
- Updating your dependencies – i.e. making sure you use the latest versions of pre-existing code that your own code builds on.
- Filing issues when you notice problems that need to be solved – and not sharing code that you know to be compromised.
If you maintain a project that your company has open sourced, you have additional responsibilities. You need to check that other employees aren’t sharing confidential or personally identifying information, and you need to make sure that shared information complies with the project’s licensing agreements. A lapse in either area exposes your employer to potential legal action. Lastly, you need to watch out for information that reveals or shares corporate IP.
Luckily, a variety of open source practices and tools can help you manage open source contribution workflows safely. To use Git as an example, here’s a typical workflow.
The basic commit loop shown here is pretty well known, but this workflow has an extra step: a pre-commit hook. This is a script that Git runs just before a commit is recorded. It checks the pending changes and decides whether the commit continues or exits (the “X” in the workflow); a non-zero exit status aborts the commit. Because it runs before the commit, it’s called a pre-commit hook.
Pre-commit hooks can save you an enormous amount of pain. They can detect public or private keys, for example, or AWS credentials. Other pre-commit scripts run validators that confirm a particular piece of code will be accepted when committed, or linters that check for consistent style or look for anti-patterns that might have security implications. You can find a library of pre-made hooks in a community project called pre-commit, at pre-commit.com.
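To make the idea concrete, here is a minimal sketch of a secret-scanning hook written in Python. The two patterns and the helper names (`added_lines`, `find_secrets`, `staged_diff`) are our own illustrative assumptions, not part of Git or of the pre-commit project, and a real scanner would use a far larger rule set.

```python
import re
import subprocess
import sys

# Illustrative patterns only (assumptions); real scanners ship many more rules.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private key header": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
}


def added_lines(diff):
    """Keep only the lines a diff adds, skipping the '+++ b/file' headers."""
    return [
        line[1:]
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def find_secrets(lines):
    """Return (position, label) pairs for lines that match a pattern."""
    findings = []
    for position, line in enumerate(lines, start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((position, label))
    return findings


def staged_diff():
    """Ask Git for the staged changes that are about to be committed."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def main():
    findings = find_secrets(added_lines(staged_diff()))
    for position, label in findings:
        print(f"possible {label} in added line {position}", file=sys.stderr)
    # A non-zero exit status tells Git to abort the commit.
    return 1 if findings else 0
```

To try it as a hook, you would save the file as `.git/hooks/pre-commit`, add a Python shebang line and a final `sys.exit(main())` call, and make the file executable.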
Once you have set up your pre-commit hooks, you can use templates to propagate them to your other projects. To do that, run git init with the --template flag and the path to the template directory that you want to use. Git copies the template’s contents, including its hooks, into the new repository’s .git directory.
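The template approach can be sketched in a few lines of Python. The helper names below (`write_template`, `init_command`, `init_from_template`) are hypothetical; the only Git feature relied on is the `--template` option of `git init`.

```python
import stat
import subprocess
from pathlib import Path


def write_template(template_dir, hook_script):
    """Create <template_dir>/hooks/pre-commit and mark it executable."""
    hook_path = Path(template_dir) / "hooks" / "pre-commit"
    hook_path.parent.mkdir(parents=True, exist_ok=True)
    hook_path.write_text(hook_script)
    hook_path.chmod(hook_path.stat().st_mode | stat.S_IXUSR)
    return hook_path


def init_command(template_dir, repo_dir):
    """Build the git init call that copies the template into a new repo."""
    return ["git", "init", f"--template={template_dir}", str(repo_dir)]


def init_from_template(template_dir, repo_dir):
    """Run git init; the template's hooks land in <repo_dir>/.git/hooks."""
    subprocess.run(init_command(template_dir, repo_dir), check=True)
```

Calling `write_template` once and then `init_from_template` for each new project gives every repository the same starting set of hooks. Hooks copied this way are not updated automatically if the template changes later.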
Three tools can be useful in your project’s initial phase, before you have released it widely: git-rebase, git-filter-branch, and BFG. Each lets you modify your git history and thereby ensure that your repository isn’t exposing any proprietary information.
Generally speaking, you want to avoid modifying your history, for both technical and ethical reasons. From the technical perspective, squashing commits together makes it harder to locate and then fix any problems that turn up. It also muddies the issue of who should be notified when there’s a problem. Rebasing changes your commit hashes as well. People use those hashes in links and as references in their interactions with each other, so whenever you break hashes, you break those connections.
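Why does rewriting one commit change the hashes of everything after it? Each commit records its parent’s hash, and a commit’s own hash is computed over its content. The sketch below illustrates this, assuming Git’s default SHA-1 object format. The commit body is deliberately simplified (real commits also carry author and committer lines), so the commit IDs here won’t match anything in a real repository.

```python
import hashlib

# ID of the empty tree in a SHA-1 repository.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def git_object_id(kind, body):
    """Hash an object the way Git does: SHA-1 over a header plus the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parent, message):
    """Simplified commit body; author and committer lines are left out."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


root = git_object_id("commit", commit_body(EMPTY_TREE, None, "initial"))
child = git_object_id("commit", commit_body(EMPTY_TREE, root, "add feature"))

# Rewriting the root commit's message gives it a new ID ...
new_root = git_object_id("commit", commit_body(EMPTY_TREE, None, "initial (cleaned)"))
# ... and the child, whose own content is unchanged, gets a new ID too,
# because the parent hash it records is part of what gets hashed.
new_child = git_object_id("commit", commit_body(EMPTY_TREE, new_root, "add feature"))
```

Any link or issue comment that pointed at `child` now points at a commit that is no longer part of the rewritten history.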
From an ethical perspective, combining commits can complicate or obscure who gets the credit for any piece of code. Inclusive and diverse communities rely on people getting credit, and when you don’t give credit, people walk away. So that’s another reason to be wary of changing your history.
That said, changing your history is a really simple way to avoid releasing problematic code or information – so it’s something to consider when in the origination phase of writing code.
Let’s look at the three tools mentioned in the graphic above in turn. The first tool is git-rebase, which is both built into git and easy to use. Of course, that ease of use also makes it easy to squash commits together, which brings back the technical and ethical issues described above.
The second tool, git-filter-branch, is part of git and quite useful because you don’t necessarily have to lose your commit boundaries. You can set up functions and variables and then filter either specific directories or a whole tree. You can look for the commits you need to modify and then decide how to modify them. Renaming tags that might contain sensitive information is a powerful part of this tool, but that power comes with some attendant complexity. Like git-rebase, it still rewrites git history, which can have a negative impact on the project and its community.
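As one illustration, the sketch below builds a git-filter-branch call that removes a single file from every commit while keeping commit boundaries intact. The helper name `filter_branch_command` and the example path are hypothetical. Treat this as a starting point to try on a throwaway clone, not a recommended recipe.

```python
import shlex


def filter_branch_command(path):
    """Build a git filter-branch call that drops `path` from every commit."""
    # The index filter runs once per commit; --ignore-unmatch keeps it from
    # failing on commits that never contained the file.
    index_filter = f"git rm --cached --ignore-unmatch {shlex.quote(path)}"
    return [
        "git", "filter-branch", "--force",
        "--index-filter", index_filter,
        "--tag-name-filter", "cat",  # keep tag names, repoint them at new commits
        "--", "--all",               # rewrite all refs, not just the current branch
    ]


command = filter_branch_command("secrets.env")
```

Passing `command` to `subprocess.run(command, check=True)` inside a repository would perform the rewrite. Every affected commit gets a new hash, so the caveats about rewriting history apply in full.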
A final tool worth pointing out is BFG, which is optimized for common branch filtering operations, making it both fast and pretty easy to use. Like the other two, it lets you remove sensitive information, and like the other two, it does so by rewriting git history.
All of these tools are well worth considering before you release your project more widely. But for the reasons stated above, not all of them are ideal for addressing repository hygiene once a piece of sensitive information has been made public.
Next time, we’ll look at steps you can take when, despite your best efforts, you are in a situation where something went wrong and you need to be reactive instead of proactive.