The build and release process has always been a central part of good software engineering practice. Specific attention to how software is produced, however, has increased in the past year given the recent surge in supply chain attacks. Full traceability to source and confidence in the integrity of the artifacts your build produces are important defenses against supply chain compromise. Having a fully reproducible build process reduces the need to trust build services: anyone can rebuild the software from the same inputs and compare the result, which makes it easy to confirm that no malicious backdoor has been injected regardless of where the build is performed (dedicated production machine, CI instance or even your laptop).
Reproducible builds can also provide assurances around what software has been and will be shipped. If you know that your build process can be 100% bit-for-bit reproduced when given the same set of build inputs, you can trace any release, past or present, back to source. This is helpful for things like recreating and debugging customer issues when they arise in a release or determining if a developer build system has been compromised.
Reproducible builds require, among other things, a deterministic build script and caching of intermediate build artifacts. This, in turn, requires a thoughtful method for storing those build artifacts, which tends to lead to storage optimizations and improved content availability. Ultimately, implementing reproducible builds encourages a level of overall thoughtfulness when delivering software and improves the general reliability of your software.
Three common interpretations of “reproducible build”
The term “reproducible build” is often overloaded: the definitions people intend and the behavior they get in practice frequently differ. At a high level, there are several reasons why reproducibility is not achieved in practice despite the intention to do so. For example, most build systems were developed in more innocent times or with different priorities, e.g. enabling new users to onboard quickly. Another factor is that modern software systems are extremely complicated and composed of multiple components and layers. Unless you are bootstrapping these complex builds completely from source (a non-trivial effort, to put it mildly), it’s likely that the interaction of these different ecosystems will work against you. The final factor is cost, both in the form of engineering effort and long-term storage of artifacts. So what exactly makes a build reproducible? There are at least three ways to define it: repeatable build, rebuildable build and binary reproducible build.
1. Repeatable build
Repeatable builds control the steps for a build. This implies that the build is scripted in some way and, therefore, can reliably be repeated with some level of guarantee that the same steps will execute in the same order regardless of where or when the build is run. However, repeatable builds are usually time-sensitive. If a repeatable build script is run today and again in the future, it’s not likely that the same execution path will be followed. A major contributing factor to this is that modern software uses a significant amount of layering and third-party or open source components that are typically retrieved at build time from an external system outside of the builders’ control. To make matters more complicated, these dependencies are usually not lifecycle managed and, as a result, often introduce opportunities for the build to behave differently at different points in time. Even when dependencies are pinned to a specific version in a build, there’s no guarantee that those artifacts will be available in the future if the building entity does not control their existence.
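One way to picture what pinning does and does not buy you is a check of downloaded bytes against a recorded digest. The following is a minimal sketch, not any particular package manager's mechanism; the dependency name, the `pins` mapping and the `verify_pinned` helper are hypothetical and exist only for illustration.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(name: str, data: bytes, pins: dict[str, str]) -> bool:
    """Check downloaded bytes against the digest recorded for this dependency.

    Raises KeyError if the dependency was never pinned.
    """
    return hmac.compare_digest(sha256_hex(data), pins[name])


# Hypothetical dependency and lock data, for illustration only.
payload = b"example dependency contents"
pins = {"libexample-1.2.3.tar.gz": sha256_hex(payload)}

print(verify_pinned("libexample-1.2.3.tar.gz", payload, pins))         # True
print(verify_pinned("libexample-1.2.3.tar.gz", payload + b"!", pins))  # False
```

A check like this detects a dependency whose contents changed under the same version number, but it cannot bring back an artifact that has disappeared from the upstream server, which is the availability gap described above.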
Apart from the dependency problem described above, repeatable builds are not truly reproducible because many current tools capture too much of the local system state. This means that differences between build systems and build environments (e.g. timestamps, paths, locale) are often reflected in the final artifact, thus producing different outputs.
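To illustrate how local state leaks into an artifact, and how it can be kept out, here is a small sketch that packs files into a tar archive while normalizing the metadata that usually varies between machines: entry order, modification time, ownership and permissions. The `deterministic_tar` function and the file names are assumptions made for this example, not part of any specific build tool.

```python
import io
import tarfile


def deterministic_tar(files: dict[str, bytes], epoch: int = 0) -> bytes:
    """Pack files into an uncompressed tar whose bytes depend only on the inputs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):           # fixed order, independent of dict order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = epoch               # fixed timestamp instead of "now"
            info.mode = 0o644                # fixed permissions
            info.uid = info.gid = 0          # no local user or group ids
            info.uname = info.gname = ""     # no local user or group names
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


# Hypothetical inputs, supplied in two different orders.
first = deterministic_tar({"src/main.c": b"int main(void){}", "README": b"hi"})
second = deterministic_tar({"README": b"hi", "src/main.c": b"int main(void){}"})
print(first == second)  # True
```

The `epoch` parameter plays the same role as the `SOURCE_DATE_EPOCH` environment variable that many build tools honor: the timestamp becomes an explicit input rather than something read from the clock of whichever machine runs the build.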
Repeatable builds are a good first step toward true reproducibility. They give you confidence when you need to reliably perform the build again or release a build on short notice, for example when urgently issuing a security fix. In this sense, they’re important for developer sanity, but they won’t provide any benefits for security or compliance monitoring, nor will they prevent your build from breaking if third-party components are no longer available.
2. Rebuildable build
Rebuildable builds control all explicit inputs for a build. In addition to having some sort of repeatable build script, a rebuildable build process utilizes infrastructure systems that capture some of the state not otherwise controlled in a repeatable build process, such as intermediate build artifacts and the artifact repositories where they are stored. By controlling the artifact repositories where dependencies are stored, you are able to produce an equivalent (but not identical) artifact at any arbitrary future point in time.
The storage and control of dependencies and intermediate build artifacts is a key aspect of rebuildable builds. Controlling these explicit inputs for a build offers protection against a scenario where build artifacts may become unavailable at a future date. This might happen maliciously or harmlessly, for example when distributions garbage collect old versions of packages when publishing newer versions or when files are moved on a server. Rebuildable builds also improve the reliability of the entire development and release process and support business continuity when delivering software and support to customers. If an old version of software in the field encounters a bug that needs to be reproduced and resolved, an engineer will only be able to provide adequate support if they are able to regenerate the correct version of the software.
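One common way to store mirrored dependencies and intermediate artifacts is content addressing, where each blob is saved under the hash of its own bytes. The sketch below is a deliberately tiny illustration under that assumption; the `ArtifactStore` class and its on-disk layout are hypothetical and stand in for whatever artifact repository you actually operate.

```python
import hashlib
import tempfile
from pathlib import Path


class ArtifactStore:
    """A tiny content-addressed store: each blob is saved under its SHA-256 digest."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store data and return the digest that identifies it."""
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():  # identical content is stored only once
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Return the stored bytes, verifying them against their digest."""
        data = (self.root / digest).read_bytes()
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"stored artifact {digest} is corrupted")
        return data


with tempfile.TemporaryDirectory() as tmp:
    store = ArtifactStore(Path(tmp) / "mirror")
    key = store.put(b"hypothetical dependency tarball")
    print(store.get(key) == b"hypothetical dependency tarball")  # True
```

Because the key is derived from the content, a build that records the digests of its inputs can later fetch exactly those bytes from the mirror, and any tampering or corruption in storage is detected on retrieval.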
Of course, there are tradeoffs to achieving rebuildable build status. The price primarily presents itself in the form of infrastructure costs. If you’re mirroring dependencies and storing intermediate build artifacts, you will need some place to store them, and, at any sort of scale, that storage is not going to be free. The specific cost to store or mirror your dependencies might be a factor in how you move toward implementing rebuildable builds.
3. Binary reproducible build
Binary reproducible builds, also referred to as “deterministic” or “hermetic” builds, control all states of a build. When a binary reproducible build is re-run with the same fully detailed inputs, bit-for-bit identical outputs are generated regardless of who runs the build or where and when it is run. Achieving binary reproducible build status is no small feat. It requires a fully defined build environment with no uncontrolled build inputs and necessitates control over all dependencies and intermediate build artifacts. The amount of work required depends on the language and toolchain in use and tends to scale with the complexity of the system. That effort is also the reason that, despite the fact that many smart people have been working on this problem area for years, a binary reproducible build process is not achieved by default in most workflows. The work, however, is well worth it for the supply chain benefits received in exchange.
Primarily, if your builds are achieving binary reproducibility, you don’t have to implicitly trust the machine where the build is taking place. This is important considering that most modern developers don’t actually control their build environment: they are building on cloud services, CI infrastructure as a service, etc. Decoupling your build from the machine where it runs provides flexibility and peace of mind when choosing a build system. Put simply, on top of the benefits discussed elsewhere in this article, reproducible builds are good for the engineering effectiveness and reliability of your entire project.
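The bit-for-bit criterion can be checked mechanically: build the same inputs in two places and compare the outputs file by file. The following sketch assumes each build writes its outputs into a directory; the helper names, directory names and file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(build_a: Path, build_b: Path) -> list[str]:
    """Return paths that are missing from one tree or whose bytes differ."""
    a, b = tree_digests(build_a), tree_digests(build_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Two hypothetical output directories, e.g. one from CI and one from a laptop.
with tempfile.TemporaryDirectory() as tmp:
    ci, laptop = Path(tmp) / "ci", Path(tmp) / "laptop"
    for out in (ci, laptop):
        out.mkdir()
        (out / "app.bin").write_bytes(b"identical bytes")
    (laptop / "build.log").write_text("built at 12:00\n")
    print(differing_files(ci, laptop))  # ['build.log']
```

An empty result means the two builds are bit-for-bit identical. Any entry in the list points at a file where some uncontrolled input leaked in, or where one of the build machines produced something it should not have.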
TL;DR
The term “reproducible builds” is often overloaded, but truly reproducible builds are those that, given the same inputs, produce identical binary artifacts regardless of the build machine or when the artifact was created. They are an essential part of securing the software supply chain and help ensure that software vendors know exactly what’s being shipped. This enables them to quickly pinpoint vulnerable components and remediate them when a vulnerability or exploit is disclosed. For open source projects, they allow users to verify that the built artifacts match the source code in the repository. Even if it’s not possible for your software to achieve binary reproducible builds immediately, taking small steps toward binary reproducible builds is a worthy effort and benefits the entire software development ecosystem.