Remote Dependencies, Convenience, Risk and Other Considerations for Operating Distributed Systems

One principle deeply held by the experienced distributed system operators I have worked with is that your software should have no external dependencies beyond the minimum requirements of the base OS: common system libraries, utilities, and the kernel. This approach makes it possible to recreate a distributed system deployment without depending on the outside world. When something goes wrong, you should have control over your own destiny. Relying on any external dependency that is managed or hosted by someone else introduces the risk that something outside your system will affect your ability to restore and recreate the system whenever you need to.

To use a simple metaphor, imagine your system is a tower of Jenga blocks, and it falls over, as Jenga towers inevitably do. When you try to rebuild it, you find that a required block at the base of the tower is missing or unavailable. No matter what you try, you cannot rebuild the tower exactly as it was before. The new tower will behave differently in unexpected ways, and it might topple over because you do not understand how different building blocks behave when combined in a different way.

Some of the original designers of BOSH, the software deployment project for Cloud Foundry (Mark Lucovsky, Vadim Spiwak, Derek Collison), embraced this principle and tried to create a prescriptive framework that encourages this approach. They had experience managing large-scale distributed systems at Google (the web services APIs). Kent Skaar did similar work for the SaaS provider Zendesk. A BOSH release is a software release that references specific versions of multiple software packages, and a BOSH deployment is an instantiation of that release. A deployment can be reconstructed at any time from three things: the deployment configuration (a BOSH deployment manifest), the base OS images (the BOSH stemcells), and the software release (the BOSH packages and job templates for applying configuration). Properly implemented BOSH releases of large-scale distributed systems can therefore be recreated at any point in time without external dependencies, even when the internet is unavailable.
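To make the idea of rebuilding from local pieces concrete, here is a minimal sketch of a pre-flight check that every artifact a deployment needs is already on local disk and unchanged. It is not part of BOSH and does not use BOSH's own formats: the function names, the file names in the usage note, and the choice of SHA-256 checksums are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_or_changed(artifact_dir: Path, expected: dict[str, str]) -> list[str]:
    """List artifacts that are absent locally or whose checksum differs.

    `expected` maps a file name (hypothetical, e.g. a stemcell or release
    tarball) to the checksum recorded when the deployment was last built.
    An empty result means the rebuild needs nothing from the outside world.
    """
    problems = []
    for name, checksum in sorted(expected.items()):
        candidate = artifact_dir / name
        if not candidate.is_file():
            problems.append(f"{name}: missing")
        elif sha256_of(candidate) != checksum:
            problems.append(f"{name}: checksum mismatch")
    return problems
```

Calling `missing_or_changed(Path("artifacts"), recorded_checksums)` before a rebuild, with a directory and checksum table of your own, tells you whether the tower can be rebuilt from the blocks you already hold.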

BOSH does give you the framework hooks to break out of this prescriptive principle and use external dependencies, or at least external dependency formats, if you choose to for convenience or other reasons. For example, Dr Nic Williams recently implemented tooling to use apt packages instead of compiling from source. As another example, some of the Pivotal big data software intentionally targets only CentOS/RHEL and therefore ships only rpm packages rather than compiling Hadoop. The guiding principle is to be mindful of the tradeoffs you are making: convenience versus risk, and tying your release to a single OS distribution.

Examples of the tradeoffs:

  • relying on an externally hosted package source, such as the repositories behind apt-get, means the dependency may be unavailable or may have changed when you need it most
  • relying on Debian packages could prevent someone from using your release unmodified with a CentOS image
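One way to keep these tradeoffs visible is to scan packaging scripts for commands that reach outside the release. The sketch below is a rough heuristic, not a BOSH feature: the pattern table and the function name are hypothetical, and a real script can fetch things in ways these patterns will not catch.

```python
import re

# Hypothetical patterns for commands that usually pull from the network.
EXTERNAL_PATTERNS = {
    "apt": re.compile(r"\bapt(-get)?\s+(install|update)\b"),
    "yum": re.compile(r"\byum\s+install\b"),
    "http download": re.compile(r"\b(curl|wget)\b"),
    "git": re.compile(r"\bgit\s+(clone|fetch|pull)\b"),
}


def find_external_dependencies(script_text):
    """Return (line_number, label, line) for each suspicious script line.

    Blank lines and shell comments are skipped so that commented-out
    commands are not reported.
    """
    findings = []
    for line_number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        for label, pattern in EXTERNAL_PATTERNS.items():
            if pattern.search(stripped):
                findings.append((line_number, label, stripped))
    return findings
```

Running this over each packaging script in a release and listing the findings in the release notes is one cheap way to call out the external dependencies explicitly.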

A recent real-world example demonstrated the risk of an external dependency changing unexpectedly. Cloud Foundry uses the coreos/etcd project to store stateful configuration data for the new Cloud Foundry Health Manager codebase. One of etcd's dependencies (goraft/raft) had a force push to the master branch of its git repository, which overwrote history that git needs in order to work properly. As a result, some users cannot make code modifications to several previous releases of Cloud Foundry without tedious intervention.

A common reaction when learning about Cloud Foundry BOSH is to question the prescriptive guidance to compile from source when the Linux distributions already provide widely used package management systems. My recommendation is to understand the tradeoffs involved and make the best choice for your situation. If your system has external dependencies, call them out explicitly. When your tower inevitably falls over, know how to rebuild it.

Thanks to Jose Hernandez for the Jenga image