
Infrastructure as Code is Not Enough: Comparing BOSH, Ansible, and Chef – Part 1

We’ve all heard a thousand times that DevOps is about people and process, not just tools. While it’s true that tools can’t get you the whole way there, they can at least help on your way to achieving DevOps nirvana. Some tools help with build automation and testing (CI/CD). Others may be more about observability (APM). In this post, we’re going to examine a few of the tools designed to achieve “infrastructure as code.” They may be considered configuration management (CM), orchestration, or release engineering tools.

The Tools

The various solutions out there tackle this problem in a few different ways. Regardless of the approach, it almost always involves defining infrastructure through version-controlled, machine-readable files. There is an endless number of options in this space, so we’ll focus on just three for now: BOSH, Chef, and Ansible. But why these three?

First, we tried to select ones that represent different approaches from each other. Ansible is agentless, while BOSH and Chef are not. BOSH is declarative, Chef is imperative, and Ansible can be both. They all use YAML to some degree, but Chef is Ruby-centric, Ansible leverages Python, and BOSH's CLI and Agents are written in Go, while the BOSH Director is written in Ruby.

We also wanted to look at one tool we haven’t seen covered too often: BOSH. While there are plenty of comparisons out there, BOSH is often overlooked. At Pivotal, we are proponents of BOSH and the Cloud Foundry community where it was born. We wanted to see how it holds up against more popular tools like Ansible and Chef. Drawing on a combined 20+ years of experience in the IT and cloud computing spaces, we intend to assess and present each tool’s strengths and weaknesses in a balanced and accurate way.

Finally, a lot of comparisons we’ve seen only discuss pros and cons, or talk at a high level about what each tool can do. We wanted to pick a real-life scenario and put each tool to the test. How does each experience compare with the others? What does each tool do well, and what is it missing? To evaluate them, it’s first important to define what any good tool in this space requires.

The Criteria

To start with, we had to decide what real workload these tools would be managing in our demos. We wanted to use a cluster of servers, and we set out to pick something familiar, preferably with some community-built automation already available for each tool. We settled on building and managing a RabbitMQ cluster, which meets all of these criteria.

The best way to see how each tool performs is to take them all through a full lifecycle for typical application infrastructure. As such, we will evaluate each tool’s capabilities in the following five stages: Package, Provision, Deploy, Monitor, Upgrade. In this first post, we will cover the first three (Day 1) operations in depth. In a second part, we’ll dive into the last two (Day 2) operations.

Let’s take a closer look at what we are looking for from the tools in each of these areas:

Package

This is where the entire workload gets defined. It’s how the tool knows what to put on the server(s). Does the specific operating system matter? What additional software and settings will be required? Which version should be used? How should the software be installed and configured? A key feature here is the ability to define a specific, known, and verifiable version of a software release. It's also important to compare how each handles providing the operating system image.

Provision

Once the package is defined, how does the tool take care of building out the infrastructure to house the application? This is the provisioning step. It’s important to understand how each tool views the servers it’s building out. Are they pets or cattle? Depending on the application, one may actually be preferable. We are usually building out systems, not individual servers. In this case, the ability to support immutable infrastructure is often important.

Provisioning goes beyond the VMs as well. Can the tool handle automating or fully orchestrating supporting network or storage infrastructure? There may also be some overlap here with respect to VM sizing and operating system definition. Are these parameters part of the package definition or set at provision time?

Deploy

We’ve defined a package and set up the underlying infrastructure. Now it’s time to deploy the application. In some cases, the deploy step takes care of provisioning the infrastructure and deploying the application all at the same time, or at least in rapid succession. This is especially true in cases where we use immutable infrastructure.

The important thing here is that the deployment is always consistent. Is a deployment truly idempotent? Does it follow specific steps, or is it more declarative? We would expect the same build version of our application (and OS!) to get deployed everywhere, every time.

Monitor

Our infrastructure and application are up and running. It’s Day 2. How do we monitor it all? Maybe our application is a platform that runs other applications. Or maybe it’s a service that other applications use. We may already have application monitoring solutions, but how do we handle the infrastructure itself? Does the tool handle things for us, or do we need something else too?

Of particular interest are any resiliency features here. What happens if a node goes offline? Does the tool handle self-healing? If so, how? Will it rerun only the app deployment, or create a brand new server? (Back to that immutable infrastructure thing again.) Does it support the ability to define what “healthy” means (like health check endpoints, etc.)?

Upgrade

Finally, we look at the most dreaded of Day 2 activities: the upgrade. How does the tool handle rolling out updates? What about at the operating system level? The ability to support zero downtime upgrades at all levels of infrastructure is very important here. But also, how do I know my upgrade worked? Ideally, the tool provides a way to be sure things work before proceeding. On a related note, how does the tool support scaling? Can I easily turn a 3-node cluster into a 6-node cluster?

The Evaluation

Now we’ll take a closer look at how each tool measures up against these criteria. In this first post, we’ll look at just the Day 1 activities: Package, Provision, and Deploy. (In a follow-up post, we’ll finish off with Day 2 operations: Monitor and Upgrade.) Our code examples can be found here.

BOSH

BOSH is described as a “tool for release engineering, deployment, lifecycle management, and monitoring of distributed systems.” A central server called the BOSH Director is responsible for controlling lifecycle events. It was born in the Cloud Foundry community to install and manage Cloud Foundry instances. However, it can be used to manage any kind of application or infrastructure. Because of its roots, BOSH is particularly capable of handling distributed systems. In fact, BOSH was originally created by two former Google engineers and was heavily influenced by the Borg system used for cluster management inside of Google. Hence the name BOSH: borg++ (r+1=s, g+1=h).

Package

BOSH has a very clear definition of what a release is. It’s worth thinking of the package as a pairing of the release and the stemcell (the OS image). The versions of these two things, along with their configuration, are all put together in a declarative manifest written in YAML.

As part of a release, BOSH expects you to provide the bits to install rather than download them at deployment time. This is because everything gets compiled against the provided stemcell when it’s deployed, guaranteeing consistency.

The best practice for creating a BOSH release is to provide the source code, so that everything is compiled from scratch on the same OS image that it gets deployed on. Compilation can, however, be as simple as taking a tarball of an executable and unpacking it.
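
To make this concrete, here is a minimal sketch of what a packaging script for such a release could look like. The package layout and archive name are illustrative, not taken from the actual cf-rabbitmq release:

# packages/rabbitmq-server/packaging -- runs on a compilation VM built from the stemcell
set -e

# Unpack the vendored RabbitMQ tarball that was added to the release as a blob
tar xf rabbitmq-server/rabbitmq-server-generic-unix-3.6.12.tar.xz

# Copy the unpacked distribution into the directory BOSH archives as the compiled package
cp -a rabbitmq_server-3.6.12/. "${BOSH_INSTALL_TARGET}/"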

The declarative manifest, along with the fact that you provide the bits, means the same thing is deployed every time. Nothing gets changed out from under your feet between deployments.

In our example, the manifest defines a few things. It tells BOSH not just the release that we want to deploy (in this case RabbitMQ), but where to get it and what version to get. It also provides a SHA1 hash of the release to ensure everything is intact and contains what we expect it to.

releases:
- name: cf-rabbitmq
  url: https://bosh.io/d/github.com/pivotal-cf/cf-rabbitmq-release?v=233.0.0
  version: 233.0.0
  sha1: b5624d0528eec6aa4fb62c232687cd4e94644eb1

Additionally, we tell BOSH the name and version of the stemcell that we want to use to compile and run RabbitMQ on.

stemcells:
- alias: trusty
  os: ubuntu-trusty
  version: 3468.1

Finally, we provide some configuration specifics for our deployment. This includes things like the number of instances to deploy and how large they should be. There are also some things specific to RabbitMQ like what plugins to enable.
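
For illustration, that part of the manifest might look something like this. It is a simplified sketch; the exact job and property names depend on the cf-rabbitmq release:

instance_groups:
- name: rmq
  instances: 3
  vm_type: default
  stemcell: trusty
  azs: [z1, z2, z3]
  networks:
  - name: default
  jobs:
  - name: rabbitmq-server
    release: cf-rabbitmq
    properties:
      rabbitmq-server:
        plugins:
        - rabbitmq_management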

Provision

BOSH handles provisioning infrastructure as a part of the core deployment process, but it’s worth taking a separate look at how it does it. BOSH treats infrastructure as immutable. That is, when you’re updating the underlying OS, BOSH will blow away the old VM and create a new one instead of finding the differences between current and desired state.

BOSH provides support for many IaaS providers out of the box. It also provides a standardized interface for creating additional Cloud Provider Interfaces (CPIs). This abstraction defines standard operations needed to interface with the infrastructure. That includes things such as creating a VM, attaching a disk and connecting the VM to the proper network. This simplifies the packaging and deployment process with a consistent experience across providers.

One component of the infrastructure that does need to be handled outside the scope of BOSH is the networking configuration. Things like subnets and firewall rules must be created ahead of time. These are often taken care of using something like Terraform scripts to automate the process.
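
Once that networking exists, it is referenced from the Director’s cloud config rather than from the deployment manifest itself. A rough sketch for GCP follows; the network names, ranges, and zones here are assumptions:

networks:
- name: default
  type: manual
  subnets:
  - range: 10.0.0.0/24
    gateway: 10.0.0.1
    azs: [z1, z2, z3]
    cloud_properties:
      network_name: my-network          # created ahead of time, e.g. with Terraform
      subnetwork_name: my-subnet
      tags: [rmq]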

Deploy

Deployment with BOSH is done the same way as with other CM and orchestration tools: with a CLI command. First, a BOSH environment (director) must be set up with the appropriate OS images (stemcells) uploaded to it. At that point, there are two choices for providing a release in the manifest: point to a release tarball at a specific URL, or use one from a local directory. To use a local version, it must first be uploaded using the bosh upload-release command. In our example, we provided a URL to a specific version, along with a SHA1 hash to verify the release bits. After specifying the release location, deploy with the following command:

bosh -e <environment-name> -d <deployment-name> deploy <path-to-manifest.yaml>

Using the bosh vms command, we can verify the VMs that were created as part of this deployment. Because BOSH knows the desired state as defined in the manifest, it takes care of the VMs, OS, and software deployment all together. Once deployed, BOSH will continue to monitor this state to ensure it remains constant. If a process on the VM crashes, Monit will catch it and restart it. If a VM drops off the network or disappears, BOSH will blow away the instance, if one still exists, and create a new one in its place. It has all it needs to recreate the desired state.
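
For reference, the surrounding CLI steps are small. Something like the following covers the prerequisite uploads and the post-deploy check; the environment, deployment, and file names are placeholders:

# Upload the stemcell referenced in the manifest to the Director
bosh -e my-env upload-stemcell ./bosh-stemcell-3468.1-google-kvm-ubuntu-trusty-go_agent.tgz

# Alternatively, upload a locally built release instead of pointing the manifest at a URL
bosh -e my-env upload-release ./cf-rabbitmq-233.0.0.tgz

# After deploying, list the VMs BOSH created and is now monitoring
bosh -e my-env -d rabbitmq vms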

Ansible

Ansible claims to be a “radically simple IT automation engine” for automating cloud provisioning, configuration management, and application deployment. One of the key features keeping it simple is the fact that it’s agentless. There is no central server to connect to and no agent to install on a server. It simply connects via SSH from a controlling machine (client) to run the automation.

Package

Ansible organizes everything into playbooks. Playbooks are Ansible's way of describing deployment and configuration using YAML. Many Ansible modules are available to do everything from running shell commands or interfacing with the OS package manager to creating VMs in AWS. Ansible does provide some abstractions for things like package managers. While this can help with portability, playbook writers must take special care if they plan to target multiple distros.
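
For instance, a package install task written against Ansible's generic package abstraction might look like this (the package name here is illustrative and not taken from the role we used):

- name: Install RabbitMQ via the OS package manager abstraction
  package:
    name: rabbitmq-server
    state: present
  become: true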

A good example of this is to look at what we used to deploy RabbitMQ. The GCE playbook is responsible for creating the VMs and ensuring they come online. The rabbit playbook then takes the info from the machines that were created and leverages the community-created role to download, install and configure RabbitMQ. It’s worth taking a look at how that community role operates as well.

Provision

Ansible provides flexibility and options in how to approach provisioning. That flexibility, though, can come at the cost of the consistency we’ve been discussing. Ansible makes it easy to support IaaS providers through IaaS-specific modules, with many options available out of the box or from the community. In our example, we use the GCE module:

- name: Launch instances
  gce:
    name: ansible-rabbit
    num_instances: 3
    machine_type: "{{ machine_type }}"
    image: "{{ image }}"
    service_account_email: "{{ service_account_email }}"
    credentials_file: "{{ credentials_file }}"
    project_id: "{{ project_id }}"
    zone: "{{ zone }}"
    tags: rmq
  register: gce

Provisioning acts like any other task in Ansible, which makes it highly customizable. It also means you’re writing and maintaining a playbook for your infrastructure. The modules are IaaS-specific, so if you’re running on multiple providers, you’ll be maintaining multiple playbooks. Playbooks are generally written to look at the existing hosts, gather information about them, and figure out what needs to be done to bring them to the desired state. This is in contrast to BOSH, which will simply delete the old hosts and spin up new ones with all the desired updates.

Where Ansible has the upper hand over BOSH is network management. Subnets, firewall rules and load balancers can be created and modified via playbooks. As we saw, with BOSH they must be created ahead of time, either manually or with an external automation tool such as Terraform.
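
For example, a firewall rule for the cluster could be opened from the same playbook with the gce_net module. This is a sketch; the port list, network name, and source range are assumptions:

- name: Open RabbitMQ ports to instances tagged rmq
  gce_net:
    name: default
    fwname: allow-rabbitmq
    allowed: tcp:5672,15672,25672,4369
    target_tags: ['rmq']
    src_range: ['10.0.0.0/24']
    service_account_email: "{{ service_account_email }}"
    credentials_file: "{{ credentials_file }}"
    project_id: "{{ project_id }}"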

Deploy

Once again, we have a CLI command that runs the deployment. Ansible is agentless, so there is no central server or infrastructure to set up ahead of time. The only prerequisite is that the client can access the target infrastructure via SSH. As mentioned, the target infrastructure itself is specified as part of the playbook, so SSH access is taken care of during provisioning. The following command runs the deployment:

ansible-playbook <path-to-playbook.yaml> --extra-vars "key=value"

The first time the deployment is run, the necessary VMs will be created. In our example, we used the Google Compute Engine (GCE) module to launch the instances. Then, with the “add_host” module, the playbook creates a temporary, in-memory group so that subsequent plays in the playbook can manage the machines in this group. This is how the rabbit.yml playbook takes care of installing RabbitMQ on each of these hosts.

- name: Save host data
  add_host:
    hostname: "{{ item.public_ip }}"
    groupname: gce_instances_ips
  with_items: "{{ gce.instance_data }}"
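
A later play in the same run can then target that in-memory group and apply the community role, roughly like this:

- name: Install and configure RabbitMQ on the new instances
  hosts: gce_instances_ips
  become: true
  roles:
    - jasonroyle.rabbitmq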

What happens the second time the deployment runs? The GCE module recognizes the VMs exist already and does not recreate them. If one is missing, it will create a new VM. However, without an agent, there is nothing to proactively monitor the VMs and determine whether or not one needs to be recreated. The deployment must be run either manually or as part of a scheduled job before a VM would be created again. This example shows how the GCE module works. Each module can act differently and playbooks work independently of each other. As a result, it may be challenging to guarantee idempotence.

The playbook that installs the software itself may also be written in different ways. Thankfully, the role we reference, jasonroyle.rabbitmq, allows a specific version of the software to be specified for download. It could just as easily have downloaded the latest version instead, in which case subsequent deployments might differ from previous ones.

So playbooks may be written in such a way that deployments will always run in a consistent manner. However, with Ansible it's also possible to write them in a less declarative, more sequential manner. In short, deployment consistency is extremely dependent on how the playbook is written. Best practices to ensure idempotency should be followed as much as possible.

Chef

Chef is a configuration management tool, and one of the “infrastructure as code” tools most often referenced in the DevOps community. Like BOSH, it has a central server for maintaining state, but its focus seems to be much more on server configuration.

Package

Like Ansible, Chef has its own way of describing the process of installing and configuring software. Chef's abstractions that help with portability are called cookbooks. Likewise, we turned to the Chef Supermarket for the RabbitMQ cookbook. The cookbook provides the ability to install, configure and cluster RabbitMQ nodes. These cookbooks provide instructions on every step needed for the deployment, much like Ansible. However, instead of YAML, Chef cookbooks are written in a Ruby DSL. This format allows for helper code to handle more complex deployments. For example, the RabbitMQ cookbook allows the operator to specify the package manager version or to download from the RabbitMQ site, as seen here. We also specify a "rabbit_cluster" role that combines specific recipes from the cookbook to define which ones to run on each server:

name "rabbit_cluster"
description "RabbitMQ Cluster"
run_list "recipe[rabbitmq::default]", "recipe[rabbitmq::mgmt_console]", "recipe[rabbitmq::plugin_management]", "recipe[rabbitmq::cluster]"

Provision

Also like Ansible, Chef allows for quite a bit of flexibility in how to create and manage infrastructure. It benefits from a large ecosystem of community cookbooks and knife plugins that can be leveraged for a wide array of infrastructure providers, and cookbooks can be written to do things that BOSH can’t.

In our example, we used the knife-google plugin to stand up a machine, bootstrap it, and assign the appropriate role, all in a single command. Alternatively, it could all have been expressed in a recipe.

Deploy

With the knife-google plugin, a single command takes care of everything. It creates the VMs and assigns the RabbitMQ role to them so that Chef knows the desired state of the servers. The command looks like this:

knife google server create rabbit1 -Z us-east1-b -m n1-standard-1 --gce-project <project-name> -I ubuntu-1604-xenial-v20171028 --node-ssl-verify-mode none --no-node-verify-api-cert -x <ssh-user> -i <path-to-ssh-key> -r 'role[rabbit_cluster]' -T rmq

The knife-google plugin makes provisioning the infrastructure a little more convenient. As a configuration management tool, however, Chef is more focused on bringing hosts to a desired state. Once again, this is in contrast to BOSH’s approach of immutable infrastructure, which may be better suited for managing a fleet of servers as an application unit or cluster.

Like BOSH, Chef does have a central server that will actively work to keep the servers in the desired state. Like Ansible, though, it is primarily focused on figuring out what needs to be done to bring each existing server into that state. This means the deployment of the application and its configuration is fairly consistent. However, Chef won’t go as far as managing the OS or underlying VM itself.
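
To spot-check convergence from a workstation, standard knife commands are enough. For example, using the node and role names from our example:

# Show the run list assigned to one of the nodes
knife node show rabbit1 -r

# List the nodes carrying the cluster role and when they last checked in
knife status "role:rabbit_cluster"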

Evaluating Day 2 Operations

In Part 2, we’ll examine how each of these three tools handles the Day 2 operations covered in the areas we mentioned: Monitor and Upgrade.

This post was co-authored with Brian McClain.