In Part 1 of this post, we set out some criteria and began to examine a few different "infrastructure as code" tools used for automating the management of servers or application systems. There, we looked at the Day 1 activities: Package, Provision, and Deploy. Now we'll take a closer look at how each tool measures up against the criteria for the Day 2 operations we outlined: Monitor and Upgrade.
Evaluating Day 2 Operations
BOSH
Monitor
BOSH monitors both the VMs (via the BOSH agent installed automatically on each VM) and the processes running on those VMs (via Monit). Every VM managed by BOSH continually sends a heartbeat back to the BOSH director. Should those heartbeats stop for any reason, BOSH will recreate the VM.
Additionally, BOSH ships with several plugins to help integrate with the alerting tools you may already be using. Integrations are available for tools such as PagerDuty, Datadog, CloudWatch, or even good, old-fashioned email. Because the BOSH agent also integrates with Monit on each machine, these features extend to the process level as well. As for the definition of healthy, Monit allows for a number of checks, including whether the process itself is running.
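To give a sense of what that wiring looks like, here is a sketch of the Health Monitor settings in a director manifest. The property names follow the open source bosh release, but verify the exact keys against the version you deploy; the service key and addresses below are placeholders.

```yaml
properties:
  hm:
    resurrector_enabled: true          # recreate VMs whose heartbeats stop
    pagerduty_enabled: true            # send alerts to PagerDuty
    pagerduty:
      service_key: ((pagerduty_service_key))
    email_notifications: true          # good, old-fashioned email
    email_recipients: [ops@example.com]
    smtp:
      from: bosh-hm@example.com
      host: smtp.example.com
      port: 25
```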
The architecture of BOSH also allows you to colocate jobs that collect metrics and send them to either on-premise or hosted monitoring and reporting solutions. For example, we can leverage the bosh-hm-forwarder to send the metrics that BOSH collects, or set up an existing release for a solution such as Prometheus.
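As a sketch of that approach, a BOSH runtime config can colocate a metrics job on every VM the director manages. The release and job names below assume the community Prometheus BOSH release; treat them as an example rather than a prescription.

```yaml
# runtime.yml
releases:
- name: prometheus              # community Prometheus BOSH release (assumed name)
  version: "26.2.0"             # illustrative version
addons:
- name: metrics-collection
  jobs:
  - name: node_exporter         # exporter job shipped with that release (assumed)
    release: prometheus
```

Applying it with `bosh update-runtime-config runtime.yml` means every subsequent deploy colocates the exporter alongside the existing jobs.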
Upgrade
Patching and updates are perhaps BOSH's biggest area of strength. BOSH is very full-featured here, allowing for easy scaling and zero-downtime rolling updates. Let's start with simple scaling. We can turn our 3-node RabbitMQ cluster into a 6-node cluster with a small manifest change and a single CLI command. We update one line in our deployment manifest, specifying 6 instances instead of 3, then run the deploy command again. BOSH takes care of getting the environment into the desired state, spinning up three new VMs. Scaling back down works the same way, changing from 6 to 3.
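For example, the relevant part of a deployment manifest might look like the snippet below; the instance group and release names are illustrative, and the `instances` line is the only one that changes.

```yaml
instance_groups:
- name: rabbitmq-server        # illustrative instance group name
  instances: 6                 # changed from 3; BOSH converges to this count
  jobs:
  - name: rabbitmq-server
    release: rabbitmq
  vm_type: default
  stemcell: default
  azs: [z1, z2, z3]
  networks:
  - name: default
```

A `bosh -d rabbitmq deploy rabbitmq.yml` afterwards brings up the three new VMs.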
The nice thing is, this is the pattern for all upgrade functions. Update the OS version? Just change the stemcell in the manifest and redeploy. If the application needs to be recompiled for the new OS, BOSH does it. Need to change the application version? Provide the new bits for the release, specify it in the manifest, and redeploy. Because BOSH owns the VM and OS provisioning as well as the bits to install, this process is a pretty simple and consistent experience.
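In manifest terms, both kinds of upgrade are just version bumps; the names and versions below are illustrative.

```yaml
stemcells:
- alias: default
  os: ubuntu-xenial            # change the OS line or version to upgrade the OS
  version: "170.9"

releases:
- name: rabbitmq               # illustrative release name
  version: "2.0.0"             # point at the new release version and redeploy
```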
But BOSH goes beyond simple scaling; it does upgrades in a smart way too. The manifest file allows operators to define an update policy: the number of VMs that should update at the same time can be set to perform rolling updates, and canaries can be specified to verify things went well before moving on. When updating the OS, BOSH doesn't change the existing VMs in place; it recreates them in the new state. This is more secure as well, following the "repave" directive from the three Rs of security. With these constructs, BOSH provides a consistent way to safely perform zero-downtime upgrades, as in the update block below.
```yaml
update:
  canaries: 1
  canary_watch_time: 30000-180000
  max_in_flight: 1
  serial: false
  update_watch_time: 30000-180000
```
Ansible
Monitor
Out of the box, Ansible doesn't provide automatic mitigation of outages. Being agentless, there's nothing monitoring your machines or their processes by default. Instead, Ansible provides ways to interact with many popular third-party solutions such as Datadog, Pingdom, and New Relic. Many monitoring services also provide playbooks to help install and configure their agents; some even monitor the Ansible runs themselves.
Much like how we turned to Ansible Galaxy for the RabbitMQ role, we can do the same for monitoring. From there, we can find roles to install and configure metric collection agents or set up on-premise monitoring solutions. This does mean monitoring and auto-healing are up to the operator to implement, but operators can leverage these roles to integrate with existing monitoring solutions.
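As a quick illustration, a playbook that pulls in a Galaxy role for a hosted monitoring agent might look like the sketch below. The role and variable names follow Datadog's published Ansible role, so double-check them before use; the vaulted API key variable is a placeholder.

```yaml
- hosts: rabbitmq
  become: yes
  roles:
    - role: datadog.datadog                            # installed via ansible-galaxy
      vars:
        datadog_api_key: "{{ vault_datadog_api_key }}" # placeholder vaulted secret
```

Installing the role first with `ansible-galaxy install datadog.datadog` and then running the playbook handles the agent rollout.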
Upgrade
Ansible's flexibility offers different ways to handle upgrades. As always, the method is largely dependent on how the playbook is written. In the case of our example, scaling up turns out to be as easy as it was with BOSH. Change the number of instances specified in the gce.yml file from 3 to 6, and rerun the playbook. Thankfully, the GCE module knows how to take care of things. If it didn't, or if the playbook relied on a specified inventory of servers, things would look different.
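The actual files from our example aren't reproduced here, but the change is roughly the sketch below. The task and module parameters are assumptions rather than the real gce.yml, and the gce module also needs the usual project and credential settings.

```yaml
- hosts: localhost
  tasks:
    - name: Ensure six RabbitMQ instances exist in GCE
      gce:
        instance_names: rabbit-1,rabbit-2,rabbit-3,rabbit-4,rabbit-5,rabbit-6  # was rabbit-1..rabbit-3
        machine_type: n1-standard-2
        image: ubuntu-1604-xenial    # illustrative image name
        zone: us-central1-a
        state: present
```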
Scaling down, however, isn't the same. Change from 6 back to 3 and the module doesn't know it has to delete 3 servers. That becomes a separate process. If it gets missed, you might end up with servers you don't need! Also, if we move from GCE to some other IaaS provider, the whole playbook has to be rewritten. So the consistency is limited.
As for upgrading the OS or application, it again comes down to the playbook. In our GCE example, simply changing the image name and redeploying does not upgrade the OS. With BOSH, updating the stemcell in the manifest declaratively defines the desired state; in the Ansible case, we'd have to delete the servers first, and then the playbook would recreate them with the right OS.
It is possible to build extra steps into the playbook, such as checking the OS version first, but that's not something built into Ansible or the GCE module. The same is true for the application itself. The playbook can be written in such a way to install as well as upgrade, but some thought has to be put into writing it that way. Does it just download a new version and install it, or will it replace the existing version? Without the immutable infrastructure concept built in, it's harder to guarantee and will be determined more on a case-by-case basis.
Ansible does have some concept of rolling updates. There is a parameter called `serial` that can be set to define how many hosts to manage at once. There is also something similar to the canary concept: the `max_fail_percentage` parameter will stop the updates once a certain failure threshold is reached. Ansible also supports pre-run and post-run tasks, so checks and rollbacks can be built into playbooks. Again, it's flexible and up to the playbook author to determine how to do zero-downtime updates.
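Putting those pieces together, a rolling-update play might look something like this sketch; the host group, package name, and placeholder drain/restore tasks are assumptions for illustration.

```yaml
- hosts: rabbitmq
  become: yes
  serial: 2                     # update two hosts at a time
  max_fail_percentage: 25       # abort remaining batches if more than 25% of hosts fail
  pre_tasks:
    - name: Take the node out of rotation (placeholder for a real drain step)
      debug:
        msg: "Draining {{ inventory_hostname }}"
  tasks:
    - name: Upgrade the rabbitmq-server package
      apt:
        name: rabbitmq-server
        state: latest
  post_tasks:
    - name: Put the node back into rotation (placeholder)
      debug:
        msg: "Restoring {{ inventory_hostname }}"
```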
One quick note on what constitutes a failure, since Ansible has many error-handling options. A failure in Ansible is determined by the return codes of commands and modules. Ansible provides ways to ignore failures as well as to define "custom" failure scenarios. The main thing to be aware of is that failure is tied to the success or failure of the most recent task run. There is no agent, so there is no way to know the longer-term health of a process.
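For instance, `failed_when` and `ignore_errors` cover the "custom" and ignored cases respectively; the cluster-status check below is a made-up example of defining failure on something other than a return code.

```yaml
tasks:
  - name: Check that the cluster reports running nodes
    command: rabbitmqctl cluster_status
    register: cluster_status
    changed_when: false
    failed_when: "'running_nodes' not in cluster_status.stdout"  # custom failure condition

  - name: Run an optional cleanup step whose failure we can tolerate
    command: /usr/local/bin/cleanup.sh    # hypothetical script
    ignore_errors: yes
```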
Chef
Monitor
Much like Ansible, we have a lot of flexibility when it comes to monitoring solutions, at the cost of an out-of-the-box solution. Where we have auto-healing and alerting with BOSH, Chef leaves it up to the operator to set up and configure a monitoring solution. We can again turn to the community, either installing an agent for a hosted solution like New Relic or setting up something like collectd to integrate with our on-premise metrics collection solution.
Chef does allow the operator to hook into events that happen within the system, such as deployment failures, and a few monitoring solutions, such as Datadog, integrate with this system. This gives the operator a unique view into the metrics of the deployment itself, which can be quite useful. What happens long term, though, depends on how the environment is configured.
Upgrade
It's possible to use Chef for upgrading and scaling applications. However, it's not quite as straightforward as what we've seen with BOSH or even Ansible. In our Chef example, the knife-google plugin provisions and configures all at once. With this paradigm, scaling means running the command additional times with different hostnames to create additional machines. Had we gone the route of building provisioning into the recipe, there might have been a better way to scale VMs. With our example, we'd probably look to scripting to enhance the knife-google command.
Similarly, rolling updates, canaries, and OS upgrades are not native features of Chef. All of these are achievable through custom recipes or a combination of tools, but it's left to the operator to handle on their own.
The Wrap-up
So which tool should you use? As with any tool selection, it depends on the use case and comes down to trade-offs.
Both Chef and Ansible offer a great deal of flexibility, though that flexibility sometimes means certain tasks are manual or require additional components or solutions. BOSH takes a more opinionated approach, embracing best practices for operating distributed software in a cloud environment. It simplifies many operational tasks while sacrificing some flexibility for one-off management scenarios. Patching, upgrades (both app and OS), health monitoring, and self-healing are all provided out of the box. All three tools can achieve zero-downtime deployments, but BOSH provides it as a core concept.
Additionally, BOSH aims to remove uniqueness in environments. It helps ensure consistency and forces a separation of state and infrastructure. Operating system and application management aren’t seen as separate things for BOSH. Stemcells and releases ensure the OS and software versions are explicitly defined. Immutable infrastructure guarantees the machine state will always be as declared. A machine in a BOSH deployment running stemcell version 5 will always be the same as every other machine running stemcell version 5.
Unfortunately, BOSH has a much higher barrier to entry than Ansible and even Chef. There is still a steep learning curve, despite some recent steps to simplify BOSH. Though Ansible uses YAML as well, the BOSH manifest can feel more intimidating than writing playbooks. Plus, BOSH has fewer pre-packaged solutions available than its competitors, and this smaller repository of BOSH releases can be off-putting to adopters who are new to the ecosystem.
Ultimately, though, BOSH treats distributed systems as first-class citizens. A complex server cluster can be contained in a single BOSH release. Ansible and Chef, on the other hand, are more concerned with individual hosts and will likely end up representing a distributed system using multiple roles. So which tool you use will depend on the kind of workloads and architecture you are looking to maintain. Choose wisely.
This post was co-authored with Brian McClain.