
Announcing the Release of VMware Tanzu RabbitMQ 1.2

The Tanzu RabbitMQ team is excited to announce the general availability of VMware Tanzu RabbitMQ 1.2.

This version contains RabbitMQ 3.9, a milestone core broker release that introduces Streams, a new data structure allowing for replay and higher throughput.

In addition to Streams, Tanzu RabbitMQ 1.2 has some exciting new capabilities and improvements:

  • Warm standby replication (active/passive) for disaster recovery (see full details below)

  • Support for HashiCorp Vault as a source for cluster secrets that can be used instead of Kubernetes secrets

  • Enhanced support for RabbitMQ clusters on Red Hat OpenShift

  • Improvements for monitoring queue health

  • A number of bug fixes and stability improvements (see the release notes)

Enterprise-grade disaster recovery for RabbitMQ

Tanzu RabbitMQ has a new mechanism for disaster recovery that allows customers to easily configure standby replication. This capability provides the following:

  • Fast, data-safe message replication – Uses the latest in RabbitMQ protocols and best practices

  • Automatic downstream cluster protection – Automatically prunes messages that have been processed on the active upstream site

  • Easy setup – No need to calculate or assume message throughput rates to configure message expiration/TTL (time to live)

  • Faster failover – Downstream applications will only see messages that haven’t been processed on the primary site, reducing the time to recover

Previously, users had to manually configure the federation plugin and set appropriate message TTLs based on an estimate of publishing and consumption speed. This introduced risk into a disaster recovery setup, as publishing and consumption rates cannot always be guaranteed in distributed systems.

The new solution makes no assumptions about publishing and consumption speed, queue depth, or memory size; instead, it avoids storing messages in the passive cluster’s queues altogether. It replicates not just the messages but also information about whether each message has been processed, which tells the passive cluster which messages no longer need to be retained.

How does it work?

Here is a breakdown of how the new data replication works:

  1. The user defines, using a policy, which vhosts and queues will be backed up (a sketch of such a policy follows this list).

  2. Every message written to a quorum queue that matches the replication policy on the active cluster is logged into the local replication log. There is one replication log per vhost with a replication policy.

  3. The active cluster also logs consumption metrics at a short, defined interval. The passive cluster uses these to truncate messages that have already been consumed.

  4. Clusters configured as passive followers establish links to the active cluster and register as consumers of the replication log. New entries are pushed to these passive clusters continuously as they are logged. A passive cluster can be linked at any time, and there can be multiple passive clusters per active cluster.

  5. The passive cluster logs these entries into equivalent local replication logs.

  6. When the passive cluster is promoted to active via an API call, it reads the unconsumed messages from its local replication log and writes them into the local queues.
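To make step 1 concrete, here is a minimal sketch of such a policy, expressed as a resource for the RabbitMQ messaging topology operator. The remote-dc-replicate definition key and all names here are illustrative assumptions; consult the Tanzu RabbitMQ documentation for the exact key:

    apiVersion: rabbitmq.com/v1beta1
    kind: Policy
    metadata:
      name: dr-replication
    spec:
      name: dr-replication          # policy name inside RabbitMQ
      vhost: orders                 # the vhost to back up (illustrative)
      pattern: ".*"                 # match every queue name in the vhost
      applyTo: queues
      definition:
        remote-dc-replicate: true   # assumed key marking matched queues for standby replication
      rabbitmqClusterReference:
        name: upstream-active       # name of the active cluster's RabbitmqCluster resource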


As a result of this design, the passive cluster is protected from queue overflow and from data loss caused by premature message deletion under a poorly estimated TTL. It also enjoys near real-time replication (limited only by network speed) and a quick recovery time objective (RTO), since only the unconsumed messages need to be enqueued for processing.

This capability is available as part of Tanzu RabbitMQ. The easiest implementation path is to deploy Tanzu RabbitMQ on Kubernetes, where a new operator automates the configuration of replication on both clusters (active and passive), making setup fast and simple.

An application team can now be given not only self-service creation of their RabbitMQ cluster but also the ability to create their own disaster recovery standby cluster with replication. A task once done by Ops SMEs, often taking weeks to months, can now be completed by each application team in a matter of minutes.
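As a sketch of what that self-service flow can look like, the active cluster could be declared using the open source operator’s additionalPlugins and additionalConfig fields. The plugin name and the standby.replication key below are assumptions about this Tanzu feature, not a confirmed schema:

    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    metadata:
      name: upstream-active
    spec:
      replicas: 3
      rabbitmq:
        additionalPlugins:
          - rabbitmq_standby_replication   # assumed plugin name for this Tanzu feature
        additionalConfig: |
          # assumed key: declare this cluster the active (upstream) side
          standby.replication.operating_mode = upstream

A second cluster declared the same way, but with a downstream operating mode and the upstream’s endpoints, would complete the pair; the new operator takes care of wiring the two together.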

Integration with HashiCorp Vault 

We understand that Tanzu RabbitMQ is one of many different workloads that our users want to manage in their Kubernetes environment. We know that our users typically have a central process for managing secrets, and we would like to help them use the DevSecOps toolchain of their choice. We chose to start with HashiCorp Vault because it is one of the most popular secret stores for microservices and cloud native applications. Vault has several ways to integrate with Kubernetes workloads, and in this version we have integrated our cluster operator with the Vault agent in order to allow users to manage RabbitMQ default user credentials as well as TLS certificates using Vault.

The flow is quite simple and works with a user’s own process to store the secrets in Vault:

  • Install the Vault sidecar injector on the Kubernetes cluster

  • Configure the Vault agent context using the cluster operator API YAML

  • Apply the YAML: the cluster operator creates a RabbitMQ cluster whose pods carry annotations that cause Kubernetes to inject the Vault sidecar container. This means that next to each RabbitMQ node there is a Vault agent, which can authenticate with the Vault server using the Kubernetes service account token and expose the Vault-managed secrets to the pod’s containers as a file system mount.

While RabbitMQ reads the certificates from the file system and needs no additional steps to use a rotated certificate, rotating the default user password requires an additional API call. An extra sidecar does exactly that: it watches the Vault agent’s file mount and rotates the password whenever it changes.

Here is what this can look like in practice.
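The manifest below is a minimal sketch, assuming the cluster operator’s secretBackend.vault fields; the Vault role name and secret paths are illustrative and would follow your own Vault setup:

    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    metadata:
      name: vault-demo
    spec:
      replicas: 3
      secretBackend:
        vault:
          role: rabbitmq                          # Vault role the pods authenticate as (illustrative)
          defaultUserPath: secret/data/rabbitmq   # KV path holding the default user credentials (illustrative)
          tls:
            pkiIssuerPath: pki/issue/cert-issuer  # PKI issuer that generates the TLS certificates (illustrative)

Once applied, the operator annotates the pods for sidecar injection, and the Vault agent renders these secrets into the pod’s file system as described above.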

Monitoring queue health

RabbitMQ exposes health metrics through a Prometheus endpoint. The existing /metrics path exposes both general metrics about the Erlang VM and RabbitMQ-specific metrics about queues, connections, and so on. By default, this endpoint operates in aggregated mode, so only totals can be observed, e.g., the total number of ready messages across all queues. It can be switched to a per-object mode (via a configuration setting, or by using the /metrics/per-object URL), where individual objects can be seen in metric labels, such as queue and exchange names.

However, per-object mode exposes every metric known to the system, which can be unreasonably expensive when there are many objects. That’s why a new /metrics/detailed endpoint was introduced; it allows collecting only the per-object metrics of interest, optionally filtered on a per-vhost basis. Collecting just the number of messages and the number of consumers per queue is significantly faster than collecting everything, yet provides enough information for meaningful monitoring.
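As an illustration, a Prometheus scrape job can target the new endpoint and request only two metric families; the family names below are documented options of the RabbitMQ Prometheus plugin, while the job name and target host are placeholders:

    scrape_configs:
      - job_name: rabbitmq-detailed          # placeholder job name
        metrics_path: /metrics/detailed
        params:
          family:
            - queue_coarse_metrics           # messages ready/unacknowledged per queue
            - queue_consumer_count           # number of consumers per queue
        static_configs:
          - targets:
              - rabbitmq.example.com:15692   # placeholder host; 15692 is RabbitMQ's Prometheus port

The endpoint also accepts a vhost query parameter, which can be passed the same way to scope collection to the virtual hosts that matter.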

In addition, the observability examples are now configured to scrape this new endpoint for a minimal set of per-object metrics, which enables two preconfigured Prometheus alerts for queue health.

Get started

Already a customer? Download the new version. Not yet a customer and want to learn more? Read the documentation.