We get a lot of questions from enterprises on the best way to run Pivotal Cloud Foundry. A key topic: how to run the platform across multiple data centers and multiple clouds. To help platform teams accelerate their initial deployment, we detailed several best practices in a recent white paper: Multi-Site Pivotal Cloud Foundry Reference Architecture. In this post, we summarize our recommendations from this paper.
When you decide to standardize all your enterprise application development on Pivotal Cloud Foundry (PCF), the immediate follow-up question is “how do I ensure this platform remains highly available and disaster resilient, at scale?”
Granted, just about every PCF adopter starts with a single site. This is the simplest and fastest option. We publish and maintain handy single-site architectures for all major public and private clouds. One PCF “foundation” also happens to deliver excellent uptime and availability for your applications. For a few enterprises, a single site may serve them well for several years.
The majority, though, end up with a multi-site PCF deployment sooner rather than later. When you’re moving your entire enterprise application estate to a cloud-native model, geographic redundancy and resiliency matter. These organizations sparked our interest in simplifying multi-site patterns, captured in a previous paper.
To build on this excellent work, we will take a closer look at three popular configurations for Pivotal Cloud Foundry across multiple sites:
- An active-active configuration has two fully functional PCF foundations, both deployed and primed with live applications, so either can serve traffic if the other fails.
- An active-passive configuration, by contrast, features one inactive foundation that acts as a “standby” site.
- A “stretched” deployment is a single PCF installation that spans two data centers.
All three configurations are used in production by enterprises like yours. And all three are perfectly wonderful choices, depending on your requirements. Check out the paper for the full technical deep-dive.
Deploying PCF across sites: Multi-Site Pivotal Cloud Foundry Reference Architecture: A Technical Comparison https://t.co/FIbjSYmyyf
— Saqib (SA-QiB) (@svsaq) October 23, 2018
In the meantime, we wanted to summarize a few of the most important best practices discussed in the paper.
1. Start simple.
In general, we recommend that every platform ops team take an iterative “simple first, sophisticated later” approach. That means starting with a single site, then scaling up to additional sites. This is a great option for the majority of PCF adopters.
But it doesn’t always fit. If your organization has prolonged hardware provisioning times, you might just want to do two sites at once. Or if you need geographic redundancy right away, you may need that secondary site immediately. That leads us to tip #2!
2. Active-passive is an excellent default choice.
The active-passive architecture has two PCF foundations, typically deployed in separate data centers. One foundation is “active,” serving as the primary foundation for app traffic and platform services. The other is “passive,” and is only put into active use if the other foundation is unable to serve requests for a given amount of time.
This architecture is a good choice, especially if you can implement a sufficiently fast failover process. This way, you can minimize the costs and complications of day-to-day operation. In fact, we recommend active-passive as the default option for most enterprises’ first production environment that needs a second site for high availability. Let’s explore why.
A common active-passive reference architecture for PCF.
In the image above, the focus of this architecture is a single PCF foundation that is permanently “active.” A second foundation, the “passive” one, can be brought online as needed. Another option to mull over: make both foundations active, but have the secondary site operate at a much smaller scale. You can always scale up in times of distress or increased traffic.
In fact, a passive foundation can be made “cooler” or “warmer,” to suit your preferred balance of cost and uptime. A “cooler” passive foundation (i.e. a PCF foundation deployed at bare-minimum scale) will be more cost-effective but will take longer to bring online when needed than a “warmer” passive foundation. The latter will cost a bit more but will enable faster failover. The warmer the passive foundation, the closer it starts to resemble an active-active architecture.
We’ve seen many customers deploy an active-passive configuration to start. From there, they learn and adjust the “temperature” of the passive foundation.
Active-passive is an excellent way to ensure greater availability without excessively increasing cost and operational complexity. It’s also much easier to change later. You can ratchet capacity up or down to match your desired level of platform availability.
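The failover decision at the heart of an active-passive setup can be reduced to a small amount of logic. The sketch below is a hypothetical Python illustration, not part of PCF or any Pivotal tooling: it assumes you poll the active foundation’s health endpoint on an interval and fail over only after several consecutive failures, so a single transient blip does not trigger a site switch.

```python
def should_fail_over(recent_checks, threshold=3):
    """Return True only when the last `threshold` health checks all failed.

    `recent_checks` holds health-check results in chronological order
    (True = healthy). Requiring consecutive failures filters out
    transient network blips before committing to a failover.
    """
    if len(recent_checks) < threshold:
        return False
    return not any(recent_checks[-threshold:])


# One healthy response resets the streak; three straight failures trip it.
print(should_fail_over([True, False, False, False]))  # True
print(should_fail_over([False, False, True, False]))  # False
```

In practice, the “fail over” action this gates would be a DNS or global load-balancer change that redirects traffic to the passive foundation; the threshold and check interval are one more dial for tuning how “warm” your failover behavior is.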
3. The most successful platform teams aren’t defined by their architecture. They are defined by their methodology.
Your platform architecture can be optimized for any given use case.
In fact, the enterprises that run PCF at the most impressive scale feature different architectures for different environments. They tailor these architectures based on the needs of the business at any given time, for any number of circumstances.
Of course, you need the wisdom to use the right architecture for the right scenario. How do you get to this point? With laser-focus on methodology. The most successful platform teams focus on three areas.
First, they treat their platform as a product. This approach addresses a common misperception that an application platform is a static, inflexible thing. The reality is that your app platform can be changed incrementally, and enhanced to add greater value over time. Thus, any architectural designs can be enhanced after user feedback is received.
Next, the most successful platform teams invest in automation to reduce human toil. Platform automation helps reduce the operational cost of some of the more complex architectures, such as active-active.
Finally, they rely heavily on metrics and data-driven decision making. Platform metrics and data are a topic we’ll return to in a bit. All of these topics are covered in greater depth as part of Pivotal’s Platform Dojos for PCF, which are designed to train platform teams who want to transform how they run software.
4. “Active-active” architectures can make sense when every minute of uptime counts. But consider the trade-offs.
You may be wondering why we don’t default to an “active-active” configuration for PCF. This option features two foundations, each in its own data center. This has to be the best option to achieve the highest level of availability possible, right? Not necessarily.
Yes, you are likely to gain an extra “nine” of availability. But this doesn’t come for free. There are trade-offs you should carefully consider.
Active-active has additional challenges over active-passive, in terms of operational and capacity costs. For example, how will you mitigate the risk of inconsistencies in app data, when the replicated data is being read and written on both sites? These challenges are sometimes worth solving. (And in fact, many of them have been solved in production by companies like yours.)
The point here is to understand the risk-reward ratio. We also aim to use data and metrics as much as possible to pick which architecture will make you happiest over time. The paper goes into more detail about these subjects so that you can have these conversations with your stakeholders.
5. When it comes to backing services, consider your replication needs.
The key question here: what kind of replication of data do your apps need? Do you need to replicate your data across foundations? If not, you have broad flexibility in what service you can expose to your developers.
If you do require replication between two foundations, you have choices here, too. Pivotal Cloud Cache and RabbitMQ support cross-site replication. Partner solutions are also available, like Minio and Redis Labs. All these options offer an efficient way to replicate data without adding downtime.
6. Plan for the future with the right operational data.
This is where platform metrics and data come in handy. Here, there are some new terms, like SLIs, SLOs, and error budgets. These are great tools for tracking whether your platform actually meets your business’ continuity-of-service requirements. If these terms are unfamiliar to you, you’re not alone. They are modern ways to measure the operational performance of distributed systems.
Think of it this way: SLIs, SLOs, and error budgets give you actual metrics to use when your CIO mandates that your services cannot be down for more than a certain amount of time each year.
SLIs and Error Budgets: What These Terms Mean and How They Apply to Your Platform Monitoring Strategy https://t.co/A1wsW7xMaF
— Kristopher Nelson (@krisnelson42) July 7, 2018
SLIs and SLOs are measures and targets of availability over a period of time, such as the uptime of a particular service or app over a month or a year. Think of error budgets as “downtime allowances” that you can “spend” to accelerate change or experiment with new features. Platform operations teams are finding that this approach helps them support the business more effectively at scale.
You’ll also want to keep an eye on familiar metrics, like RTO (recovery time objective) and RPO (recovery point objective). These are useful measures for each incident that affects the availability of your service or app.
For example, let’s say your RTO for a network failure is two hours: when the network fails, it takes two hours to restore the network and the service to which the RTO applies. If this type of incident happens three times a year, that incident alone costs your SLO at least six hours of downtime each year (two hours per incident, three times a year). Since a 99.99% SLO allows only about 53 minutes of downtime per year, this one incident type already puts you below the mark.
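The arithmetic in this example is easy to check. This Python sketch uses the figures above, which are illustrative rather than drawn from any real platform, to compare a 99.99% yearly error budget against three two-hour incidents:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def error_budget_minutes(slo):
    """Allowed downtime per year, in minutes, for an availability SLO (e.g. 0.9999)."""
    return (1 - slo) * MINUTES_PER_YEAR

budget = error_budget_minutes(0.9999)  # about 52.6 minutes per year
spent = 3 * 2 * 60                     # 3 incidents x 2-hour RTO = 360 minutes
print(f"budget={budget:.1f} min, spent={spent} min, within SLO: {spent <= budget}")
```

Six hours of downtime against a roughly 53-minute budget is why the example lands below the 99.99% mark; the same function shows that a 99.9% SLO would allow about 526 minutes per year, enough to absorb those incidents.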
Thinking in Error Budgets: How Pivotal’s Cloud Ops Team Used Service Level Objectives and Other Modern SRE Practices to Improve Outcomes <- great case study from Amber Alston and Deborah Wood on how to modernize an Ops team https://t.co/tHavayEhE9
— Pivotal Cloud Foundry (@pivotalcf) September 18, 2018
You’ll want to read more about this topic outside of this post and the paper, but if you have a good understanding of how to use these measurement tools, you’ll feel more confident in your architecture choices and be able to justify the costs, time-to-delivery of updates, etc. with data.
Your Chosen Reference Architecture is Just One Important Part of a Successful Transformation
Now you know a few of the best practices to help you get started with your PCF deployment. Of course, this is just one facet of your platform-as-a-product design. Remember, platform teams must constantly optimize the platform by refining SLA requirements and SLIs/SLOs, and by continuously learning from experiences and outcomes.
With this guidance in mind, remember that you aren’t confined by your initial decision. Just as your business wants to constantly improve its products and services, you should iteratively improve the platform. You want PCF to continuously improve the value it delivers your internal customers: app developers. That’s why we embrace SLIs and SLOs. These are metrics that help you determine if a change is needed. For instance, if you discover you are “overachieving” your availability requirements, you can make plans to reduce your platform’s redundancy characteristics. If you are underachieving, you can increase redundancy.
Ready to learn more? Download the white paper.
Then, take your knowledge to the next level with this upcoming webinar on Jan 31: Habits of Highly Effective Platform Teams: Unlocking the Value of PCF!