I've never worked on a system-level test suite that wasn't flaky. It feels like an inevitability at this point. I'd start a greenfield project promising myself that this time it wouldn't happen, and after a few months I'd be right where I'd been on every other decent-sized project: staring at a flaky test. It left me frustrated, eroded my confidence in the test suite, and felt downright ugly when I tried to patch it up with sleep statements.

Later I learned that an integration test suite is bound to be flaky. This is the cost of all the abstractions, the distributed nature of today's software, and the network calls in between. Something somewhere fails to respond according to contract, and the lack of resiliency throughout the system leads to flaky test suites. It's even more frustrating when the errors are unhelpful.

At Pivotal, the PAS Release Engineering team is responsible for shipping PAS at a sustainable pace, with not only the latest patches and security fixes but also large, sweeping features that bring refactorings and major additions to a heavily distributed codebase contributed to by over 50 teams.

As a result, our integration test suites exhibit flakiness. I consider a test flaky if I can rerun the same failing test with the same inputs and get it to pass. We experience it more than any other team because of the sheer number of pipelines we have – a direct result of the number of products and versions we support. As of today, we are supporting four backport version lines and one forward version across 4-5 different products[1]. Each version and product combination is run through 7 different scenarios, leading to 7 integration test suite runs every single time we want a change to be tested.[2]

Safe to say, our team has seen its fair share of flaky tests in its pipelines. And given the nature of our work, a red pipeline carries a heavy cost: our team is what enables our products to go out to customers, so if a pipeline goes red due to an intermittent failure, the assembly line shuts down until a human intervenes. No bueno!

The Crossroads

I always thought putting the oxygen mask on yourself before the child sitting next to you felt selfish and less altruistic, but my dad taught me why it matters: you don't want to die while putting that oxygen mask on someone else. To be capable of helping others, you first have to help yourself.

Okay, this is a bit of a stretch for a post about intermittent tests. I digress, but I will get to it.

Cloud Foundry test suites are written in Go using Ginkgo, which conveniently provides a flag to retry a failed test in isolation before reporting it as a failure. While Ginkgo gives you that functionality, the docs warn you not to use it to paper over bad tests.
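For context, here is a minimal sketch of what that retry behavior looks like, assuming Ginkgo v1's `-flakeAttempts` flag. The suite and spec below are hypothetical and exist only to demonstrate the mechanism:

```go
// A minimal, hypothetical Ginkgo v1 suite demonstrating the retry behavior.
// Run with:  ginkgo -flakeAttempts=2
package flaky_test

import (
	"testing"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestFlaky(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Flaky Suite")
}

var attempts int

var _ = Describe("a network-dependent component", func() {
	It("responds according to contract (eventually)", func() {
		attempts++
		// Stand-in for an intermittent failure: the first attempt fails,
		// and the retry granted by -flakeAttempts=2 passes.
		Expect(attempts).To(BeNumerically(">", 1))
	})
})
```

With the default single attempt, the spec above fails the suite; with `-flakeAttempts=2` it is rerun in isolation and reported green. That is exactly the convenience the docs caution you not to lean on.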

Now, should a team that's a bottleneck for shipping features out to customers use that flag in its test suites or not?

During my tenure of over two years on the team, we've brought this question up over and over as new team members joined and tried to unravel the answer.

Why You Should NOT Use the Flake Attempts Flag

Intermittent test suites result from non-resilient software. If there were error handling for every possible scenario in our code, we probably wouldn't have intermittent failures. Every time a failure occurs in the system, it's an opportunity for improvement. It's an opportunity to report it to the maintaining team, follow up through a GitHub issue, and see that failure get fixed in the test suite or, better yet, in the underlying code. Better still, it's an opportunity to roll up your sleeves, dive deep into the codebase, and put up a pull request. This is what a good citizen would do.

Every time we had this discussion on our team, we decided not to add the flag to our suites so that we could bubble failures up to the maintainers and the corresponding component teams to get them fixed.

Spoiler Alert: We Actually Added That Flag

After our latest personnel rotations in the team[3], we were bound to have this discussion yet another time. I saw it coming, but one thing was different: we decided to pull the trigger and actually add the flag. We had a deliberate discussion about it; here are some of the reasons:

  1. The PAS RelEng team does not maintain the test suites. Our system's goal is to detect faults, unhandled exceptions, and failures in the system under test; intermittent test failures pollute that goal.

  2. We have added a reporting system built on Honeycomb that detects these failures and shares them with the maintainers automatically (see the sketch after this list).

  3. Reducing the flakiness in our system helps us deliver products to customers reliably.
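This post doesn't cover that reporting system in detail, but conceptually it boils down to emitting an event per failure that maintainers can query. Here is a minimal, hypothetical sketch using Honeycomb's libhoney-go client; the dataset name, field names, and environment variable are assumptions for illustration, not our actual setup:

```go
// A hypothetical sketch of reporting a failed spec to Honeycomb with
// libhoney-go. Field and dataset names are illustrative assumptions.
package reporting

import (
	"os"

	libhoney "github.com/honeycombio/libhoney-go"
)

// ReportFailure sends one event per intermittent failure so maintainers
// can triage it, even when the retried spec ends up green.
func ReportFailure(suite, spec, errMsg string) error {
	libhoney.Init(libhoney.Config{
		WriteKey: os.Getenv("HONEYCOMB_WRITE_KEY"), // hypothetical env var
		Dataset:  "integration-test-failures",      // hypothetical dataset
	})
	defer libhoney.Close()

	ev := libhoney.NewEvent()
	ev.AddField("suite", suite)
	ev.AddField("spec", spec)
	ev.AddField("error", errMsg)
	return ev.Send()
}
```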

Am I happy and satisfied with this? No, but I think it lets us concentrate on the more important problems at hand as Release Engineers: efficiently shipping products out to customers. Handling flakiness in PAS as a platform product is a responsibility our team aims to keep improving on for the best possible day-zero experience.

And to put it bluntly, this is an act of putting on our own oxygen mask before helping the child.


[1] We stopped releasing PASW-2012 when 2.5 was released.

[2] Of course, there is more to it than that. We run each change through canary pipelines, and only fan out once we are happy with a set of fixes together.

[3] At Pivotal, engineers rotate frequently across different teams. A single tenure on one team usually lasts 6-9 months.