
Combating false positive failures in CI


Scope and the problem

If you have experienced a CI build turning red due to an intermittent failure, such as in a Capybara test, this post is for you. During my last encounter with such an issue, the team and I wanted to strengthen the meaning behind that bright red color: a red build should be an actionable failure indicating a real fault in our application.

Source of inaccuracy

In the context of Rails applications, false positive build failures most often come from Capybara acceptance tests. In its implementation, Capybara relies on a variety of timeouts and waits, which may or may not wait long enough for the expected behavior to take place. This is even more common when CI runs on slower hardware, such as a small EC2 instance.
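A small mitigation worth mentioning up front is to give Capybara more room to synchronize on slow CI hardware. Here is a minimal sketch, assuming a conventional spec_helper setup (note that this setting was named default_wait_time before Capybara 2.5):

# spec/spec_helper.rb (illustrative)
require "capybara/rspec"

# Capybara retries its finders and matchers for up to this many
# seconds before reporting a failure. A higher ceiling on slow CI
# hardware reduces the chance that a legitimate page update is
# reported as a test failure.
Capybara.default_max_wait_time = ENV["CI"] ? 10 : 2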

Step 1: rspec-rerun and the controversy

The first and most impactful step we took was to use the rspec-rerun gem, which, via its rake task, automatically reruns failed tests until they either pass or a retry limit is reached.

Using rspec-rerun is fairly simple. Add it to your Gemfile:

group :development, :test do
  gem "rspec-rerun"
end

Then invoke its rake task from your CI, with the retry count set via an environment variable:

RSPEC_RERUN_RETRY_COUNT=3 rake rspec-rerun:spec
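If you would rather have a plain rake invocation run the retrying task by default, the wiring is roughly the following; this is a sketch, so check the gem's README for the canonical setup:

# Rakefile (sketch)
require "rspec-rerun/tasks"

# Route the default task through rspec-rerun so every CI
# invocation of rake gets the retry behavior.
task default: "rspec-rerun:spec"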

Results:

rspec-rerun immediately lowered the number of false positive build outcomes. The impact of this was huge:

  • The improved signal-to-noise ratio made us pay much closer attention to the build
  • When a build was red, it felt justifiable to stop the task at hand and investigate, since the failure was more likely to be consistent and indicative of a real quality issue

Controversy

The approach rspec-rerun takes to flaky tests is similar to strategies other teams have tried. One alternative technique is to manually flag tests as flaky and run them in a separate build.
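For illustration, that flagging technique can be built from plain RSpec metadata; the :flaky tag name here is my own invention, not a convention of any particular tool:

# Tag examples that are known to be unreliable:
it "updates the dashboard after saving", :flaky do
  # ...
end

The main build then runs rspec --tag ~flaky to exclude them, while a separate build runs rspec --tag flaky, keeping the primary signal clean.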

Given enough skill with Capybara, one can argue that most flaky tests can be fixed rather than merely retried. On the other hand, even with that skill, fixing them costs engineering time, and it may be worthwhile to pursue other changes to the system instead.
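As a hypothetical example of such a fix, consider an assertion that reads the page exactly once versus one that uses a synchronizing Capybara matcher:

# Racy: find waits for .flash to appear, but the text comparison
# runs only once, so content that arrives a moment later still
# fails the assertion.
click_button "Save"
expect(find(".flash").text).to eq("Saved")

# More reliable: have_css re-checks the page until the text
# matches or Capybara's wait time elapses.
click_button "Save"
expect(page).to have_css(".flash", text: "Saved")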

A measurable aspect of rspec-rerun and its impact on our tests is its cost in terms of total build duration. If we were going to rerun tests, we wanted at least some view into the impact of false positives on build duration.

Step 2: Tracking incidence of false positives over time

We wrote a gem to track how often a test shows up as a false positive by logging RSpec results. Here’s how it works:

The rspec_log_formatter gem comes as a pair of formatters:

RSpec.configure do |config|
  config.formatters << RspecLogFormatter::Formatter::Factory.new.build
  config.formatters << RspecLogFormatter::AnalyzerFormatter::Factory.new(
    max_reruns: 3, # ENV['RSPEC_RERUN_RETRY_COUNT'] is also a great choice
    builds_to_analyze: 5
  ).build
end

RspecLogFormatter::Formatter appends to a file in your Rails root (rspec.history), storing a history of pass/fail outcomes that can then be analyzed for many purposes.

RspecLogFormatter::AnalyzerFormatter calculates false-positive rates and can report the likelihood of failure along with the estimated duration penalty to the test suite.

Output includes the following table:

Top 6 flakiest examples
  1) Spec 1 -- 50% (cost: 10.3s)
    * RuntimeError
  2) Spec 2 -- 30% (cost: 8.4s)
    * RuntimeError
  3) Spec 3 -- 34% (cost: 6.7s)
    * RuntimeError
  4) Spec 4 -- 21% (cost: 4.6s)
    * RuntimeError
  5) Spec 5 -- 22% (cost: 3.9s)
    * RuntimeError
  6) Spec 6 -- 12% (cost: 1.7s)
    * RuntimeError

This indicates that the most unreliable example slows the whole suite down by about 10 seconds on average, due to occasionally being rerun multiple times.

How we estimate time cost

Given a history of passes and failures for a test, we use the last pass and last failure of each example to estimate how long that example takes to pass and how long it takes to fail. Here are the parameters for a single RSpec example:

[Figure: parameters of the calculation]

With rspec-rerun, an example can either pass, or fail then pass, or fail, fail, then pass, and so on. Each scenario has a probability and a cost, and the series sums to the following (thanks to Amit Gupta for correcting my math):

[Figure: weighted sum]
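As a sketch of that sum: write p for the per-run failure probability, t_f for the average duration of a failing run, and t_p for the average duration of a passing run (my notation, which may differ from the original figure). Ignoring the rerun cap for simplicity:

% Scenarios: pass (prob 1-p), fail then pass (prob p(1-p)),
% fail, fail, pass (prob p^2(1-p)), and so on. Weighting each
% scenario's cost by its probability and summing the geometric
% series gives:
\mathbb{E}[\mathrm{cost}] = \sum_{k=0}^{\infty} p^{k}(1-p)\,(k\,t_f + t_p) = t_p + t_f\,\frac{p}{1-p}

The expected penalty relative to a reliable test is then the t_f p/(1-p) term, which is presumably what the cost column above estimates.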

When to act on this data

What I like about this approach is that it collects the data and leaves it entirely to your team to act upon. In our case, we occasionally looked for the most costly examples and prioritized a task to make them more reliable. Eventually, once all flaky examples have been made reliable, you can take this gem out and enjoy a trustworthy test suite.

Learn more & respond

The project is on GitHub. Please follow it there for new features and changes.

What do you think of this tool? Do you love it? Do you hate it? Let me know in the comments.