In this blog post I would like to talk about automated tests, and specifically the challenges that arise when those tests are unstable. I will discuss this in the context of our development of Log Insight and its agent.
The Log Insight Agent development team has a large automated test suite composed of thousands of small test cases. Early in our team's history we faced a situation where not all of the tests were rock solid: a small portion of them failed intermittently. To understand why this can be a problem, let us first talk a bit about why we need automated tests at all.
The Reason for Automated Tests
There are a number of reasons to have tests, but one of the most important, according to the continuous integration school of thought, is to avoid regressions when developers make changes. If you have an extensive automated test suite with good coverage of your product code and features, then chances are high that any breaking change you make will be caught by the tests. If you made a change, integrated it into the master branch, and a test failed on your continuous integration server, then you know that you broke something and need to fix it. This is one of the cornerstones of continuous integration.
So, what happens when you attempt to practice continuous integration but some of your automated tests are unstable? Imagine that you push a change to the master branch and the continuous integration server starts a job to build and test the product. Suddenly you receive an email saying that a test failed, and you are the likely culprit since it was fine before your change. You rush to figure out what exactly went wrong, investigate the failed test, and find that the failure has nothing to do with your change. Something completely unrelated to the area you touched failed, and the failure is the result of badly written test code (we will get to the possible reasons for intermittent test failures in my next post). You realize that the test failure was a false alarm and you didn't break anything. You re-run the build and test jobs to confirm your hypothesis, and as you expected, this time all the tests pass. Your newly submitted change was fine, but you have just spent an hour investigating an intermittent failure in a test written by someone else, only to find out it was a waste of time.
As time goes by your team will produce more and more automated tests, and if not all of them are 100% stable, then false alarms will occur more and more frequently. What is worse, you never know in advance which failures are false alarms and which are real product problems. If a test fails intermittently it may be an indication of a bad test, but it may also be an intermittently appearing product bug that actually needs to be investigated and fixed. This forces you to analyze all test failures to rule out product bugs, even when the failing test is known to be unstable (who knows why it failed this time). So, every test failure inevitably takes time and effort from your team.
We will now do a short math exercise to understand how much of a problem it can really become over time.
Let us, for example, assume we have a team of 10 developers and each developer creates 3 new test cases per week. Let us further assume that for whatever reason 1 out of 10 new tests is slightly unstable and fails once in 100 executions, i.e. its probability of failure is 0.01.
After a week of development we will have 30 tests, of which 3 are unstable. Assuming that individual test failure probabilities are independent, we can calculate the probability that the entire test suite fails. The chance that a job executing the entire suite will fail is 1-(1-0.01)^3 = 0.0297, or roughly 3% (see the footnote below for the probability calculation details). In other words, about one in 33 executions of the job will have at least one failing test case. Not great, but we can probably live with that. Let us see what happens if our team continues working at the same pace and creates 3 new unstable tests every week. After 4 weeks we will have 12 unstable tests, and the chance that at least one test fails during a job execution is 1-(1-0.01)^12 = 0.1136. In other words, on average one out of 9 job executions will fail on our continuous integration server. That is no longer nice, but maybe we can still tolerate it? We decide to accept it and continue working. After all, we are creating such a comprehensive automated test suite!
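These two numbers are easy to verify with a few lines of Python (the helper name here is mine, purely for illustration):

```python
# Probability that at least one of n independent unstable tests fails,
# given that each one fails with probability p.
def suite_failure_chance(n, p=0.01):
    return 1 - (1 - p) ** n

print(round(suite_failure_chance(3), 4))   # week 1: 3 unstable tests -> 0.0297
print(round(suite_failure_chance(12), 4))  # week 4: 12 unstable tests -> 0.1136
```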
I wrote a few lines of Python code to calculate the probability of test suite failure over time:
DEVS = 10
TESTS_PER_WEEK_PER_DEV = 3
UNSTABLE_PORTION = 0.1
TEST_FAILURE_CHANCE = 0.01

for week in range(1, 52):
    total_unstable_count = week * DEVS * TESTS_PER_WEEK_PER_DEV * UNSTABLE_PORTION
    suite_failure_chance = 1 - pow(1 - TEST_FAILURE_CHANCE, total_unstable_count)
    print(week, "\t", suite_failure_chance)
I then used the results to create this chart:
As expected, the chance of failure grows higher and higher as time goes by. In less than 6 months we get to the point where every other test job fails (50% on the chart), and in a year we reach an almost 80% failure chance. This means that almost every push to the master branch results in a failure, regardless of who did the push and whether the change actually broke anything. It is no longer possible to tell whether a change is good or bad: it could have broken real functionality, or it could be perfectly fine.
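The crossover points can also be computed directly instead of being read off the chart. Here is a minimal sketch, assuming the same parameters as the script above (the function name is my own):

```python
import math

UNSTABLE_TESTS_PER_WEEK = 3  # 10 devs * 3 tests/week * 10% unstable
TEST_FAILURE_CHANCE = 0.01

def weeks_until(threshold):
    # Smallest number of weeks w such that 1 - (1 - p)^(3w) >= threshold.
    tests_needed = math.log(1 - threshold) / math.log(1 - TEST_FAILURE_CHANCE)
    return math.ceil(tests_needed / UNSTABLE_TESTS_PER_WEEK)

print(weeks_until(0.5))  # week when every other test job starts failing -> 23
```

At week 23 (a little over 5 months) the suite crosses the 50% failure mark, which matches the chart.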
In reality, if you allow your tests to be unstable, then as soon as you reach about a 10-20% chance of test job failure, the team will likely start ignoring the failures. People will receive failure reports so often, and so many of them will be false alarms, that the team will stop caring about them because the reports are no longer trustworthy. This is a bad situation to be in for a team that wants to do continuous integration.
So, here is some advice: don't let this happen to your team. Test stability is important in a continuous integration environment. In my next post I will talk about how to avoid test instability and how to regain stability if for some reason you have already crossed the line.
And don’t forget to try Log Insight when you get a chance.
Footnote: here is the probability calculation for the mathematically inclined.
To ease the calculation we will use the probability of success for intermediate results. If the probability that a single unstable test fails is 0.01, then the probability that a single test succeeds is 1-0.01 = 0.99. If we have 3 independent unstable tests, each with a success probability of 0.99, then by the multiplication rule the probability that all 3 succeed is 0.99*0.99*0.99 = 0.970299. Hence the probability that any of the 3 tests fails is 1-0.970299 = 0.029701.