I was chatting with a colleague recently, and he mentioned his team had a couple of failing tests. I asked why his build hadn’t picked up on it and he said it had – they knew the build was broken.
I then asked how many tests were failing. He said he thought a couple. When he investigated he found that a couple had turned into half a dozen and they hadn’t noticed.
And that’s why I say zero tolerance for build breaks. As soon as you know there is something broken you stop paying as much attention, and so one problem becomes two, five, ten… and before you know it, it will take a week of bugfixing before you can even create a stable build.
A broken build should be a stop-the-line situation. Drop everything and get it working again. And that includes test failures. Anything less and you might as well not bother with continuous integration.
This even goes for flickering tests – cases where the build breaks occasionally for no apparent reason. As soon as the developers can’t rely on the build status they will start to ignore it. We had a test that would fail occasionally. I noticed that whenever a build break came in, the team just assumed that test was the cause, and only started investigating if the build broke twice in a row. We eventually tracked it down to a test which set up a TCP connection to a specific port. If two builds ran simultaneously, they might both try to use the same port at the same time and one would fail. We updated it to try other ports if the primary one was busy, and since then all breaks have been genuine. The team jumps on them very quickly again.
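For anyone facing the same kind of flicker, here is a rough sketch of the "try other ports" idea in Python. It is not our actual test code: the function name `bind_with_fallback`, the `attempts` limit and the port numbers are all made up for illustration, and it assumes the test is the side that opens the listening socket.

```python
import socket


def bind_with_fallback(primary_port, attempts=10, host="127.0.0.1"):
    """Listen on primary_port, or on the next free port after it.

    Hypothetical helper: tries primary_port, primary_port + 1, ... and
    returns (socket, port) for the first port that can be bound.
    """
    for port in range(primary_port, primary_port + attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port is busy (for example, another build is using it): try the next one.
            sock.close()
            continue
        sock.listen(1)
        return sock, port
    raise RuntimeError(
        f"no free port in range {primary_port}-{primary_port + attempts - 1}"
    )


if __name__ == "__main__":
    # Two "builds" asking for the same primary port end up on different ports.
    first, first_port = bind_with_fallback(42000)
    second, second_port = bind_with_fallback(first_port)
    print(first_port, second_port)
    first.close()
    second.close()
```

The test then has to hand the returned port to whatever connects to it, rather than hard-coding the number on both sides. Where the exact port doesn't matter, binding to port 0 and letting the operating system pick a free one is another option.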
Zero tolerance. It’s the only way to be sure.